New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
if cdt is not used during tracking, sort by idvisit descending when looking for existing visits #17649
Conversation
…e verified if needed
@diosmosis the approach to detect if cdt is used or not is clever! It would probably not detect it if in the past just for a short time cdt was used or so. To fix that it would mean we would need to check the data for every single visitor which is of course not doable. I reckon for a short test to see how it behaves this could do and maybe later we would need to set a timestamp and only activate the feature after say 1 month so if they use cdt in between then we would detect it during a 1 month period. It still wouldn't detect every case though. Maybe it be best to only enable it for new users just to avoid hard to troubleshoot issues later. We could test it though for some users to see what impact it has and then decide if it's worth thinking more about the upgrade. Eg on cloud we can define say top 30 DBs and evaluate where to enable it manually maybe. |
I guess we could also check visit_first_action_time, but then we definitely wouldn't use an index.
Well, technically we're not actually enabling it for old users. They all have it disabled. The command there is just for users to use it manually, should they want to (I guess we'll have to write a faq).
👍 I think I could put a dry-run option into the command then we could run it on a single site and see how long it takes to run. This way we could see if it's do-able in an update. |
@diosmosis not sure I fully understand but would we maybe need to set We could then patch this on cloud and activate this behaviour for few special accounts to see how it behaves. |
This should be in the PR unless I forgot to commit a file. |
It's there, seeing it now. Somehow missed it before not sure why. 👍 I reckon that be great to patch on the cloud and then we see from there re the impact. |
fyi not too important just now but after some performance testing. I was thinking generally when someone uses cdt then we can still use this optimisation as long as the visits stay in order. Like if someone was to import visits via log importer and imports new logs every day then this be fine. Maybe later we can disable this optimisation when:
Because there may be race conditions on occasion where Matomo might create 2 visits in short time we would probably only disable this performance optimisation if the previously recorded To be checked if this would actually work (not sure I thought it through fully). This way the idea would be that we can keep the optimisation enabled more often maybe and it's more accurate as cdt usage doesn't mean that visitIds will be out of order. Mostly visitIds will be out of order if someone is using the offline tracking feature in JS tracker or eg in iOS where a user comes online later on a device for example and then older requests will be tracked (eg up to 24 hours back without needing a token). |
@tsteur that sounds like it could be built on top of this PR. I was also thinking of trying the idvisit range idea when cdt has been used (check last 1000 idvisits, then 1000-5000, then 5000-10000, then just look w/ visit_last_action_time). I'll create a new issue after this is merged. |
Ready for another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a look through the code changes and so far it looks fine.
I guess it's on purpose that the hasVisitDataOutOfOrder
method is actually quite inaccurate by now. It can easily be the case that the upper or lower half of the selected visits might be unordered, but still below/above the middle visit.
I'm not sure what you mean by this? It should check, for instance, if there is an idvisit > than the middle visit, but last action time < than the middle visit. Can you provide a test case that doesn't work? |
Something like this:
The last two records are actually not in order, but still all after the one in the middle |
Oh, I see, yes, that's unfortunate. I'll have to rethink that then. |
@diosmosis it's fine we could ignore this for now as mentioned in slack the only option would be to check for every single visitor whether the visits are in order which is not doable. Because we don't run the command by default it wouldn't be a problem. It be then fixed for new installs and it would provide later an option for instances that have a performance problem with this query which is great. |
This issue is in "needs review" but there has been no activity for 7 days. ping @tsteur @sgiehl @diosmosis @flamisz |
@tsteur @diosmosis wondering if it wouldn't be enough to have a command that simply sets the option to |
fyi I've done some more testing on different installs etc. Actually in those other installs I couldn't reproduce this performance improvement anymore and also not anymore where I was reproducing this originally (even with the same visit). Possible a MySQL update changed something or the data structure or something changed. I'll close this for now as it doesn't look like this will help us and isn't worth any risk of potentially having not accurate data etc. |
Description:
Refs #16904
Sorting by visit_last_action_time when looking for the most recent visit of a visitor can be slow, but sorting by idvisit desc is demonstrably fast. The only problem is, it can only be done when we know that visit rows in the table are in order. That is to say, that the visit_log_action_time values always increase as idvisit increases.
If a user tracks something in the past via the
cdt
parameter, this no longer becomes the case and we can't perform this optimization. This PR sets an option whencdt
is used and, if the option is NOT set, uses theidvisit DESC
optimization.There is also a command that can be used to detect if log_visit is out of order and set the option. It might take a long time to run those queries, so we don't do it automatically or in the included update.
The original problem will still exist for sites where
cdt
was used.Review