@diosmosis opened this Pull Request on June 5th 2021 Member

Description:

Refs #16904

Sorting by visit_last_action_time when looking for the most recent visit of a visitor can be slow, but sorting by idvisit desc is demonstrably fast. The only problem is, it can only be done when we know that visit rows in the table are in order. That is to say, that the visit_log_action_time values always increase as idvisit increases.

If a user tracks something in the past via the cdt parameter, this no longer becomes the case and we can't perform this optimization. This PR sets an option when cdt is used and, if the option is NOT set, uses the idvisit DESC optimization.

There is also a command that can be used to detect if log_visit is out of order and set the option. It might take a long time to run those queries, so we don't do it automatically or in the included update.

The original problem will still exist for sites where cdt was used.

Review

  • [ ] Functional review done
  • [ ] Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • [ ] Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • [ ] Security review done see checklist
  • [ ] Code review done
  • [ ] Tests were added if useful/possible
  • [ ] Reviewed for breaking changes
  • [ ] Developer changelog updated if needed
  • [ ] Documentation added if needed
  • [ ] Existing documentation updated if needed
@tsteur commented on June 8th 2021 Member

@diosmosis the approach to detect if cdt is used or not is clever! It would probably not detect it if in the past just for a short time cdt was used or so. To fix that it would mean we would need to check the data for every single visitor which is of course not doable. I reckon for a short test to see how it behaves this could do and maybe later we would need to set a timestamp and only activate the feature after say 1 month so if they use cdt in between then we would detect it during a 1 month period. It still wouldn't detect every case though. Maybe it be best to only enable it for new users just to avoid hard to troubleshoot issues later.

We could test it though for some users to see what impact it has and then decide if it's worth thinking more about the upgrade. Eg on cloud we can define say top 30 DBs and evaluate where to enable it manually maybe.

@diosmosis commented on June 8th 2021 Member

@tsteur

It would probably not detect it if in the past just for a short time cdt was used or so.

I guess we could also check visit_first_action_time, but then we definitely wouldn't use an index.

Maybe it be best to only enable it for new users just to avoid hard to troubleshoot issues later.

Well, technically we're not actually enabling it for old users. They all have it disabled. The command there is just for users to use it manually, should they want to (I guess we'll have to write a faq).

We could test it though for some users to see what impact it has and then decide if it's worth thinking more about the upgrade. Eg on cloud we can define say top 30 DBs and evaluate where to enable it manually maybe.

:+1: I think I could put a dry-run option into the command then we could run it on a single site and see how long it takes to run. This way we could see if it's do-able in an update.

@tsteur commented on June 9th 2021 Member

@diosmosis not sure I fully understand but would we maybe need to set has_used_cdt_when_tracking option to 1 during the upgrade as otherwise it would be empty by default?

We could then patch this on cloud and activate this behaviour for few special accounts to see how it behaves.

@diosmosis commented on June 10th 2021 Member

@tsteur

not sure I fully understand but would we maybe need to set has_used_cdt_when_tracking option to 1 during the upgrade as otherwise it would be empty by default?

This should be in the PR unless I forgot to commit a file.

@tsteur commented on June 10th 2021 Member

It's there, seeing it now. Somehow missed it before not sure why. 👍 I reckon that be great to patch on the cloud and then we see from there re the impact.

@tsteur commented on June 14th 2021 Member

fyi not too important just now but after some performance testing. I was thinking generally when someone uses cdt then we can still use this optimisation as long as the visits stay in order. Like if someone was to import visits via log importer and imports new logs every day then this be fine.

Maybe later we can disable this optimisation when:

  • A new visit is being created
  • And there was a previous visit (meaning we have the time visit_last_action_time)
  • And if the last visit action time is newer than the current time then we need to disable this optimisation logic.

Because there may be race conditions on occasion where Matomo might create 2 visits in short time we would probably only disable this performance optimisation if the previously recorded visit_last_action_time is more than 5 minutes newer or so.

To be checked if this would actually work (not sure I thought it through fully). This way the idea would be that we can keep the optimisation enabled more often maybe and it's more accurate as cdt usage doesn't mean that visitIds will be out of order.

Mostly visitIds will be out of order if someone is using the offline tracking feature in JS tracker or eg in iOS where a user comes online later on a device for example and then older requests will be tracked (eg up to 24 hours back without needing a token).

@diosmosis commented on June 14th 2021 Member

@tsteur that sounds like it could be built on top of this PR. I was also thinking of trying the idvisit range idea when cdt has been used (check last 1000 idvisits, then 1000-5000, then 5000-10000, then just look w/ visit_last_action_time). I'll create a new issue after this is merged.

@diosmosis commented on June 15th 2021 Member

Ready for another review

@diosmosis commented on June 15th 2021 Member

@sgiehl

I guess it's on purpose that the hasVisitDataOutOfOrder method is actually quite inaccurate by now. It can easily be the case that the upper or lower half of the selected visits might be unordered, but still below/above the middle visit.

I'm not sure what you mean by this? It should check, for instance, if there is an idvisit > than the middle visit, but last action time < than the middle visit. Can you provide a test case that doesn't work?

@sgiehl commented on June 15th 2021 Member

Something like this:

idvisit visit_last_action_time
1 2021-05-01 10:00:00
2 2021-05-02 10:00:00
3 2021-05-03 10:00:00
4 2021-05-04 10:00:00
5 2021-05-05 10:00:00
6 2021-05-07 10:00:00
7 2021-05-06 10:00:00

The last two records are actually not in order, but still all after the one in the middle

@diosmosis commented on June 15th 2021 Member

Oh, I see, yes, that's unfortunate. I'll have to rethink that then.

@tsteur commented on June 15th 2021 Member

@diosmosis it's fine we could ignore this for now as mentioned in slack the only option would be to check for every single visitor whether the visits are in order which is not doable. Because we don't run the command by default it wouldn't be a problem. It be then fixed for new installs and it would provide later an option for instances that have a performance problem with this query which is great.

@github-actions[bot] commented on June 23rd 2021 Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @tsteur @sgiehl @diosmosis @flamisz

@sgiehl commented on June 23rd 2021 Member

@tsteur @diosmosis wondering if it wouldn't be enough to have a command that simply sets the option to 0. If the command says clearly that it should only be done if no visits had been imported (for the log data that is still in the database) it might maybe solve the needs without too many logic.

@tsteur commented on June 23rd 2021 Member

fyi I've done some more testing on different installs etc. Actually in those other installs I couldn't reproduce this performance improvement anymore and also not anymore where I was reproducing this originally (even with the same visit). Possible a MySQL update changed something or the data structure or something changed. I'll close this for now as it doesn't look like this will help us and isn't worth any risk of potentially having not accurate data etc.

This Pull Request was closed on June 23rd 2021
Powered by GitHub Issue Mirror