Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

if cdt is not used during tracking, sort by idvisit descending when looking for existing visits #17649

Closed
wants to merge 16 commits into from

Conversation

diosmosis
Copy link
Member

Description:

Refs #16904

Sorting by visit_last_action_time when looking for the most recent visit of a visitor can be slow, but sorting by idvisit desc is demonstrably fast. The only problem is, it can only be done when we know that visit rows in the table are in order. That is to say, that the visit_log_action_time values always increase as idvisit increases.

If a user tracks something in the past via the cdt parameter, this no longer becomes the case and we can't perform this optimization. This PR sets an option when cdt is used and, if the option is NOT set, uses the idvisit DESC optimization.

There is also a command that can be used to detect if log_visit is out of order and set the option. It might take a long time to run those queries, so we don't do it automatically or in the included update.

The original problem will still exist for sites where cdt was used.

Review

  • Functional review done
  • Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

@diosmosis diosmosis marked this pull request as draft June 5, 2021 03:51
@tsteur
Copy link
Member

tsteur commented Jun 8, 2021

@diosmosis the approach to detect if cdt is used or not is clever! It would probably not detect it if in the past just for a short time cdt was used or so. To fix that it would mean we would need to check the data for every single visitor which is of course not doable. I reckon for a short test to see how it behaves this could do and maybe later we would need to set a timestamp and only activate the feature after say 1 month so if they use cdt in between then we would detect it during a 1 month period. It still wouldn't detect every case though. Maybe it be best to only enable it for new users just to avoid hard to troubleshoot issues later.

We could test it though for some users to see what impact it has and then decide if it's worth thinking more about the upgrade. Eg on cloud we can define say top 30 DBs and evaluate where to enable it manually maybe.

@diosmosis
Copy link
Member Author

diosmosis commented Jun 8, 2021

@tsteur

It would probably not detect it if in the past just for a short time cdt was used or so.

I guess we could also check visit_first_action_time, but then we definitely wouldn't use an index.

Maybe it be best to only enable it for new users just to avoid hard to troubleshoot issues later.

Well, technically we're not actually enabling it for old users. They all have it disabled. The command there is just for users to use it manually, should they want to (I guess we'll have to write a faq).

We could test it though for some users to see what impact it has and then decide if it's worth thinking more about the upgrade. Eg on cloud we can define say top 30 DBs and evaluate where to enable it manually maybe.

👍 I think I could put a dry-run option into the command then we could run it on a single site and see how long it takes to run. This way we could see if it's do-able in an update.

@tsteur
Copy link
Member

tsteur commented Jun 9, 2021

@diosmosis not sure I fully understand but would we maybe need to set has_used_cdt_when_tracking option to 1 during the upgrade as otherwise it would be empty by default?

We could then patch this on cloud and activate this behaviour for few special accounts to see how it behaves.

@diosmosis
Copy link
Member Author

@tsteur

not sure I fully understand but would we maybe need to set has_used_cdt_when_tracking option to 1 during the upgrade as otherwise it would be empty by default?

This should be in the PR unless I forgot to commit a file.

@tsteur
Copy link
Member

tsteur commented Jun 10, 2021

It's there, seeing it now. Somehow missed it before not sure why. 👍 I reckon that be great to patch on the cloud and then we see from there re the impact.

@diosmosis diosmosis added the Needs Review PRs that need a code review label Jun 10, 2021
@diosmosis diosmosis added this to the 4.4.0 milestone Jun 10, 2021
@diosmosis diosmosis marked this pull request as ready for review June 10, 2021 03:49
@tsteur
Copy link
Member

tsteur commented Jun 14, 2021

fyi not too important just now but after some performance testing. I was thinking generally when someone uses cdt then we can still use this optimisation as long as the visits stay in order. Like if someone was to import visits via log importer and imports new logs every day then this be fine.

Maybe later we can disable this optimisation when:

  • A new visit is being created
  • And there was a previous visit (meaning we have the time visit_last_action_time)
  • And if the last visit action time is newer than the current time then we need to disable this optimisation logic.

Because there may be race conditions on occasion where Matomo might create 2 visits in short time we would probably only disable this performance optimisation if the previously recorded visit_last_action_time is more than 5 minutes newer or so.

To be checked if this would actually work (not sure I thought it through fully). This way the idea would be that we can keep the optimisation enabled more often maybe and it's more accurate as cdt usage doesn't mean that visitIds will be out of order.

Mostly visitIds will be out of order if someone is using the offline tracking feature in JS tracker or eg in iOS where a user comes online later on a device for example and then older requests will be tracked (eg up to 24 hours back without needing a token).

@diosmosis
Copy link
Member Author

@tsteur that sounds like it could be built on top of this PR. I was also thinking of trying the idvisit range idea when cdt has been used (check last 1000 idvisits, then 1000-5000, then 5000-10000, then just look w/ visit_last_action_time). I'll create a new issue after this is merged.

@diosmosis
Copy link
Member Author

Ready for another review

Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look through the code changes and so far it looks fine.
I guess it's on purpose that the hasVisitDataOutOfOrder method is actually quite inaccurate by now. It can easily be the case that the upper or lower half of the selected visits might be unordered, but still below/above the middle visit.

@diosmosis
Copy link
Member Author

@sgiehl

I guess it's on purpose that the hasVisitDataOutOfOrder method is actually quite inaccurate by now. It can easily be the case that the upper or lower half of the selected visits might be unordered, but still below/above the middle visit.

I'm not sure what you mean by this? It should check, for instance, if there is an idvisit > than the middle visit, but last action time < than the middle visit. Can you provide a test case that doesn't work?

@sgiehl
Copy link
Member

sgiehl commented Jun 15, 2021

Something like this:

idvisit visit_last_action_time
1 2021-05-01 10:00:00
2 2021-05-02 10:00:00
3 2021-05-03 10:00:00
4 2021-05-04 10:00:00
5 2021-05-05 10:00:00
6 2021-05-07 10:00:00
7 2021-05-06 10:00:00

The last two records are actually not in order, but still all after the one in the middle

@diosmosis
Copy link
Member Author

Oh, I see, yes, that's unfortunate. I'll have to rethink that then.

@tsteur
Copy link
Member

tsteur commented Jun 15, 2021

@diosmosis it's fine we could ignore this for now as mentioned in slack the only option would be to check for every single visitor whether the visits are in order which is not doable. Because we don't run the command by default it wouldn't be a problem. It be then fixed for new installs and it would provide later an option for instances that have a performance problem with this query which is great.

@github-actions
Copy link
Contributor

This issue is in "needs review" but there has been no activity for 7 days. ping @tsteur @sgiehl @diosmosis @flamisz

@github-actions github-actions bot added the Stale The label used by the Close Stale Issues action label Jun 23, 2021
@sgiehl
Copy link
Member

sgiehl commented Jun 23, 2021

@tsteur @diosmosis wondering if it wouldn't be enough to have a command that simply sets the option to 0. If the command says clearly that it should only be done if no visits had been imported (for the log data that is still in the database) it might maybe solve the needs without too many logic.

@tsteur
Copy link
Member

tsteur commented Jun 23, 2021

fyi I've done some more testing on different installs etc. Actually in those other installs I couldn't reproduce this performance improvement anymore and also not anymore where I was reproducing this originally (even with the same visit). Possible a MySQL update changed something or the data structure or something changed. I'll close this for now as it doesn't look like this will help us and isn't worth any risk of potentially having not accurate data etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review Stale The label used by the Close Stale Issues action
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants