Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matomo tracker select queries can get slow when most visits have same visitorId #16904

Closed
tsteur opened this issue Dec 7, 2020 · 29 comments
Closed
Labels
c: Performance For when we could improve the performance / speed of Matomo. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Milestone

Comments

@tsteur
Copy link
Member

tsteur commented Dec 7, 2020

Say you are tracking apps and you send a userId with every tracking request. Then by default the userId will be used to generate the visitorId. It may also be a problem if for every action a new visit is forced.

Queries like this will become very slow:

explain extended SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id, visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_seconds_since_first, visitor_seconds_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region, location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type, idsite, profilable, visit_entry_idaction_url, visit_total_actions, visit_total_interactions, visit_total_searches, referer_url, config_browser_name, config_client_type, config_device_brand, config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip, location_browser_lang, campaign_content, campaign_group, campaign_id, campaign_keyword, campaign_medium, campaign_name, campaign_placement, campaign_source, last_idlink_va, custom_dimension_1, custom_dimension_2, custom_dimension_3, custom_dimension_4, custom_dimension_5, custom_dimension_6, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3, custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '4' AND visit_last_action_time <= '2020-12-07 20:34:48' AND idvisitor = '...'
                ORDER BY visit_last_action_time DESC
                LIMIT 1

Explain:
image

or
image

To workaround this performance issue set [Tracker]enable_userid_overwrites_visitorid=0 in your config/config.ini.php if userId is used. If userId is not used, there is currently no workaround.

Ideally we find a way to not needing this setting though. Maybe we could at least limit time range by adding to the where visit_last_action_time > 1 year before upper limit . It won't make it much better but at least it be slightly faster especially when the table has over the years collected many hundred thousands or millions of requests for the same visitor. In our use case we have for example for the same idVisitor 1.6M entries which is roughly 30% of all visits for the same visitor.

We used to have this lower limit in the where clause but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To make cohorts still useful, how far back would we need to look? 1 year? Or would it be fine to only look back 6 months? I suppose we'd need to look back kind of ideally at least 1 year for the reports to be fine when you look at months for cohorts.

Any other thoughts on how to make this use case faster be appreciated

UPDATE

See my comment further below. This issue is now about changing the existing index according to https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/

@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Dec 7, 2020
@tsteur
Copy link
Member Author

tsteur commented Dec 7, 2020

The where clause with one year back might not really help though because most visits might be within one year (seeing such a DB here). Maybe a config setting forcing look back to be only few days might be better but this would need a config. Ideally it was fixed automatically.

@tsteur
Copy link
Member Author

tsteur commented Dec 7, 2020

A solution might be otherwise a custom index.

@tsteur tsteur added this to the 4.3.0 milestone Mar 1, 2021
@tsteur tsteur added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Mar 1, 2021
@tsteur
Copy link
Member Author

tsteur commented Mar 1, 2021

Putting this into the milestone to see if we can somehow improve this or workaround it when a visitor has hundreds of thousands of visit to still have it fast. It may not be easily possible and that be fine. Ideally we also wouldn't really make it slower for the normal tracking request.

@diosmosis
Copy link
Member

@tsteur

To make cohorts still useful, how far back would we need to look? 1 year?

This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months.

Though, I would expect the query to be fast if requests force new visits is always used? Since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query?

@tsteur
Copy link
Member Author

tsteur commented Mar 1, 2021

In our case all visits are recent like last 1 hour. So it's probably unrelated to how far back we look.

@diosmosis
Copy link
Member

diosmosis commented Mar 1, 2021

Maybe it's the index being updated constantly? Just a guess.

@tsteur
Copy link
Member Author

tsteur commented Mar 1, 2021

@diosmosis I'm not so much into details right now but it was basically fetching thousands of visits every time so I'm assuming the sorting might be the problem or so I'm not really sure. If you're keen to look into this I can let you know where this happens on slack potentially.

@diosmosis
Copy link
Member

Quick idea: Would a local (not synced) file based cache that just saved the idvisit => idvisitor mapping for 30mins be useful? If less than 30 mins, we use the cached value. If 30 mins passes we run the query again. If forcing a new visit, we unset the cache value if it's there.

@tsteur
Copy link
Member Author

tsteur commented Mar 1, 2021

I'm not sure if it would help because

  • you might have heaps of tracker instances and each instance would need its own cache plus they would need to have a way to invalidate each other's cache otherwise there might be outdated information
  • when tracking heaps of traffic then the caches could be quite big. This could be solved though by having a custom nested directory structure for this to make sure to have one cache file per eg configId/visitorId and to never have too many files in one directory (otherwise things get slow)

@tsteur
Copy link
Member Author

tsteur commented Mar 2, 2021

FYI one (reactive) workaround that might help is adding a third column to the index idsite idvisitor index like this (not fully tested):

                drop index `index_idsite_idvisitor` on log_visit;
                CREATE INDEX `index_idsite_idvisitor` ON log_visit (idsite, idvisitor, visit_last_action_time DESC)

In my case this improved the performance from 0.13436800s to 0.00180650s

@tsteur
Copy link
Member Author

tsteur commented Mar 2, 2021

We could also have a setting to block new visits for a specific visitor after for example say 500 visits. However, this might make it hard for people to troubleshoot this issue and sometimes you will want to track that many visits. It's not uncommon that if you have daily visitors over a 1-2 years to reach such a number. So it can't really be a solution. Maybe there could be some upper limit though like 10000 visits or so.

@diosmosis
Copy link
Member

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

One of the problems that makes this query so hard to optimise is the requirement with the visit_last_action_time because we need to find the correct last visit before that timestamp. Maybe, if the tracking request doesn't provide a custom timestamp we could think about optimizing this query. This would be the case for most of the requests. In that case we would be mostly looking for the most recent match. Possible though that this still might not be 100% possible because visits might be out of order or it might require an index update etc.

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.

Was wondering the same but I don't think it would help too much maybe because we're using the idvisitor index here and it means it would still need to look at all the visits. It would depend if the sorting is the slow CPU consuming part, then it might help. Although often we then might need to select twice which may make it overall a bit slower again maybe.

@diosmosis
Copy link
Member

Is this something we can test or is high concurrency a contributing factor and it's only slow when we're actively tracking?

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

fyi @diosmosis I've profiled the query and here's the output for a visitor with 20K visits. The big part is indeed creating the sort index. The where was WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39'

image

I've then adjusted the where to only look at the last 2 weeks (WHERE idsite = '3' AND visit_last_action_time > '2021-05-20 19:31:39' AND visit_last_action_time <= '2021-06-04 19:31:39'). The sorting took longer for some reason. I executed it multiple times and it was always slower and took consistently 0.07*.

image

I then executed the first query again where I look at all visits. It took consistently 0.065 and was always slightly faster surprisingly. Would need to check what is all part of Creating sort index and whether it includes the where or something.

I then ran the query without any time in the where and it didn't really make a difference (used WHERE idsite = '3' AND idvisitor = @v1 ORDER BY visit_last_action_time DESC).

Then used ORDER BY idsite, visit_last_action_time DESC (as there is such an index) and it seemed slightly faster with 0.06 but not sure it's actually faster.

To have reliable numbers would need to execute all the queries many times and look at like 95th percentile or so.

Having no order clause made the sorting part in the profile disappear (makes sense) but of course sending data took ages

image

It should mean though maybe the less entries we select the faster the sorting should be.

Where looking eg first only at least week is in the end faster or not really depends on the workload. If most visitors have say < 100 visits then it might make it slower. Also depends how much time visits usually have between visits etc. So overall it really almost depends on a per site basis and what kind of data we track. Probably changing it would make it worse for most users since there a visitor typically has only a few visits and often they don't visit for a week but then again after few weeks.

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

Another option would be sorting in PHP as PHP can be scaled easily

SELECT SQL_NO_CACHE visit_last_action_time, idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = ...

again when there are thousands of visits the problem might be though that we'll be sending and transferring a lot of data.

So far still can only think of adjusting the index as mentioned in #16904 (comment) but this of course affects all writes.

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

fyi also tested below query thinking maybe it improves as there is less data to sort or so but it's not improving the performance


SELECT SQL_NO_CACHE idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1  ORDER BY idsite, visit_last_action_time DESC
                LIMIT 1

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

What did make the query fast is below query. But this will only work if the visits are actually in order and the highest idivist is the most recent visit

SELECT SQL_NO_CACHE idvisit  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '3'  AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1  ORDER BY  idvisit DESC
                LIMIT 1

image

@tsteur
Copy link
Member Author

tsteur commented Jun 2, 2021

If I see this right: Basically, if we could cache whether they ever used a custom timestamp for a site during tracking or not then we could optimise the query big big times by sorting idvisit desc instead. As soon as they use a custom timestamp in a tracking request we would be no longer able to use it though and fall back to slow. Most users would maybe not send a custom timestamp.

@tsteur
Copy link
Member Author

tsteur commented Jun 4, 2021

fyi tested adding an index idsite, visit_last_action_time DESC but this seems to not have helped.

@tsteur
Copy link
Member Author

tsteur commented Jun 23, 2021

We can't reproduce the performance improvement anymore when sorting by idvisit instead of visit_last_action_time so this doesn't seem to be a solution. For more information see #17649 (comment)

@tsteur tsteur added the c: Documentation For issues related to in-app product help messages, or to the Matomo knowledge base. label Jul 22, 2021
@tsteur
Copy link
Member Author

tsteur commented Jul 22, 2021

Let's document this problem and the solution. Once done, let's close it as wontfix.

@tsteur
Copy link
Member Author

tsteur commented Oct 7, 2021

@peterhashair I think you documented this one? Is it fully published?

@peterhashair peterhashair self-assigned this Oct 7, 2021
@peterhashair
Copy link
Contributor

@tsteur just published it. This issue Should be good to go.

@justinvelluppillai
Copy link
Contributor

@peterhashair can you please post a link to the new document and close this issue?

@peterhashair
Copy link
Contributor

@tsteur tsteur closed this as completed Oct 7, 2021
@justinvelluppillai justinvelluppillai added the wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. label Nov 29, 2021
@tsteur tsteur modified the milestones: 4.6.0, 5.0.0 Dec 20, 2021
@tsteur tsteur removed c: Documentation For issues related to in-app product help messages, or to the Matomo knowledge base. wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. labels Dec 20, 2021
@tsteur
Copy link
Member Author

tsteur commented Dec 20, 2021

I'll reopen this issue for Matomo 5.0.

In Matomo 5.0 we will change the existing index according to this FAQ: https://matomo.org/faq/troubleshooting/how-do-i-improve-the-load-time-of-the-find-visitor-sql-query-when-visitors-have-many-visits/

so the index looks like this:

`index_idsite_idvisitor` ON matomo_log_visit (idsite, idvisitor, visit_last_action_time DESC)

We've been testing this index on the cloud and while it made writes slightly slower the reads are a lot faster. This query was basically top load contributor previously

@tsteur
Copy link
Member Author

tsteur commented May 10, 2022

This was already solved in #18636 and the update script for Matomo 5 is already merged causing an index change is there already. We only need to close it as part of Matomo 5 release to make sure it appears in changelog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Performance For when we could improve the performance / speed of Matomo. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Projects
None yet
Development

No branches or pull requests

6 participants