@tsteur opened this Issue on December 7th 2020 Member

Say you are tracking apps and send a userId with every tracking request. By default the userId is then used to generate the visitorId. It can also become a problem if a new visit is forced for every action.

Queries like this will become very slow:

                explain extended SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id,
                    visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_seconds_since_first,
                    visitor_seconds_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region,
                    location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type,
                    idsite, profilable, visit_entry_idaction_url, visit_total_actions, visit_total_interactions,
                    visit_total_searches, referer_url, config_browser_name, config_client_type, config_device_brand,
                    config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip,
                    location_browser_lang, campaign_content, campaign_group, campaign_id, campaign_keyword,
                    campaign_medium, campaign_name, campaign_placement, campaign_source, last_idlink_va,
                    custom_dimension_1, custom_dimension_2, custom_dimension_3, custom_dimension_4, custom_dimension_5,
                    custom_dimension_6, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3,
                    custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5
                FROM log_visit FORCE INDEX (index_idsite_idvisitor)
                WHERE idsite = '4' AND visit_last_action_time <= '2020-12-07 20:34:48' AND idvisitor = '...'
                ORDER BY visit_last_action_time DESC
                LIMIT 1

Explain:
[screenshot: EXPLAIN output]

or
[screenshot: EXPLAIN output]

To work around this performance issue, set [Tracker]enable_userid_overwrites_visitorid=0 in your config/config.ini.php if userId is used. If userId is not used, there is currently no workaround.
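For reference, the workaround setting goes into the `[Tracker]` section of `config/config.ini.php`:

```ini
[Tracker]
; Stop deriving the visitorId from the userId, which avoids the expensive
; visitor-matching query described above
enable_userid_overwrites_visitorid = 0
```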

Ideally we'd find a way to not need this setting, though. Maybe we could at least limit the time range by adding a visit_last_action_time > (1 year before the upper limit) condition to the WHERE clause. It won't make it much better, but it would be slightly faster, especially when the table has collected many hundreds of thousands or millions of requests for the same visitor over the years. In our use case, for example, a single idVisitor has 1.6M entries, which is roughly 30% of all visits.
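The lower bound would turn the WHERE clause into something like this (sketch; the one-year cut-off is illustrative):

```sql
SELECT ... FROM log_visit FORCE INDEX (index_idsite_idvisitor)
WHERE idsite = '4'
  AND idvisitor = '...'
  AND visit_last_action_time >  '2019-12-07 20:34:48'  -- 1 year before the upper limit
  AND visit_last_action_time <= '2020-12-07 20:34:48'
ORDER BY visit_last_action_time DESC
LIMIT 1
```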

We used to have this lower limit in the WHERE clause, but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To keep cohorts useful, how far back would we need to look: 1 year? Or would looking back only 6 months be fine? I suppose we'd ideally need to look back at least 1 year for the reports to be correct when you look at months for cohorts.

Any other thoughts on how to make this use case faster would be appreciated.

@tsteur commented on December 7th 2020 Member

The WHERE clause with a one-year lower bound might not really help, though, because most visits might fall within one year (seeing such a DB here). A config setting forcing the look-back to only a few days might work better, but that would require configuration. Ideally it would be fixed automatically.

@tsteur commented on December 7th 2020 Member

Otherwise, a custom index might be a solution.

@tsteur commented on March 1st 2021 Member

Putting this into the milestone to see if we can somehow improve this or work around it, so it stays fast even when a visitor has hundreds of thousands of visits. It may not be easily possible, and that would be fine. Ideally we also wouldn't make the normal tracking request any slower.

@diosmosis commented on March 1st 2021 Member

@tsteur

To make cohorts still useful, how far back would we need to look? 1 year?

This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months.

Though, I would expect the query to be fast if "force new visit on every request" is always used, since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query?

@tsteur commented on March 1st 2021 Member

In our case all visits are recent, like within the last hour. So it's probably unrelated to how far back we look.

@diosmosis commented on March 1st 2021 Member

Maybe it's the index being updated constantly? Just a guess.

@tsteur commented on March 1st 2021 Member

@diosmosis I'm not so much into the details right now, but it was basically fetching thousands of visits every time, so I'm assuming the sorting might be the problem; I'm not really sure. If you're keen to look into this, I can let you know on Slack where this happens.

@diosmosis commented on March 1st 2021 Member

Quick idea: would a local (not synced) file-based cache that saves the idvisitor => idvisit mapping for 30 minutes be useful? If less than 30 minutes have passed, we use the cached value; once 30 minutes pass, we run the query again. If forcing a new visit, we unset the cached value if it's there.
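A minimal sketch of that idea (in Python for illustration; the Matomo tracker is PHP, and `lookup_visit_in_db` stands in for the slow log_visit query):

```python
import time

CACHE_TTL = 30 * 60  # 30 minutes

# idvisitor -> (idvisit, cached_at); in practice this would be a local
# file-based cache per tracker instance, not an in-memory dict
_cache = {}

def get_idvisit(idvisitor, lookup_visit_in_db, force_new_visit=False):
    """Return the visitor's last idvisit, running the slow query only when the cache is stale."""
    if force_new_visit:
        _cache.pop(idvisitor, None)  # invalidate so the new visit is looked up fresh
    entry = _cache.get(idvisitor)
    if entry is not None and time.time() - entry[1] < CACHE_TTL:
        return entry[0]
    idvisit = lookup_visit_in_db(idvisitor)  # the expensive log_visit query
    _cache[idvisitor] = (idvisit, time.time())
    return idvisit
```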

@tsteur commented on March 1st 2021 Member

I'm not sure if it would help because

  • you might have heaps of tracker instances, and each instance would need its own cache; they would also need a way to invalidate each other's caches, otherwise there might be outdated information
  • when tracking heaps of traffic, the caches could get quite big. This could be solved, though, by a custom nested directory structure, e.g. one cache file per configId/visitorId, making sure no directory ever holds too many files (otherwise things get slow)
@tsteur commented on March 2nd 2021 Member

FYI, one (reactive) workaround that might help is adding a third column to the idsite/idvisitor index, like this (not fully tested):

                DROP INDEX `index_idsite_idvisitor` ON log_visit;
                CREATE INDEX `index_idsite_idvisitor` ON log_visit (idsite, idvisitor, visit_last_action_time DESC);

In my case this improved the performance from 0.13436800s to 0.00180650s

@tsteur commented on March 2nd 2021 Member

We could also have a setting to block new visits for a specific visitor after, say, 500 visits. However, this might make it hard for people to troubleshoot this issue, and sometimes you will want to track that many visits: it's not uncommon to reach such a number with daily visitors over 1-2 years. So it can't really be a solution on its own. Maybe there could be some upper limit, though, like 10,000 visits or so.

@diosmosis commented on June 2nd 2021 Member

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.
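The expanding-lookback idea could look roughly like this (Python sketch; `find_visit_between` stands in for the SQL query with both time bounds, and the window sizes are illustrative):

```python
from datetime import datetime, timedelta

# Progressively wider windows: 30 minutes, 1 hour, 1 day, 30 days, then unbounded
WINDOWS = [timedelta(minutes=30), timedelta(hours=1),
           timedelta(days=1), timedelta(days=30), None]

def find_last_visit(idvisitor, now, find_visit_between):
    """Try small time windows first so each query scans few rows;
    fall back to an unbounded lookup only if nothing recent is found."""
    for window in WINDOWS:
        lower = now - window if window is not None else None
        visit = find_visit_between(idvisitor, lower, now)
        if visit is not None:
            return visit
    return None
```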

@tsteur commented on June 2nd 2021 Member

One of the things that makes this query so hard to optimise is the visit_last_action_time requirement: we need to find the correct last visit before that timestamp. Maybe, if the tracking request doesn't provide a custom timestamp, we could think about optimising this query; that would be the case for most requests. Then we would mostly be looking for the most recent match. Possibly this still wouldn't be 100% reliable, though, because visits might be out of order, or it might require an index update, etc.

@tsteur commented on June 2nd 2021 Member

@tsteur I wonder if we can do something like what's done w/ the Live plugin to avoid having to look over too many rows at the same time. For example, we look in the last 30 minutes for a idvisitor, then the last hour, then the last day, and so on until we find a previous visit. There would be more queries, but perhaps the queries would be faster? Just an idea.

I was wondering the same, but I don't think it would help much, because we're using the idvisitor index here, which means it would still need to look at all the visits. It would depend on whether the sorting is the slow, CPU-consuming part; then it might help. Although we'd then often need to select twice, which might make it slightly slower overall again.

@diosmosis commented on June 2nd 2021 Member

Is this something we can test or is high concurrency a contributing factor and it's only slow when we're actively tracking?

@tsteur commented on June 2nd 2021 Member

fyi @diosmosis I've profiled the query and here's the output for a visitor with 20K visits. The big part is indeed creating the sort index. The where was WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39'

[screenshot: query profile]

I've then adjusted the WHERE to only look at the last 2 weeks (WHERE idsite = '3' AND visit_last_action_time > '2021-05-20 19:31:39' AND visit_last_action_time <= '2021-06-04 19:31:39'). The sorting took longer for some reason. I executed it multiple times and it was always slower, consistently taking around 0.07s.

[screenshot: query profile]

I then executed the first query again, where I look at all visits. It took consistently 0.065s and was, surprisingly, always slightly faster. Would need to check what all is part of "Creating sort index" and whether it includes the WHERE or something.

I then ran the query without any time constraint in the WHERE and it didn't really make a difference (used WHERE idsite = '3' AND idvisitor = @v1 ORDER BY visit_last_action_time DESC).

Then I used ORDER BY idsite, visit_last_action_time DESC (as there is such an index) and it seemed slightly faster at 0.06s, but I'm not sure it's actually faster.

To have reliable numbers, we'd need to execute all the queries many times and look at, say, the 95th percentile.

Having no ORDER BY clause made the sorting part disappear from the profile (makes sense), but of course sending the data took ages:

[screenshot: query profile]

It should mean, though, that the fewer entries we select, the faster the sorting should be.

Whether looking e.g. at only the last week first ends up faster really depends on the workload. If most visitors have, say, < 100 visits, then it might make things slower. It also depends on how much time typically passes between visits, etc. So overall it almost depends on a per-site basis and what kind of data is tracked. Changing it would probably make things worse for most users, since there a typical visitor has only a few visits, and often doesn't visit for a week but then returns after a few weeks.

@tsteur commented on June 2nd 2021 Member

Another option would be sorting in PHP, since PHP can be scaled easily:

                SELECT SQL_NO_CACHE visit_last_action_time, idvisit FROM log_visit FORCE INDEX (index_idsite_idvisitor) WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = ...

Again, when there are thousands of visits, the problem might be that we'd be sending and transferring a lot of data.
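Sorting in the application layer would amount to something like this (Python sketch of the idea; the real code would be PHP):

```python
def find_last_visit_app_side(rows):
    """rows: (visit_last_action_time, idvisit) tuples fetched without ORDER BY.
    A single linear scan for the maximum replaces MySQL's "Creating sort index"
    step; the trade-off is transferring every matching row to the application."""
    best = None
    for row in rows:
        if best is None or row[0] > best[0]:
            best = row
    return best
```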

So far I can still only think of adjusting the index as mentioned in https://github.com/matomo-org/matomo/issues/16904#issuecomment-789273870, but this of course affects all writes.

@tsteur commented on June 2nd 2021 Member

FYI, I also tested the query below, thinking it might improve things since there is less data to sort, but it doesn't improve the performance:

                SELECT SQL_NO_CACHE idvisit FROM log_visit FORCE INDEX (index_idsite_idvisitor) WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1 ORDER BY idsite, visit_last_action_time DESC
                LIMIT 1
@tsteur commented on June 2nd 2021 Member

What did make the query fast is the query below. But this will only work if the visits are actually in order and the highest idvisit is the most recent visit:

                SELECT SQL_NO_CACHE idvisit FROM log_visit FORCE INDEX (index_idsite_idvisitor) WHERE idsite = '3' AND visit_last_action_time <= '2021-06-04 19:31:39' AND idvisitor = @v1 ORDER BY idvisit DESC
                LIMIT 1

[screenshot: query profile]

@tsteur commented on June 2nd 2021 Member

If I see this right: basically, if we could cache whether a site ever used a custom timestamp during tracking, then we could optimise the query massively by sorting by idvisit DESC instead. As soon as a custom timestamp is used in a tracking request, we could no longer use this and would fall back to the slow path. Most users probably never send a custom timestamp.
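The optimisation could be guarded by a per-site flag along these lines (Python sketch; the function names and the in-memory dict are illustrative, the real cache would be persistent):

```python
# Per-site flag: True once any tracking request for that site supplied a
# custom timestamp; from then on we must keep the slow, always-correct ordering.
_site_used_custom_ts = {}

def record_request(idsite, has_custom_timestamp):
    """Call on every tracking request to remember custom-timestamp usage."""
    if has_custom_timestamp:
        _site_used_custom_ts[idsite] = True

def order_by_clause(idsite):
    """Use the fast idvisit ordering only while visits are known to be in order."""
    if _site_used_custom_ts.get(idsite):
        return "ORDER BY visit_last_action_time DESC"  # slow but always correct
    return "ORDER BY idvisit DESC"  # fast path
```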

@tsteur commented on June 4th 2021 Member

FYI, I tested adding an index on (idsite, visit_last_action_time DESC), but this does not seem to have helped.

@tsteur commented on June 23rd 2021 Member

We can't reproduce the performance improvement anymore when sorting by idvisit instead of visit_last_action_time, so this doesn't seem to be a solution. For more information see https://github.com/matomo-org/matomo/pull/17649#issuecomment-867210184

@tsteur commented on July 22nd 2021 Member

Let's document this problem and the solution. Once done, let's close it as wontfix.

@tsteur commented on October 7th 2021 Member

@peterhashair I think you documented this one? Is it fully published?

@peterhashair commented on October 7th 2021 Contributor

@tsteur just published it. This issue should be good to go.

@justinvelluppillai commented on October 7th 2021 Contributor

@peterhashair can you please post a link to the new document and close this issue?

@peterhashair commented on October 7th 2021 Contributor
This Issue was closed on October 7th 2021