@tsteur opened this Issue on December 7th 2020 Member

Say you are tracking apps and you send a userId with every tracking request. Then by default the userId will be used to generate the visitorId. It may also be a problem if for every action a new visit is forced.

Queries like this will become very slow:

explain extended SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id, visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_seconds_since_first, visitor_seconds_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region, location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type, idsite, profilable, visit_entry_idaction_url, visit_total_actions, visit_total_interactions, visit_total_searches, referer_url, config_browser_name, config_client_type, config_device_brand, config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip, location_browser_lang, campaign_content, campaign_group, campaign_id, campaign_keyword, campaign_medium, campaign_name, campaign_placement, campaign_source, last_idlink_va, custom_dimension_1, custom_dimension_2, custom_dimension_3, custom_dimension_4, custom_dimension_5, custom_dimension_6, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3, custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5  FROM log_visit FORCE INDEX (index_idsite_idvisitor)  WHERE idsite = '4' AND visit_last_action_time <= '2020-12-07 20:34:48' AND idvisitor = '...'
                ORDER BY visit_last_action_time DESC
                LIMIT 1

Explain:
image

or
image

To workaround this performance issue set [Tracker]enable_userid_overwrites_visitorid=1 in your config/config.ini.php if userId is used. If userId is not used, there is currently no workaround.

Ideally we find a way to not needing this setting though. Maybe we could at least limit time range by adding to the where visit_last_action_time > 1 year before upper limit. It won't make it much better but at least it be slightly faster especially when the table has over the years collected many hundred thousands or millions of requests for the same visitor. In our use case we have for example for the same idVisitor 1.6M entries which is roughly 30% of all visits for the same visitor.

We used to have this lower limit in the where clause but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To make cohorts still useful, how far back would we need to look? 1 year? Or would it be fine to only look back 6 months? I suppose we'd need to look back kind of ideally at least 1 year for the reports to be fine when you look at months for cohorts.

Any other thoughts on how to make this use case faster be appreciated

@tsteur commented on December 7th 2020 Member

The where clause with one year back might not really help though because most visits might be within one year (seeing such a DB here). Maybe a config setting forcing look back to be only few days might be better but this would need a config. Ideally it was fixed automatically.

@tsteur commented on December 7th 2020 Member

A solution might be otherwise a custom index.

@tsteur commented on March 1st 2021 Member

Putting this into the milestone to see if we can somehow improve this or workaround it when a visitor has hundreds of thousands of visit to still have it fast. It may not be easily possible and that be fine. Ideally we also wouldn't really make it slower for the normal tracking request.

@diosmosis commented on March 1st 2021 Member

@tsteur

To make cohorts still useful, how far back would we need to look? 1 year?

This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months.

Though, I would expect the query to be fast if requests force new visits is always used? Since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query?

@tsteur commented on March 1st 2021 Member

In our case all visits are recent like last 1 hour. So it's probably unrelated to how far back we look.

@diosmosis commented on March 1st 2021 Member

Maybe it's the index being updated constantly? Just a guess.

@tsteur commented on March 1st 2021 Member

@diosmosis I'm not so much into details right now but it was basically fetching thousands of visits every time so I'm assuming the sorting might be the problem or so I'm not really sure. If you're keen to look into this I can let you know where this happens on slack potentially.

@diosmosis commented on March 1st 2021 Member

Quick idea: Would a local (not synced) file based cache that just saved the idvisit => idvisitor mapping for 30mins be useful? If less than 30 mins, we use the cached value. If 30 mins passes we run the query again. If forcing a new visit, we unset the cache value if it's there.

@tsteur commented on March 1st 2021 Member

I'm not sure if it would help because

  • you might have heaps of tracker instances and each instance would need its own cache plus they would need to have a way to invalidate each other's cache otherwise there might be outdated information
  • when tracking heaps of traffic then the caches could be quite big. This could be solved though by having a custom nested directory structure for this to make sure to have one cache file per eg configId/visitorId and to never have too many files in one directory (otherwise things get slow)
@tsteur commented on March 2nd 2021 Member

FYI one (reactive) workaround that might help is adding a third column to the index idsite idvisitor index like this (not fully tested):

                drop index `index_idsite_idvisitor` on log_visit;
                CREATE INDEX `index_idsite_idvisitor` ON log_visit (idsite, idvisitor, visit_last_action_time DESC)

In my case this improved the performance from 0.13436800s to 0.00180650s

@tsteur commented on March 2nd 2021 Member

We could also have a setting to block new visits for a specific visitor after for example say 500 visits. However, this might make it hard for people to troubleshoot this issue and sometimes you will want to track that many visits. It's not uncommon that if you have daily visitors over a 1-2 years to reach such a number. So it can't really be a solution. Maybe there could be some upper limit though like 10000 visits or so.

Powered by GitHub Issue Mirror