Say you are tracking apps and you send a userId with every tracking request. Then by default the userId will be used to generate the visitorId. It may also be a problem if for every action a new visit is forced.
Queries like this will become very slow:
explain extended SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id, visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_seconds_since_first, visitor_seconds_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region, location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type, idsite, profilable, visit_entry_idaction_url, visit_total_actions, visit_total_interactions, visit_total_searches, referer_url, config_browser_name, config_client_type, config_device_brand, config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip, location_browser_lang, campaign_content, campaign_group, campaign_id, campaign_keyword, campaign_medium, campaign_name, campaign_placement, campaign_source, last_idlink_va, custom_dimension_1, custom_dimension_2, custom_dimension_3, custom_dimension_4, custom_dimension_5, custom_dimension_6, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3, custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5 FROM log_visit FORCE INDEX (index_idsite_idvisitor) WHERE idsite = '4' AND visit_last_action_time <= '2020-12-07 20:34:48' AND idvisitor = '...' ORDER BY visit_last_action_time DESC LIMIT 1
To workaround this performance issue set
[Tracker]enable_userid_overwrites_visitorid=1 in your
config/config.ini.php if userId is used. If userId is not used, there is currently no workaround.
Ideally we find a way to not needing this setting though. Maybe we could at least limit time range by adding to the where
visit_last_action_time > 1 year before upper limit. It won't make it much better but at least it be slightly faster especially when the table has over the years collected many hundred thousands or millions of requests for the same visitor. In our use case we have for example for the same idVisitor 1.6M entries which is roughly 30% of all visits for the same visitor.
We used to have this lower limit in the where clause but AFAIK we removed it for Cohorts. @diosmosis do you maybe remember? To make cohorts still useful, how far back would we need to look? 1 year? Or would it be fine to only look back 6 months? I suppose we'd need to look back kind of ideally at least 1 year for the reports to be fine when you look at months for cohorts.
Any other thoughts on how to make this use case faster be appreciated
The where clause with one year back might not really help though because most visits might be within one year (seeing such a DB here). Maybe a config setting forcing look back to be only few days might be better but this would need a config. Ideally it was fixed automatically.
A solution might be otherwise a custom index.
Putting this into the milestone to see if we can somehow improve this or workaround it when a visitor has hundreds of thousands of visit to still have it fast. It may not be easily possible and that be fine. Ideally we also wouldn't really make it slower for the normal tracking request.
To make cohorts still useful, how far back would we need to look? 1 year?
This depends on whether you'd want the year period to work w/ cohorts. If not, then it's really a judgement call on the number of months.
Though, I would expect the query to be fast if requests force new visits is always used? Since the last known visit would normally be recent? Unless it's common for visitors to come from years ago? Or does the ordering not get applied in the query?
In our case all visits are recent like last 1 hour. So it's probably unrelated to how far back we look.
Maybe it's the index being updated constantly? Just a guess.
@diosmosis I'm not so much into details right now but it was basically fetching thousands of visits every time so I'm assuming the sorting might be the problem or so I'm not really sure. If you're keen to look into this I can let you know where this happens on slack potentially.
Quick idea: Would a local (not synced) file based cache that just saved the idvisit => idvisitor mapping for 30mins be useful? If less than 30 mins, we use the cached value. If 30 mins passes we run the query again. If forcing a new visit, we unset the cache value if it's there.
I'm not sure if it would help because
FYI one (reactive) workaround that might help is adding a third column to the index idsite idvisitor index like this (not fully tested):
drop index `index_idsite_idvisitor` on log_visit; CREATE INDEX `index_idsite_idvisitor` ON log_visit (idsite, idvisitor, visit_last_action_time DESC)
In my case this improved the performance from
We could also have a setting to block new visits for a specific visitor after for example say 500 visits. However, this might make it hard for people to troubleshoot this issue and sometimes you will want to track that many visits. It's not uncommon that if you have daily visitors over a 1-2 years to reach such a number. So it can't really be a solution. Maybe there could be some upper limit though like 10000 visits or so.