@tsteur opened this Pull Request on September 2nd 2019 Member

This improves the performance for raw data deletion on the log tables further after already making partially suer the index is used in https://github.com/matomo-org/matomo/pull/14840

Before this change, when we delete raw data from log tables, queries like these would be executed:

SELECT idvisit FROM `log_visit` WHERE idvisit > 290 AND visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

SELECT idvisit FROM `log_visit` WHERE idvisit > 1290 AND  visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

Where idvisit > 2290 ...

However, this means for raw data deletion, MySQL needs to look at each visit within that time range , read the idvisit, store them in memory or tmp table, and order them afterwards. That's not quite efficient when we are deleting because we could simply always execute the same query:

SELECT idvisit FROM `log_visit` WHERE visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) LIMIT 1000

Now Mysql can just randomly look only at 1000 visits which is fast because of the idsite, visit_last_action_time index instead of looking at potentially many millions of visits and sorting them etc. It's a lot more efficient and reduces IO quite a bit. Especially considering we're executing this query VERY often when there are millions of visits to be deleted where we looked say 10,000 times over all visits in that time range just to delete 10M visits (10,000 times executing the query to delete 1000 visits each time).

Noticed the forAllLogs() method is also called from VisitorGeolocator where it is not deleting data and therefore it actually needs the idvisit > ? ORDER BY idvisit logic as it otherwise can't make sure to apply the callback logic to each method.

Did my best to keep this logic difference somewhat simple. Would otherwise need two different methods for this but that doesn't make things better in the end.

This Pull Request was closed on September 3rd 2019
Powered by GitHub Issue Mirror