
Faster raw log data deletion #14844

Merged: 5 commits into 3.x-dev on Sep 3, 2019

Conversation

@tsteur (Member) commented Sep 2, 2019

This further improves the performance of raw data deletion on the log tables, after #14840 already partially ensured the index is used.

Before this change, when we delete raw data from log tables, queries like these would be executed:

SELECT idvisit FROM `log_visit` WHERE idvisit > 290 AND visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

SELECT idvisit FROM `log_visit` WHERE idvisit > 1290 AND visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

Where idvisit > 2290 ...
...

However, this means that for raw data deletion, MySQL needs to look at each visit within that time range, read the idvisit, store the ids in memory or a temp table, and order them afterwards. That's not efficient when deleting, because we could simply execute the same query every time:

SELECT idvisit FROM `log_visit` WHERE visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) LIMIT 1000

Now MySQL can just look at any 1000 matching visits, which is fast thanks to the (idsite, visit_last_action_time) index, instead of scanning potentially many millions of visits and sorting them. It's a lot more efficient and reduces IO quite a bit, especially considering we execute this query VERY often when there are millions of visits to delete: to delete 10M visits we would previously scan the visits in that time range around 10,000 times (one query per batch of 1000 deletions).
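The deletion loop described above can be sketched as follows. This is a minimal illustration, not the actual Matomo implementation: SQLite stands in for MySQL, the table and column names follow the queries above, and the helper name is hypothetical. Because each batch is deleted before the next SELECT runs, the same unordered LIMIT query can be repeated until nothing matches.

```python
import sqlite3

def delete_old_visits(conn, cutoff, site_ids, batch_size=1000):
    """Repeatedly run the same bounded SELECT and delete the returned
    batch until no matching rows remain. Since deleted rows never
    reappear, no idvisit cursor and no ORDER BY are needed."""
    site_ph = ",".join("?" for _ in site_ids)
    select_sql = (
        f"SELECT idvisit FROM log_visit "
        f"WHERE visit_last_action_time < ? AND idsite IN ({site_ph}) "
        f"LIMIT ?"
    )
    deleted = 0
    while True:
        ids = [row[0] for row in conn.execute(select_sql, (cutoff, *site_ids, batch_size))]
        if not ids:
            return deleted
        id_ph = ",".join("?" for _ in ids)
        conn.execute(f"DELETE FROM log_visit WHERE idvisit IN ({id_ph})", ids)
        deleted += len(ids)

# Example setup: an in-memory database standing in for the log tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit ("
    "idvisit INTEGER PRIMARY KEY, idsite INTEGER, visit_last_action_time TEXT)"
)
conn.executemany(
    "INSERT INTO log_visit (idsite, visit_last_action_time) VALUES (?, ?)",
    [(1, "2019-05-01 00:00:00")] * 2500 + [(2, "2019-05-01 00:00:00")] * 10,
)
# Site 2 is not in the list, so its 10 visits must survive.
n = delete_old_visits(conn, "2019-06-02 03:04:05", [1, 5, 7], batch_size=1000)
```

Here the loop issues three SELECTs (1000 + 1000 + 500 rows) plus one final empty SELECT, mirroring the batching behaviour described above.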

Noticed the forAllLogs() method is also called from VisitorGeolocator, where it does not delete data and therefore still needs the idvisit > ? ORDER BY idvisit logic, as otherwise it can't guarantee the callback is applied to every row.

Did my best to keep this logic difference somewhat simple. It would otherwise need two separate methods, which wouldn't make things better in the end.
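For contrast, the non-deleting path used by VisitorGeolocator has to keep the keyset-pagination form, since rows stay in the table and each query must advance past the ones already processed. A minimal sketch of that loop (hypothetical helper name, SQLite standing in for MySQL):

```python
import sqlite3

def for_all_visits(conn, cutoff, site_ids, callback, batch_size=1000):
    """Apply callback to every matching row exactly once without
    deleting anything, paginating with WHERE idvisit > ? ORDER BY idvisit."""
    site_ph = ",".join("?" for _ in site_ids)
    sql = (
        f"SELECT idvisit FROM log_visit "
        f"WHERE idvisit > ? AND visit_last_action_time < ? "
        f"AND idsite IN ({site_ph}) "
        f"ORDER BY idvisit ASC LIMIT ?"
    )
    last_id = 0
    while True:
        ids = [r[0] for r in conn.execute(sql, (last_id, cutoff, *site_ids, batch_size))]
        if not ids:
            return
        for idvisit in ids:
            callback(idvisit)
        last_id = ids[-1]  # advance the cursor past the processed batch

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit ("
    "idvisit INTEGER PRIMARY KEY, idsite INTEGER, visit_last_action_time TEXT)"
)
conn.executemany(
    "INSERT INTO log_visit (idsite, visit_last_action_time) VALUES (?, ?)",
    [(1, "2019-05-01 00:00:00")] * 25,
)
seen = []
for_all_visits(conn, "2019-06-02 03:04:05", [1, 5, 7], seen.append, batch_size=10)
```

Without the idvisit cursor and ORDER BY, this loop would return the same first batch forever; that is exactly why the two call sites need slightly different query shapes.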

@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Sep 2, 2019
@tsteur tsteur added this to the 3.12.0 milestone Sep 2, 2019
$lastId = 0;
if ($useReader) {
$db = Db::getReader();
tsteur (Member Author) commented:
We can no longer use the reader for this, but that's fine as the query is now fast.

If we were still using the reader, we would risk re-reading visits that were already deleted on the master but whose deletion had not yet replicated to the reader.

@tsteur tsteur added the Needs Review PRs that need a code review label Sep 3, 2019
@diosmosis diosmosis merged commit 17cc46f into 3.x-dev Sep 3, 2019
@diosmosis diosmosis deleted the rawlogdeletetweak branch September 3, 2019 09:18