
Faster raw log data deletion #14844

Merged: 5 commits into 3.x-dev on Sep 3, 2019

Conversation

@tsteur (Member) commented Sep 2, 2019

This further improves the performance of raw data deletion on the log tables, after #14840 already partially ensured the index is used.

Before this change, when we delete raw data from log tables, queries like these would be executed:

SELECT idvisit FROM `log_visit` WHERE idvisit > 290 AND visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

SELECT idvisit FROM `log_visit` WHERE idvisit > 1290 AND visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) ORDER BY idvisit ASC LIMIT 1000

Where idvisit > 2290 ...
...

However, this means that for raw data deletion, MySQL needs to look at each visit within that time range, read the idvisit, store the ids in memory or a temp table, and order them afterwards. That's not efficient when deleting, because we could simply execute the same query every time:

SELECT idvisit FROM `log_visit` WHERE visit_last_action_time < '2019-06-02 03:04:05' AND idsite IN (1,5,7) LIMIT 1000

Now MySQL can just look at any 1000 matching visits, which is fast thanks to the (idsite, visit_last_action_time) index, instead of scanning potentially many millions of visits and sorting them. It's a lot more efficient and reduces IO quite a bit, especially considering we execute this query VERY often when there are millions of visits to delete: to delete 10M visits we would previously scan the visits in that time range around 10,000 times (one query per batch of 1000 deletions).
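The deletion loop described above can be sketched as follows. This is a minimal illustration, not the actual Matomo implementation: SQLite stands in for MySQL, the table and column names follow the queries above, and the helper name is hypothetical. Because each batch is deleted before the next SELECT runs, the same unordered LIMIT query can be repeated until nothing matches.

```python
import sqlite3

def delete_old_visits(conn, cutoff, site_ids, batch_size=1000):
    """Repeatedly run the same bounded SELECT and delete the returned
    batch until no matching rows remain. Since deleted rows never
    reappear, no idvisit cursor and no ORDER BY are needed."""
    site_ph = ",".join("?" for _ in site_ids)
    select_sql = (
        f"SELECT idvisit FROM log_visit "
        f"WHERE visit_last_action_time < ? AND idsite IN ({site_ph}) "
        f"LIMIT ?"
    )
    deleted = 0
    while True:
        ids = [row[0] for row in conn.execute(select_sql, (cutoff, *site_ids, batch_size))]
        if not ids:
            return deleted
        id_ph = ",".join("?" for _ in ids)
        conn.execute(f"DELETE FROM log_visit WHERE idvisit IN ({id_ph})", ids)
        deleted += len(ids)

# Example setup: an in-memory database standing in for the log tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit ("
    "idvisit INTEGER PRIMARY KEY, idsite INTEGER, visit_last_action_time TEXT)"
)
conn.executemany(
    "INSERT INTO log_visit (idsite, visit_last_action_time) VALUES (?, ?)",
    [(1, "2019-05-01 00:00:00")] * 2500 + [(2, "2019-05-01 00:00:00")] * 10,
)
# Site 2 is not in the list, so its 10 visits must survive.
n = delete_old_visits(conn, "2019-06-02 03:04:05", [1, 5, 7], batch_size=1000)
```

Here the loop issues three SELECTs (1000 + 1000 + 500 rows) plus one final empty SELECT, mirroring the batching behaviour described above.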

Noticed the forAllLogs() method is also called from VisitorGeolocator, where it does not delete data and therefore still needs the idvisit > ? ORDER BY idvisit logic, as otherwise it can't guarantee the callback is applied to every row.

Did my best to keep this logic difference somewhat simple. It would otherwise need two separate methods, which wouldn't make things better in the end.
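For contrast, the non-deleting path used by VisitorGeolocator has to keep the keyset-pagination form, since rows stay in the table and each query must advance past the ones already processed. A minimal sketch of that loop (hypothetical helper name, SQLite standing in for MySQL):

```python
import sqlite3

def for_all_visits(conn, cutoff, site_ids, callback, batch_size=1000):
    """Apply callback to every matching row exactly once without
    deleting anything, paginating with WHERE idvisit > ? ORDER BY idvisit."""
    site_ph = ",".join("?" for _ in site_ids)
    sql = (
        f"SELECT idvisit FROM log_visit "
        f"WHERE idvisit > ? AND visit_last_action_time < ? "
        f"AND idsite IN ({site_ph}) "
        f"ORDER BY idvisit ASC LIMIT ?"
    )
    last_id = 0
    while True:
        ids = [r[0] for r in conn.execute(sql, (last_id, cutoff, *site_ids, batch_size))]
        if not ids:
            return
        for idvisit in ids:
            callback(idvisit)
        last_id = ids[-1]  # advance the cursor past the processed batch

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit ("
    "idvisit INTEGER PRIMARY KEY, idsite INTEGER, visit_last_action_time TEXT)"
)
conn.executemany(
    "INSERT INTO log_visit (idsite, visit_last_action_time) VALUES (?, ?)",
    [(1, "2019-05-01 00:00:00")] * 25,
)
seen = []
for_all_visits(conn, "2019-06-02 03:04:05", [1, 5, 7], seen.append, batch_size=10)
```

Without the idvisit cursor and ORDER BY, this loop would return the same first batch forever; that is exactly why the two call sites need slightly different query shapes.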

@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Sep 2, 2019
@tsteur tsteur added this to the 3.12.0 milestone Sep 2, 2019
$lastId = 0;
if ($useReader) {
$db = Db::getReader();
tsteur (Member Author) commented:
We can no longer use the reader for this, but that's fine as the query is now fast.

If we were still using the reader, we would risk re-reading visits that were already deleted on the master but whose deletion had not yet replicated to the reader.

@tsteur tsteur added the Needs Review PRs that need a code review label Sep 3, 2019
@diosmosis diosmosis merged commit 17cc46f into 3.x-dev Sep 3, 2019
@diosmosis diosmosis deleted the rawlogdeletetweak branch September 3, 2019 09:18