@EreMaijala opened this Issue on December 4th 2018

The changes made in #13227 for deleteLogDataForDeletedSites seem to have caused a major performance issue. They result in a query that locks a large Piwik database for a long time, potentially several minutes. As far as I can see the queries will also never accomplish anything, since the where clause in the queries contains "site.idsite IS NULL" (created here: https://github.com/matomo-org/matomo/blob/70505b1ebb5584a8a859bcb9a54e0cb50d66677e/plugins/PrivacyManager/Model/DataSubjects.php#L408), and idsite in the site table or log_visit table cannot be null. The resulting queries look like this:

DELETE log_conversion FROM piwik_log_conversion log_conversion LEFT JOIN piwik_log_visit log_visit ON log_conversion.idvisit = log_visit.idvisit  LEFT JOIN piwik_site site ON site.idsite = log_visit.idsite WHERE site.idsite IS NULL AND log_visit.idsite <= '3953'
DELETE log_conversion_item FROM piwik_log_conversion_item log_conversion_item LEFT JOIN piwik_log_visit log_visit ON log_conversion_item.idvisit = log_visit.idvisit  LEFT JOIN piwik_site site ON site.idsite = log_visit.idsite WHERE site.idsite IS NULL AND log_visit.idsite <= '3953'
DELETE log_link_visit_action FROM piwik_log_link_visit_action log_link_visit_action LEFT JOIN piwik_log_visit log_visit ON log_link_visit_action.idvisit = log_visit.idvisit  LEFT JOIN piwik_site site ON site.idsite = log_visit.idsite WHERE site.idsite IS NULL AND log_visit.idsite <= '3953'
DELETE log_visit FROM piwik_log_visit log_visit LEFT JOIN piwik_site site ON site.idsite = log_visit.idsite WHERE site.idsite IS NULL AND log_visit.idsite <= '3953'

It seems that whatever these queries should be doing is not happening. Instead they're causing tracking to be blocked for a significant time. The query plan for the above queries shows how bad it is since it goes through the whole log_visit table:

+------+-------------+-----------+--------+---------------------------------------------------------------------------+---------+---------+------------------------+----------+-------------------------+
| id   | select_type | table     | type   | possible_keys                                                             | key     | key_len | ref                    | rows     | Extra                   |
+------+-------------+-----------+--------+---------------------------------------------------------------------------+---------+---------+------------------------+----------+-------------------------+
|    1 | SIMPLE      | log_visit | ALL    | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | NULL    | NULL    | NULL                   | 48810128 | Using where             |
|    1 | SIMPLE      | site      | eq_ref | PRIMARY                                                                   | PRIMARY | 4       | piwik.log_visit.idsite |        1 | Using where; Not exists |
+------+-------------+-----------+--------+---------------------------------------------------------------------------+---------+---------+------------------------+----------+-------------------------+
@tsteur commented on December 4th 2018 Member

This looks like a bug / regression indeed. Moving it therefore into 3.8.

The left join with site.idsite is null can't really work as you mention. Would need to be a right join.

Hopefully this will speed up the query and cause less trouble.

@tsteur commented on December 21st 2018 Member

I just tested and the query should work. I can see though how this maybe could affect performance and lock the table especially since it looks at all the data. I reckon more efficient might be to get a list of all existing sites, a list of all idsites used in log_visit, and remove the ones from log_visit with where idsite= ? where they no longer exist. I think we also have a method that doesn't delete all entries at once but does it in batches... however, I reckon if we delete by idSite, there wouldn't be a big LOCK cause it would only lock the rows that match that idSite and likely you would not even track data into it anymore for that site(cause it no longer exists and request would fail).

@tsteur commented on December 21st 2018 Member

Ideally we would also only run it, when sites were actually deleted. Otherwise there would be lots of full table scans I suppose.

This Issue was closed on December 22nd 2018
Powered by GitHub Issue Mirror