Changed only the `insertActionsToKeep` function.
Note: the test failures are unrelated to these changes.
We will need to take a good look at it, test it, etc. I noticed the current version is super slow and takes ages to purge data even for a fairly small site. cc @mattab
If you need more info, I am happy to provide.
Code looks good, but we would need to test it and then also compare performance. The biggest performance problem is the start from `min(id)`, I reckon 👍 Did you test whether the insert you did is faster than the `INSERT ... SELECT`?
@tsteur Selecting min(id) by itself takes multiple minutes. I do not know why, but somehow it is hard to get the minimum value from that table. I thought there was an index on it, so it should be fast. I have not investigated yet, as I do not see it as an issue. What I have seen is that selecting the minimum value took 12 minutes on production. I do not know why; it is a long time. But still, when the minimum value is somewhere around 1200000, starting from it saves roughly 12 × (number of fields) queries that would otherwise run through the whole table without finding anything. So the min(id) is there to reduce the number of 'empty' result queries caused by the growing auto-increment field.
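To make the arithmetic above concrete, here is a small sketch. The minimum id of ~1,200,000 comes from the comment; the window size of 100,000 is an assumption (it is what makes the "12" work out), not a value confirmed in this thread:

```python
# Why starting from min(id) helps: if the smallest id is ~1,200,000 and the
# scan advances in id windows of 100,000, the first 12 windows per field
# would return no rows at all. Window size is an illustrative assumption.
MIN_ID = 1_200_000
QUERY_LIMIT = 100_000

empty_windows_per_field = MIN_ID // QUERY_LIMIT
print(empty_windows_per_field)  # 12 'empty' result queries skipped per field
```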
But let's take the following example. We have the table `piwik_log_link_visit_action`, where all the ID fields are queried. So there is a loop over the fields `idsite`, `idvisitor`, `idvisit`, `idaction_url_ref`, etc. Each iteration requests only one field from the table and then walks through the complete table (which has about 556,970,018 records). The code therefore runs (number of fields) × (total records in table / query limit) queries. That would take at least one week, almost two, to process.
So what I did was remove the loop over the fields `idsite`, `idvisitor`, `idvisit`, `idaction_url_ref`, etc. and create a single query that requests all those fields at once. As the data is (BIG)INT, it cannot be that big and should not cause memory errors. Now there are only 1 × (total records in table / query limit) queries. This is the big time saver.
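A back-of-the-envelope comparison of the two formulas above. The record count is the one from this thread; the field count and query limit are illustrative assumptions, not Matomo's actual configuration:

```python
# Query-count estimate: old per-field loop vs. new combined scan.
TOTAL_RECORDS = 556_970_018   # size of piwik_log_link_visit_action (from above)
QUERY_LIMIT = 100_000         # assumed batch size per SELECT
NUM_ID_FIELDS = 10            # idsite, idvisitor, idvisit, ... (assumed count)

batches_per_field = -(-TOTAL_RECORDS // QUERY_LIMIT)  # ceiling division
old_queries = NUM_ID_FIELDS * batches_per_field       # one full pass per field
new_queries = batches_per_field                       # single combined pass

print(old_queries, new_queries)  # 55700 5570
```

Whatever the real constants are, the ratio is always the number of ID fields, which is why collapsing the loop dominates every other optimization here.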
So I get a result set back that contains all the data that should be saved in the temporary table. To store it, a query of the form `INSERT IGNORE INTO [TEMPTABLE] VALUES (idsite),(idvisitor),(idvisit),(idaction_url_ref)` is built, so a single query inserts 1000 records at once. The 1000 is just a number; 10000 might also be possible, depending on the amount of available memory.
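The batching described above can be sketched like this. It is not the PR's PHP code, just an illustration of the idea; the table name and batch size are made up:

```python
# Sketch of building multi-row INSERT IGNORE statements instead of one
# statement per id. Table name and batch size are illustrative only.
def build_batch_inserts(table, ids, batch_size=1000):
    """Yield INSERT IGNORE statements covering up to batch_size ids each."""
    for i in range(0, len(ids), batch_size):
        chunk = ids[i:i + batch_size]
        values = ",".join(f"({v})" for v in chunk)
        yield f"INSERT IGNORE INTO {table} VALUES {values}"

stmts = list(build_batch_inserts("tmp_actions_to_keep", list(range(2500))))
print(len(stmts))  # 3 statements: 1000 + 1000 + 500 rows
```

The IGNORE keyword makes duplicate ids harmless, which matters because the same action id can show up in several of the scanned columns.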
My insert query is not faster than `INSERT ... SELECT`, but that is not the point here. The point is that I have reduced the total number of queries run during the cleanup. Because you are using `INSERT ... SELECT`, you are basically forced to select only one field at a time, and that forces multiple passes over the same data set, which is the killer.
We have noticed that this fix uses a lot of disk I/O: our SSD showed 70% utilization while running the new cleanup code. So it takes more resources to clean up, but it now takes only about 2 days instead of 2 weeks, so we are happy with it.
> Selecting min(id) by itself takes multiple minutes. I do not know why, but somehow it is hard to get the minimum value from a table. I thought there was an index on it, so it should be fast. I have not investigated yet, as I do not see it as an issue. What I have seen is that selecting the minimum value took 12 minutes on production.
Interesting. As there is an index on this field (it is the first column, and on the log_visit table even the primary key), I would also have expected this query to be quite fast.
Ping? any updates on this?
Let's review this Pull Request for the next release. We also found this to be a slow part of Matomo and needs to be improved :+1:
That would be awesome! We created a patch for this and have been running our production platform with it for a year now. We do not see any issues regarding missing data, and it is still very fast :)