Slow query from purgeInvalidatedArchiveTable (query is 3-10min long on bigger instances) #6508

Closed
mattab opened this issue Oct 23, 2014 · 6 comments

mattab (Member) commented Oct 23, 2014

The goal of this issue is to make the query in purgeInvalidatedArchiveTable faster.

In Piwik 2.8.0 we're seeing that the nice SQL query introduced in #6116 is sometimes quite slow (3-10 minutes on bigger instances).
The query is triggered, for example, when the log importer is used.
Maybe we could make this query faster?

MySQL slow query log report:


Count: 13  Time=342.83s (4456s)  Lock=0.00s (0s)  Rows=0.0 (0), 13users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_06` t1
  INNER JOIN `piwik_archive_blob_2014_06` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'


Count: 16  Time=253.72s (4059s)  Lock=0.00s (0s)  Rows=0.0 (0), 16users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_07` t1
  INNER JOIN `piwik_archive_blob_2014_07` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'


@mattab added the Task label Oct 23, 2014
@mattab added this to the Short term milestone Oct 23, 2014
mattab (Member, Author) commented Oct 28, 2014

Currently we purge invalidated data by looping over each archive table: https://github.com/piwik/piwik/blob/master/core/DataAccess/ArchivePurger.php#L28-29

Instead maybe we could keep track of which dates were invalidated, and only run the query on tables for the months where some data was invalidated?
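
A minimal sketch of that bookkeeping, assuming the existing piwik_option table (option_name, option_value, autoload) is reused; the option name format and values below are purely illustrative, not an actual implementation:

  -- when a log import invalidates reports, remember the affected month
  -- (the option name/value format here is hypothetical)
  INSERT INTO piwik_option (option_name, option_value, autoload)
  VALUES ('invalidated_archive_2014_06', '1,3', 0)
  ON DUPLICATE KEY UPDATE option_value = '1,3';

  -- the scheduled purge task then only has to look at the archive tables for
  -- months that have such a marker, instead of iterating over every table
  SELECT option_name, option_value
  FROM piwik_option
  WHERE option_name LIKE 'invalidated_archive_%';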

@mattab modified the milestones: Piwik 2.9.0, Short term Oct 30, 2014
tsteur (Member) commented Nov 4, 2014

Do we have a sample SQL query to actually test this? Maybe it can be split into two queries to make it faster? Not sure how many archives are usually invalidated... I think this is only executed daily, so 3-10min actually sounds OK, maybe. All this 'done' stuff seems to be a bit weird anyway; maybe it makes sense to refactor it at some point.
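
A possible two-step variant, as a sketch only. It assumes the masked literals in the slow log are the done flags (1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED) and a 'done%' name pattern:

  -- step 1: for each re-processed report, find the newest usable (OK or temporary) archive
  CREATE TEMPORARY TABLE tmp_latest_done AS
  SELECT idsite, date1, date2, period, name, MAX(ts_archived) AS max_ts
  FROM piwik_archive_blob_2014_06
  WHERE name LIKE 'done%' AND value IN (1, 3)
  GROUP BY idsite, date1, date2, period, name;

  -- step 2: purge only the invalidated archives that are older than that usable copy
  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1
  INNER JOIN tmp_latest_done t2
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.value = 4
    AND t1.ts_archived < t2.max_ts;

Whether that is actually faster on a big instance would need to be checked with EXPLAIN; it mainly shrinks the right-hand side of the join to one row per re-processed report.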

mattab (Member, Author) commented Nov 4, 2014

The reason it's so slow is that it takes 3-10min per monthly archive table. So if a user has 5 years of data in Piwik, it will take 60 * 5min = 5 hours. I'd suggest keeping track of which months were invalidated (eg. in the option table) and only running the query on tables for the months where some data was invalidated.

tsteur (Member) commented Nov 4, 2014

Different question: in this query, why is there a check whether there is also an OK or temporary archive? "Invalidated" sounds to me like "it is invalid", meaning it can be removed no matter what. If needed it will be reprocessed.

I presume there is a reason for this, as it wouldn't be there otherwise. Maybe we can at least fix the naming, as "invalidated" sounds confusing if an invalidated archive is still used.

mattab (Member, Author) commented Nov 4, 2014

Invalidated sounds to me as in "it is invalid" meaning it can be removed no matter what.

Ok, sorry the code does not make this clear enough. Basically, users will import logs, and at the end of the log import it marks archives as invalidated. Now imagine the scheduled task runs after the log import and before the next core:archive run (core:archive could run next only 10-20 hours afterwards). Then for those 10-20 hours, if we deleted everything that was invalidated, we would have "no data" gaps. That's why we only delete data after it has already been re-processed (the check for the OK/temporary archive). This was done in #5932
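
A reading of the purge query above with that in mind, assuming the flag values 1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED:

  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1          -- t1: the invalidated archive
  INNER JOIN piwik_archive_blob_2014_06 t2    -- t2: another archive for the same report/site/period
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.value = 4                          -- t1 was marked invalidated
    AND t2.value IN (1, 3)                    -- t2 is OK or temporary, i.e. already re-processed
    AND t1.ts_archived < t2.ts_archived       -- and t2 is newer than t1
    AND t1.name LIKE 'done%'                  -- only look at the "done" flag rows

So an invalidated archive is only selected for deletion once a newer usable copy exists, which is what avoids the gap.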

Now I'm wondering how to best solve this issue... thinking.

@mattab self-assigned this Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
…hiveInvalidator to include the API CoreAdminHome.invalidateArchivedReports and InvalidatedReports to store the info of which report was invalidated
mattab (Member, Author) commented Nov 5, 2014

The purge query is here: https://github.com/piwik/piwik/blob/master/core/DataAccess/Model.php#L25-25

Assumptions

  • the scheduled task can run many times (and/or for days) before core:archive next re-processes the website(s) that were invalidated.
  • there can be many log imports at the same time each invalidating their own months and websites
  • core:archive may run several times before it archives all the websites that have been invalidated.

Notes

  • there is an index on the archive_ tables, INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived), that we should use so the purge only touches data for the websites that were imported. This is missing from the logic for now (see the sketch below).
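
A sketch of what that restriction could look like; the idsite values are purely illustrative and the done-flag values 1/3/4 are the same assumptions as above:

  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1
  INNER JOIN piwik_archive_blob_2014_06 t2
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.idsite IN (2, 5)                   -- only the sites whose data was re-imported
    AND t1.value = 4
    AND t2.value IN (1, 3)
    AND t1.ts_archived < t2.ts_archived
    AND t1.name LIKE 'done%'

With the idsite filter in place, MySQL should be able to narrow both sides of the self-join through index_idsite_dates_period instead of scanning the whole monthly table.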

I started refactoring and I think I know how to fix it. I've self-assigned the issue!

mattab added a commit that referenced this issue Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
…able and also only for the websites IDs that were invalidated
@mattab closed this as completed in 9c15e1d Nov 5, 2014