Slow query from purgeInvalidatedArchiveTable (query is 3-10min long on bigger instances) #6508

Closed
mattab opened this issue Oct 23, 2014 · 6 comments

mattab (Member) commented Oct 23, 2014

The goal of this issue is to make the query in purgeInvalidatedArchiveTable faster.

In Piwik 2.8.0 we're seeing that the nice SQL query introduced in #6116 is sometimes quite slow (3-10 minutes on bigger instances).
The query is triggered, for example, when the log importer is used.
Maybe we could make this query faster?

MySQL slow query log report:


Count: 13  Time=342.83s (4456s)  Lock=0.00s (0s)  Rows=0.0 (0), 13users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_06` t1
  INNER JOIN `piwik_archive_blob_2014_06` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'


Count: 16  Time=253.72s (4059s)  Lock=0.00s (0s)  Rows=0.0 (0), 16users@2hosts
  SELECT t1.idarchive FROM `piwik_archive_blob_2014_07` t1
  INNER JOIN `piwik_archive_blob_2014_07` t2
  ON t1.name = t2.name AND t1.idsite=t2.idsite
  AND t1.date1=t2.date1 AND t1.date2=t2.date2 AND t1.period=t2.period
  WHERE t1.value = N
  AND t2.value IN(N, N)
  AND t1.ts_archived < t2.ts_archived AND t1.name LIKE 'S'


@mattab added the Task label Oct 23, 2014
@mattab added this to the Short term milestone Oct 23, 2014
mattab (Member, Author) commented Oct 28, 2014

Currently we purge invalidated data by looping over each archive table: https://github.com/piwik/piwik/blob/master/core/DataAccess/ArchivePurger.php#L28-29

Instead maybe we could keep track of which dates were invalidated, and only run the query on tables for the months where some data was invalidated?
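
A minimal sketch of that bookkeeping, assuming the existing piwik_option table (option_name, option_value, autoload) is reused; the option name format and values below are purely illustrative, not an actual implementation:

  -- when a log import invalidates reports, remember the affected month
  -- (the option name/value format here is hypothetical)
  INSERT INTO piwik_option (option_name, option_value, autoload)
  VALUES ('invalidated_archive_2014_06', '1,3', 0)
  ON DUPLICATE KEY UPDATE option_value = '1,3';

  -- the scheduled purge task then only has to look at the archive tables for
  -- months that have such a marker, instead of iterating over every table
  SELECT option_name, option_value
  FROM piwik_option
  WHERE option_name LIKE 'invalidated_archive_%';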

@mattab modified the milestones: Piwik 2.9.0, Short term Oct 30, 2014
tsteur (Member) commented Nov 4, 2014

Do we have a sample SQL query to actually test this? Maybe it can be split into two queries to make it faster? Not sure how many archives are usually invalidated... I think this is only executed daily, so 3-10min actually sounds OK, maybe. All this 'done' stuff seems to be a bit weird anyway; maybe it makes sense to refactor it at some point.
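
A possible two-step variant, as a sketch only. It assumes the masked literals in the slow log are the done flags (1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED) and a 'done%' name pattern:

  -- step 1: for each re-processed report, find the newest usable (OK or temporary) archive
  CREATE TEMPORARY TABLE tmp_latest_done AS
  SELECT idsite, date1, date2, period, name, MAX(ts_archived) AS max_ts
  FROM piwik_archive_blob_2014_06
  WHERE name LIKE 'done%' AND value IN (1, 3)
  GROUP BY idsite, date1, date2, period, name;

  -- step 2: purge only the invalidated archives that are older than that usable copy
  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1
  INNER JOIN tmp_latest_done t2
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.value = 4
    AND t1.ts_archived < t2.max_ts;

Whether that is actually faster on a big instance would need to be checked with EXPLAIN; it mainly shrinks the right-hand side of the join to one row per re-processed report.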

mattab (Member, Author) commented Nov 4, 2014

The reason it's so slow is that it takes 3-10min per monthly archive table. So if a user has 5 years of data in Piwik, it will take 60 * 5min = 5 hours. I'd suggest keeping track of which months were invalidated (eg. in the option table) and only running the query on tables for the months where some data was invalidated.

tsteur (Member) commented Nov 4, 2014

Different question: in this query, why is there a check whether there is also an OK or temporary archive? "Invalidated" sounds to me like "it is invalid", meaning it can be removed no matter what. If needed it will be reprocessed.

I presume there is a reason for this, as it wouldn't be there otherwise. Maybe we can at least fix the naming, as "invalidated" sounds confusing if an invalidated archive is still used.

mattab (Member, Author) commented Nov 4, 2014

Invalidated sounds to me as in "it is invalid" meaning it can be removed no matter what.

Ok, sorry the code does not make this clear enough. Basically, users will import logs, and at the end of the log import it marks archives as invalidated. Now imagine the scheduled task runs after the log import and before the next core:archive run (core:archive could run next only 10-20 hours afterwards). Then for those 10-20 hours, if we deleted everything that was invalidated, we would have "no data" gaps. That's why we only delete data after it has already been re-processed (the check for the OK/temporary archive). This was done in #5932
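
A reading of the purge query above with that in mind, assuming the flag values 1 = DONE_OK, 3 = DONE_OK_TEMPORARY, 4 = DONE_INVALIDATED:

  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1          -- t1: the invalidated archive
  INNER JOIN piwik_archive_blob_2014_06 t2    -- t2: another archive for the same report/site/period
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.value = 4                          -- t1 was marked invalidated
    AND t2.value IN (1, 3)                    -- t2 is OK or temporary, i.e. already re-processed
    AND t1.ts_archived < t2.ts_archived       -- and t2 is newer than t1
    AND t1.name LIKE 'done%'                  -- only look at the "done" flag rows

So an invalidated archive is only selected for deletion once a newer usable copy exists, which is what avoids the gap.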

Now I'm wondering how to best solve this issue... thinking.

@mattab self-assigned this Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
…hiveInvalidator to include the API CoreAdminHome.invalidateArchivedReports and InvalidatedReports to store the info of which report was invalidated
mattab (Member, Author) commented Nov 5, 2014

The purge query is here: https://github.com/piwik/piwik/blob/master/core/DataAccess/Model.php#L25-25

Assumptions

  • the scheduled task can run many times (and/or for days) before core:archive next re-processes the website(s) that were invalidated.
  • there can be many log imports at the same time each invalidating their own months and websites
  • core:archive may run several times before it archives all the websites that have been invalidated.

Notes

  • there is an index on the archive_ tables, INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived), that we should use so the purge only touches data for the websites that were imported. This is missing from the logic for now (see the sketch below).
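
A sketch of what that restriction could look like; the idsite values are purely illustrative and the done-flag values 1/3/4 are the same assumptions as above:

  SELECT t1.idarchive
  FROM piwik_archive_blob_2014_06 t1
  INNER JOIN piwik_archive_blob_2014_06 t2
    ON t1.name = t2.name AND t1.idsite = t2.idsite
    AND t1.date1 = t2.date1 AND t1.date2 = t2.date2 AND t1.period = t2.period
  WHERE t1.idsite IN (2, 5)                   -- only the sites whose data was re-imported
    AND t1.value = 4
    AND t2.value IN (1, 3)
    AND t1.ts_archived < t2.ts_archived
    AND t1.name LIKE 'done%'

With the idsite filter in place, MySQL should be able to narrow both sides of the self-join through index_idsite_dates_period instead of scanning the whole monthly table.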

I started refactoring and I think I know how to fix it. I've self-assigned the issue!

mattab added a commit that referenced this issue Nov 5, 2014
mattab added a commit that referenced this issue Nov 5, 2014
…able and also only for the websites IDs that were invalidated
@mattab closed this as completed in 9c15e1d Nov 5, 2014