@diosmosis opened this Issue on November 6th 2019 Member

Now that we invalidate archives when a visit is tracked, we have a couple more opportunities for refactoring:

  1. during invalidation, we can create a new archive w/ done flag value = DONE_INVALIDATED if no archives exist yet for the dates (see the sketch after this list)
  2. in CronArchive, instead of pulling individual sites and checking if there have been visits for those sites, just pull individual invalidated archives, set the done value to DONE_IN_PROGRESS, and initiate archiving for that one archive.
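
For idea 1, a minimal sketch of the invalidation side, assuming a plain PDO handle and the archive_numeric_* row layout (idarchive, idsite, date1, date2, period, name, value, ts_archived); DONE_INVALIDATED = 4 matches the existing ArchiveWriter flag value, and allocateNewArchiveId() is a hypothetical helper:

    // Sketch: during invalidation, insert a placeholder "done" row with
    // DONE_INVALIDATED when no archive exists yet for the site/period.
    const DONE_INVALIDATED = 4; // existing ArchiveWriter flag value

    function insertInvalidatedPlaceholder(PDO $db, string $table, int $idSite,
                                          string $date1, string $date2, int $period): void
    {
        $stmt = $db->prepare(
            "SELECT idarchive FROM $table
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
               AND name LIKE 'done%' LIMIT 1"
        );
        $stmt->execute([$idSite, $date1, $date2, $period]);

        if ($stmt->fetchColumn() === false) {
            // no archive exists for this period yet, so create the placeholder
            $db->prepare(
                "INSERT INTO $table (idarchive, idsite, date1, date2, period, name, value, ts_archived)
                 VALUES (?, ?, ?, ?, ?, 'done', ?, NOW())"
            )->execute([
                allocateNewArchiveId($db, $table), // hypothetical helper
                $idSite, $date1, $date2, $period, DONE_INVALIDATED,
            ]);
        }
    }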

This should allow splitting archiving of large websites across separate processes and simplify the code quite a bit.

Some things to keep in mind:

  • by default we would look only at the archive table covering today's date, and only at sites in the invalidated sites list, instead of iterating over every table (as sketched below), though there could be an option to do the full scan.
  • we should not even start archiving if raw data has been deleted for an invalidated archive.
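
For the first bullet, the default lookup might look roughly like this; the function name and the way the invalidated site list is passed in are assumptions:

    // Sketch: by default, scan only the archive table covering today's date,
    // restricted to the sites in the invalidated sites list.
    function getDefaultInvalidatedArchives(PDO $db, array $invalidatedSiteIds): array
    {
        if (empty($invalidatedSiteIds)) {
            return [];
        }
        // archive tables are partitioned by month, e.g. archive_numeric_2019_11
        $table = 'archive_numeric_' . date('Y_m');
        $in    = implode(',', array_fill(0, count($invalidatedSiteIds), '?'));

        $stmt = $db->prepare(
            "SELECT idarchive, idsite, date1, date2, period FROM $table
             WHERE name LIKE 'done%' AND value = ? AND idsite IN ($in)"
        );
        $stmt->execute(array_merge([4 /* DONE_INVALIDATED */], $invalidatedSiteIds));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
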
@tsteur commented on November 7th 2019 Member
@diosmosis commented on November 14th 2019 Member

Basic implementation would be:

core:archive command:
- while true:
   - invalidated archives = getInvalidatedArchives()
   - if no invalidated archives, exit
   - for 1..# of parallel processes to create:
      - for each invalidated archive:
         - create archive w/ DONE_IN_PROGRESS for the archive if one is not already there (removing it from the invalidated archives list in the process; see the sketch below)
         - if successful, break, otherwise keep going
      - add archive to list of archive jobs to launch
   - initiate archiving for each DONE_IN_PROGRESS archive via CliMulti
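
The "create archive w/ DONE_IN_PROGRESS if one is not already there" step is the part that has to be race-safe across parallel processes. A minimal sketch, assuming DONE_IN_PROGRESS gets its own flag value (it is a proposed constant, not an existing one), could claim an archive with a single atomic UPDATE:

    // Sketch: claim one invalidated archive for this process by atomically
    // flipping its done flag; only one concurrent process sees rowCount() == 1.
    function tryClaimArchive(PDO $db, string $table, int $idArchive): bool
    {
        $stmt = $db->prepare(
            "UPDATE $table SET value = :inProgress
             WHERE idarchive = :id AND name LIKE 'done%' AND value = :invalidated"
        );
        $stmt->execute([
            'inProgress'  => 5, // proposed DONE_IN_PROGRESS value (assumption)
            'id'          => $idArchive,
            'invalidated' => 4, // DONE_INVALIDATED
        ]);
        return $stmt->rowCount() === 1;
    }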

getInvalidatedArchives():
   - if invalidated archives in cache, return it
   - otherwise, loop through every archive table and query for invalidated archives
   - set result in cache w/ TTL of 1 hour
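
getInvalidatedArchives() itself could then look like this sketch; the file-based cache is only a stand-in for whatever cache backend would really be used:

    // Sketch: return cached invalidated archives if fresh, otherwise scan
    // every archive table and cache the result for one hour.
    function getInvalidatedArchives(PDO $db, array $archiveTables): array
    {
        $cacheFile = sys_get_temp_dir() . '/invalidated-archives.cache';
        if (is_file($cacheFile) && time() - filemtime($cacheFile) < 3600) {
            return unserialize(file_get_contents($cacheFile));
        }

        $result = [];
        foreach ($archiveTables as $table) { // every archive_numeric_* table
            $stmt = $db->query(
                "SELECT '$table' AS tbl, idarchive, idsite, date1, date2, period
                 FROM $table WHERE name LIKE 'done%' AND value = 4 /* DONE_INVALIDATED */"
            );
            $result = array_merge($result, $stmt->fetchAll(PDO::FETCH_ASSOC));
        }

        file_put_contents($cacheFile, serialize($result));
        return $result;
    }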

thoughts @tsteur ?

@tsteur commented on November 15th 2019 Member

@diosmosis Wondering... when we create a done archive, could we at that moment directly delete any previously existing archive as part of the archiving process?

When we start an archive:

  • Set start flag (probably what you mean by DONE_IN_PROGRESS)
  • archive...
  • set done flag & delete the previous archive immediately (not sure it's doable, but it would maybe simplify things, and there would never be more than one archive for a date...? see the sketch after this list)
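
A sketch of that last step, assuming the done flag lives in the same archive_numeric_* table and DONE_OK = 1 (which matches ArchiveWriter):

    // Sketch: mark the new archive done and, in the same transaction, delete
    // every older archive covering the same site and period.
    function finalizeAndPurge(PDO $db, string $table, int $newIdArchive,
                              int $idSite, string $date1, string $date2, int $period): void
    {
        $db->beginTransaction();

        $db->prepare("UPDATE $table SET value = 1 /* DONE_OK */
                      WHERE idarchive = ? AND name LIKE 'done%'")
           ->execute([$newIdArchive]);

        $db->prepare("DELETE FROM $table
                      WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
                        AND idarchive <> ?")
           ->execute([$idSite, $date1, $date2, $period, $newIdArchive]);

        $db->commit();
    }

A real implementation would also have to delete the matching rows from the archive_blob_* table.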

When a tracking request comes in:

  • We invalidate that archive so we know next time we need to archive that period again

Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with last52 days but instead use &date=$invalidatedDate&period=$periodToArchive. This might bring some performance boost by really only archiving the needed data?
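
For illustration, the targeted request could be built like this; CoreAdminHome.archiveReports is the API method the archiver already triggers, though the exact parameter handling here is an assumption:

    // Sketch: request archiving of exactly the invalidated period instead of
    // a lastN range, e.g. "&period=week&date=2019-09-16".
    function buildArchiveQuery(int $idSite, string $invalidatedDate, string $periodToArchive): string
    {
        return '?module=API&method=CoreAdminHome.archiveReports'
             . '&idSite=' . $idSite
             . '&period=' . urlencode($periodToArchive)
             . '&date=' . urlencode($invalidatedDate);
    }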

It's probably not quite related to what you mentioned, though. The logic you suggested sounds good. It's basically a "queue" of what needs to be archived, and it's all based on invalidation. 👍

@diosmosis commented on November 15th 2019 Member

I think I understand what you're saying: basically, right after ArchiveWriter::finalizeArchive(), delete old archives, right? This could work, and I think it is related to this issue, since it would make it harder for a stale getInvalidatedArchives() data cache to cause problems. I suppose we wouldn't be able to get rid of ArchivePurger, though, since we'd still need to support the command to purge...

@diosmosis commented on November 15th 2019 Member

> Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with last52 days but instead use &date=$invalidatedDate&period=$periodToArchive. This might bring some performance boost by really only archiving the needed data?

This should be easily doable, and I think we have to do it this way for this refactor; lastN would just be wasteful.
