
Refactor cron archiving for simplicity #15117

Closed · diosmosis opened this issue Nov 6, 2019 · 5 comments · Fixed by #15499
@diosmosis (Member) commented:

Now that we invalidate archives when a visit is tracked, we have a couple more opportunities for refactoring:

  1. during invalidation, we can create a new archive with done flag value = DONE_INVALIDATED if no archives exist yet for the dates (see the sketch after the list below)
  2. in CronArchive, instead of pulling individual sites and checking whether those sites have had visits, just pull individual invalidated archives, set the done value to DONE_IN_PROGRESS, and initiate archiving for that one archive.

This should allow splitting the archiving of large websites across separate processes, and should simplify the code quite a bit.

Some things to keep in mind:

  • by default, we would look at the table for today's date and at everything in the invalidated sites list, instead of iterating over every table, though there could be an option to do the latter.
  • we should not even start archiving if the raw data for an invalidated archive has been deleted.
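Below is a minimal sketch of what the invalidation step (item 1 above) could look like. It assumes a table layout like Matomo's archive_numeric_* tables; the DONE_INVALIDATED value and the function name are illustrative, not the actual ArchiveWriter constants/API:

```php
<?php
// Hypothetical sketch: during invalidation, flip the done flag of an
// existing archive to DONE_INVALIDATED, or insert a placeholder "done" row
// if no archive exists yet, so CronArchive can later find it.

const DONE_INVALIDATED = 4; // assumed flag value; see ArchiveWriter for the real ones

function markArchiveInvalidated(PDO $db, string $table, int $idSite, int $period, string $date1, string $date2): void
{
    $where  = "idsite = ? AND period = ? AND date1 = ? AND date2 = ? AND name LIKE 'done%'";
    $params = [$idSite, $period, $date1, $date2];

    $select = $db->prepare("SELECT idarchive FROM $table WHERE $where");
    $select->execute($params);

    if ($select->fetchColumn() !== false) {
        // An archive already exists: mark it invalidated.
        $update = $db->prepare("UPDATE $table SET value = ? WHERE $where");
        $update->execute(array_merge([DONE_INVALIDATED], $params));
    } else {
        // No archive for these dates yet: create one with done = DONE_INVALIDATED.
        $next   = (int) $db->query("SELECT COALESCE(MAX(idarchive), 0) + 1 FROM $table")->fetchColumn();
        $insert = $db->prepare("INSERT INTO $table (idarchive, name, idsite, date1, date2, period, ts_archived, value)
                                VALUES (?, 'done', ?, ?, ?, ?, NOW(), ?)");
        $insert->execute([$next, $idSite, $date1, $date2, $period, DONE_INVALIDATED]);
    }
}
```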
diosmosis added the c: Platform label (for Matomo platform changes that aren't impacting any of our APIs but improve the core itself) on Nov 6, 2019
diosmosis added this to the 4.0.0 milestone on Nov 6, 2019
diosmosis self-assigned this on Nov 12, 2019
@diosmosis (Member, Author) commented:

Basic implementation would be:

core:archive command:

```
while true:
    invalidated archives = getInvalidatedArchives()
    if there are no invalidated archives, exit
    for 1..(number of parallel processes to create):
        for each invalidated archive:
            create an archive w/ DONE_IN_PROGRESS for it, if one is not already there
            (removing it from the invalidated archives list in the process)
            if successful, break; otherwise keep going
        add the archive to the list of archive jobs to launch
    initiate archiving for each DONE_IN_PROGRESS archive via CliMulti
```

getInvalidatedArchives():

```
if the invalidated archives are in the cache, return them
otherwise, loop through every archive table and query for invalidated archives
set the result in the cache w/ a TTL of 1 hour
```
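A rough PHP sketch of that loop, to make the claiming step concrete. The DONE_* values, the single-table query, and the job launcher are assumptions (Matomo would use CliMulti and its real ArchiveWriter constants); the key idea is the conditional UPDATE, which guarantees two parallel core:archive processes never claim the same invalidated archive:

```php
<?php
// Assumed flag values, for illustration only.
const DONE_INVALIDATED = 4;
const DONE_IN_PROGRESS = 5;

function getInvalidatedArchives(PDO $db, string $table): array
{
    // The real implementation would loop over all archive tables and cache
    // the result with a 1 hour TTL; a single-table query keeps this short.
    $stmt = $db->prepare("SELECT idarchive FROM $table WHERE name LIKE 'done%' AND value = ?");
    $stmt->execute([DONE_INVALIDATED]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

function claimArchive(PDO $db, string $table, int $idArchive): bool
{
    // Succeeds only if the row is still DONE_INVALIDATED, so exactly one
    // process wins the claim even when several run in parallel.
    $stmt = $db->prepare("UPDATE $table SET value = ?
                          WHERE idarchive = ? AND name LIKE 'done%' AND value = ?");
    $stmt->execute([DONE_IN_PROGRESS, $idArchive, DONE_INVALIDATED]);
    return $stmt->rowCount() === 1;
}

function runArchiveLoop(PDO $db, string $table, int $parallelProcesses): void
{
    while (true) {
        $invalidated = getInvalidatedArchives($db, $table);
        if (empty($invalidated)) {
            return; // nothing left to archive, exit
        }

        $jobs = [];
        foreach ($invalidated as $idArchive) {
            if (count($jobs) >= $parallelProcesses) {
                break;
            }
            if (claimArchive($db, $table, (int) $idArchive)) {
                $jobs[] = (int) $idArchive; // claimed: queue a job for it
            }
        }

        // In Matomo this would hand the claimed archives to CliMulti to
        // archive in parallel; here it is left as a placeholder.
        foreach ($jobs as $idArchive) {
            printf("would launch archiving for idarchive=%d\n", $idArchive);
        }
    }
}
```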

thoughts @tsteur ?

@tsteur (Member) commented Nov 15, 2019:

@diosmosis Wondering... when we write a done archive, could we, at that moment, immediately delete any previously existing archive for the same date as part of the archive process?

When we start an archive:

  • set a start flag (probably what you mean with DONE_IN_PROGRESS)
  • archive...
  • set the done flag & delete the previous archive immediately (not sure it's doable, but it would maybe simplify things, and there would never be more than one archive for a date...?)

When a tracking request comes in:

  • We invalidate that archive so we know next time we need to archive that period again

Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with a last52-days request, but to instead use &date=$invalidatedDate&period=$periodToArchive. This might bring a performance boost, since it really only archives the needed data.
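To make that concrete, here is a hypothetical sketch of building such a request. The parameter names mimic the archiving API requests that cron archiving issues, but treat the exact shape as an assumption:

```php
<?php
// Illustrative only: archive exactly the invalidated date rather than lastN.

function buildArchiveRequest(int $idSite, string $period, string $invalidatedDate): string
{
    return '?' . http_build_query([
        'module'  => 'API',
        'method'  => 'API.get',       // assumed entry point
        'idSite'  => $idSite,
        'period'  => $period,          // e.g. 'day', 'week', 'month'
        'date'    => $invalidatedDate, // just the invalidated date, not last52
        'format'  => 'json',
        'trigger' => 'archivephp',
    ]);
}

// An archive invalidated 50 days ago re-archives only that one day:
echo buildArchiveRequest(1, 'day', date('Y-m-d', strtotime('-50 days')));
```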

It's probably not quite related to what you mentioned, but the logic you suggested sounds good. It's basically a "queue" of what needs to be archived, and it's all based on invalidation. 👍

@diosmosis (Member, Author) commented:

I think I understand what you're saying: basically, right after ArchiveWriter::finalizeArchive(), delete the old archives, right? This could work, and I think it is related to this issue, since it would make it harder for stale getInvalidatedArchives() cache data to cause problems. I suppose we wouldn't be able to get rid of ArchivePurger entirely, though, since we'd still need to support the command to purge...
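For illustration, a minimal sketch of that "purge right after finalize" idea, using hypothetical names rather than the real ArchiveWriter/ArchivePurger APIs:

```php
<?php
// After the new archive's done flag is written, drop every other archive for
// the same site/period/date, so at most one archive per date remains. (The
// same delete would need to run against the matching archive_blob_* table.)

function purgeOlderArchives(PDO $db, string $table, int $newIdArchive, int $idSite, int $period, string $date1, string $date2): void
{
    $stmt = $db->prepare("DELETE FROM $table
                          WHERE idsite = ? AND period = ? AND date1 = ? AND date2 = ?
                            AND idarchive <> ?");
    $stmt->execute([$idSite, $period, $date1, $date2, $newIdArchive]);
}
```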

@diosmosis (Member, Author) commented:

> Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with a last52-days request, but to instead use &date=$invalidatedDate&period=$periodToArchive. This might bring a performance boost, since it really only archives the needed data.

This should be easily doable, and I think we have to do it this way for this refactor to work. lastN would just be wasteful.
