@diosmosis opened this Issue on November 6th 2019 Member

Now that we invalidate archives when a visit is tracked, we have a couple more opportunities for refactoring:

  1. during invalidation, we can create a new archive w/ done flag value = DONE_INVALIDATED if no archives exist yet for the dates (see the sketch after this list)
  2. in CronArchive, instead of pulling individual sites and checking if there have been visits for those sites, just pull individual invalidated archives, set the done value to DONE_IN_PROGRESS, and initiate archiving for that one archive.
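
For idea 1, a minimal sketch of the invalidation side, assuming a plain PDO handle and the archive_numeric_* row layout (idarchive, idsite, date1, date2, period, name, value, ts_archived); DONE_INVALIDATED = 4 matches the existing ArchiveWriter flag value, and allocateNewArchiveId() is a hypothetical helper:

    // Sketch: during invalidation, insert a placeholder "done" row with
    // DONE_INVALIDATED when no archive exists yet for the site/period.
    const DONE_INVALIDATED = 4; // existing ArchiveWriter flag value

    function insertInvalidatedPlaceholder(PDO $db, string $table, int $idSite,
                                          string $date1, string $date2, int $period): void
    {
        $stmt = $db->prepare(
            "SELECT idarchive FROM $table
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
               AND name LIKE 'done%' LIMIT 1"
        );
        $stmt->execute([$idSite, $date1, $date2, $period]);

        if ($stmt->fetchColumn() === false) {
            // no archive exists for this period yet, so create the placeholder
            $db->prepare(
                "INSERT INTO $table (idarchive, idsite, date1, date2, period, name, value, ts_archived)
                 VALUES (?, ?, ?, ?, ?, 'done', ?, NOW())"
            )->execute([
                allocateNewArchiveId($db, $table), // hypothetical helper
                $idSite, $date1, $date2, $period, DONE_INVALIDATED,
            ]);
        }
    }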

This should allow splitting archiving of large websites across separate processes and simplify the code quite a bit.

Some things to keep in mind:

  • by default we would look only at the archive table covering today's date, and only at sites in the invalidated sites list, instead of iterating over every table (as sketched below), though there could be an option to do the full scan.
  • we should not even start archiving if raw data has been deleted for an invalidated archive.
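
For the first bullet, the default lookup might look roughly like this; the function name and the way the invalidated site list is passed in are assumptions:

    // Sketch: by default, scan only the archive table covering today's date,
    // restricted to the sites in the invalidated sites list.
    function getDefaultInvalidatedArchives(PDO $db, array $invalidatedSiteIds): array
    {
        if (empty($invalidatedSiteIds)) {
            return [];
        }
        // archive tables are partitioned by month, e.g. archive_numeric_2019_11
        $table = 'archive_numeric_' . date('Y_m');
        $in    = implode(',', array_fill(0, count($invalidatedSiteIds), '?'));

        $stmt = $db->prepare(
            "SELECT idarchive, idsite, date1, date2, period FROM $table
             WHERE name LIKE 'done%' AND value = ? AND idsite IN ($in)"
        );
        $stmt->execute(array_merge([4 /* DONE_INVALIDATED */], $invalidatedSiteIds));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
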
@tsteur commented on November 7th 2019 Member
@diosmosis commented on November 14th 2019 Member

Basic implementation would be:

core:archive command:
- while true:
   - invalidated archives = getInvalidatedArchives()
   - if no invalidated archives, exit
   - for 1..# of parallel processes to create:
      - for each invalidated archive:
         - create archive w/ DONE_IN_PROGRESS for the archive if one is not already there (removing it from the invalidated archives list in the process; see the sketch below)
         - if successful, break, otherwise keep going
      - add archive to list of archive jobs to launch
   - initiate archiving for each DONE_IN_PROGRESS archive via CliMulti
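
The "create archive w/ DONE_IN_PROGRESS if one is not already there" step is the part that has to be race-safe across parallel processes. A minimal sketch, assuming DONE_IN_PROGRESS gets its own flag value (it is a proposed constant, not an existing one), could claim an archive with a single atomic UPDATE:

    // Sketch: claim one invalidated archive for this process by atomically
    // flipping its done flag; only one concurrent process sees rowCount() == 1.
    function tryClaimArchive(PDO $db, string $table, int $idArchive): bool
    {
        $stmt = $db->prepare(
            "UPDATE $table SET value = :inProgress
             WHERE idarchive = :id AND name LIKE 'done%' AND value = :invalidated"
        );
        $stmt->execute([
            'inProgress'  => 5, // proposed DONE_IN_PROGRESS value (assumption)
            'id'          => $idArchive,
            'invalidated' => 4, // DONE_INVALIDATED
        ]);
        return $stmt->rowCount() === 1;
    }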

getInvalidatedArchives():
   - if invalidated archives in cache, return it
   - otherwise, loop through every archive table and query for invalidated archives
   - set result in cache w/ TTL of 1 hour
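
getInvalidatedArchives() itself could then look like this sketch; the file-based cache is only a stand-in for whatever cache backend would really be used:

    // Sketch: return cached invalidated archives if fresh, otherwise scan
    // every archive table and cache the result for one hour.
    function getInvalidatedArchives(PDO $db, array $archiveTables): array
    {
        $cacheFile = sys_get_temp_dir() . '/invalidated-archives.cache';
        if (is_file($cacheFile) && time() - filemtime($cacheFile) < 3600) {
            return unserialize(file_get_contents($cacheFile));
        }

        $result = [];
        foreach ($archiveTables as $table) { // every archive_numeric_* table
            $stmt = $db->query(
                "SELECT '$table' AS tbl, idarchive, idsite, date1, date2, period
                 FROM $table WHERE name LIKE 'done%' AND value = 4 /* DONE_INVALIDATED */"
            );
            $result = array_merge($result, $stmt->fetchAll(PDO::FETCH_ASSOC));
        }

        file_put_contents($cacheFile, serialize($result));
        return $result;
    }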

thoughts @tsteur ?

@tsteur commented on November 15th 2019 Member

@diosmosis Wondering... when we create a done archive, could we at that moment directly delete any previously existing archive as part of the archiving process?

When we start an archive:

  • Set start flag (probably what you mean by DONE_IN_PROGRESS)
  • archive...
  • set done flag & delete the previous archive immediately (not sure it's doable, but it would maybe simplify things, and there would never be more than one archive for a date...? see the sketch after this list)
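
A sketch of that last step, assuming the done flag lives in the same archive_numeric_* table and DONE_OK = 1 (which matches ArchiveWriter):

    // Sketch: mark the new archive done and, in the same transaction, delete
    // every older archive covering the same site and period.
    function finalizeAndPurge(PDO $db, string $table, int $newIdArchive,
                              int $idSite, string $date1, string $date2, int $period): void
    {
        $db->beginTransaction();

        $db->prepare("UPDATE $table SET value = 1 /* DONE_OK */
                      WHERE idarchive = ? AND name LIKE 'done%'")
           ->execute([$newIdArchive]);

        $db->prepare("DELETE FROM $table
                      WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
                        AND idarchive <> ?")
           ->execute([$idSite, $date1, $date2, $period, $newIdArchive]);

        $db->commit();
    }

A real implementation would also have to delete the matching rows from the archive_blob_* table.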

When a tracking request comes in:

  • We invalidate that archive so we know next time we need to archive that period again

Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with last52 days but instead use &date=$invalidatedDate&period=$periodToArchive. This might bring some performance boost by really only archiving the needed data?
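
For illustration, the targeted request could be built like this; CoreAdminHome.archiveReports is the API method the archiver already triggers, though the exact parameter handling here is an assumption:

    // Sketch: request archiving of exactly the invalidated period instead of
    // a lastN range, e.g. "&period=week&date=2019-09-16".
    function buildArchiveQuery(int $idSite, string $invalidatedDate, string $periodToArchive): string
    {
        return '?module=API&method=CoreAdminHome.archiveReports'
             . '&idSite=' . $idSite
             . '&period=' . urlencode($periodToArchive)
             . '&date=' . urlencode($invalidatedDate);
    }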

It's probably not quite related to what you mentioned, though. The logic you suggested sounds good. It's basically a "queue" of what needs to be archived, and it's all based on invalidation. 👍

@diosmosis commented on November 15th 2019 Member

I think I understand what you're saying: basically, right after ArchiveWriter::finalizeArchive(), delete old archives, right? This could work, and I think it is related to this issue, since it would make it harder for a stale getInvalidatedArchives() data cache to cause problems. I suppose we wouldn't be able to get rid of ArchivePurger, though, since we'd still need to support the command to purge...

@diosmosis commented on November 15th 2019 Member

> Ideally, when an archive from 50 days ago is invalidated, the archiver would be smart enough not to launch with last52 days but instead use &date=$invalidatedDate&period=$periodToArchive. This might bring some performance boost by really only archiving the needed data?

This should be easily doable, and I think we have to do it this way for this refactor; lastN would just be wasteful.
