Now that we invalidate archives when a visit is tracked, we have a couple more opportunities for refactoring:
This should allow splitting archiving of large websites across separate processes and simplify the code quite a bit.
Some things to keep in mind:
Basic implementation would be:
`core:archive` command:
- while true:
  - invalidated archives = getInvalidatedArchives()
  - if no invalidated archives, exit
  - for 1..# of parallel processes to create:
    - for each invalidated archive:
      - create archive w/ DONE_IN_PROGRESS for archive if one not already there (removing from invalidated archives in the process)
      - if successful, break, otherwise keep going
    - add archive to list of archive jobs to launch
  - initiate archiving for each DONE_IN_PROGRESS archive via CliMulti

`getInvalidatedArchives()`:
- if invalidated archives in cache, return it
- otherwise, loop through every archive and query for invalidated archives
- set result in cache w/ TTL of 1 hour
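The loop above can be sketched roughly as follows. This is only an illustration of the claiming/queue idea, not Matomo code: `claim_in_progress`, `launch_jobs`, and the in-memory cache are hypothetical stand-ins for the DONE_IN_PROGRESS marker write, the CliMulti launch, and the 1-hour-TTL cache.

```python
import time

CACHE_TTL = 3600  # 1 hour, per the sketch above

# Hypothetical stand-in for the getInvalidatedArchives() cache.
_cache = {"value": None, "set_at": 0.0}

def get_invalidated_archives(query_all_archives):
    """Return cached invalidated archives, re-querying after the TTL expires."""
    now = time.time()
    if _cache["value"] is not None and now - _cache["set_at"] < CACHE_TTL:
        return _cache["value"]
    result = query_all_archives()  # scan archive tables for invalidated rows
    _cache["value"] = result
    _cache["set_at"] = now
    return result

def run_archive_loop(query_all_archives, claim_in_progress, launch_jobs,
                     parallelism=4):
    """Claim up to `parallelism` invalidated archives at a time and launch
    them, until no invalidated archives remain."""
    while True:
        invalidated = get_invalidated_archives(query_all_archives)
        if not invalidated:
            return
        jobs = []
        for _ in range(parallelism):
            for archive in list(invalidated):
                # claim_in_progress would atomically write a DONE_IN_PROGRESS
                # marker; it fails if another process already claimed this one.
                if claim_in_progress(archive):
                    invalidated.remove(archive)
                    jobs.append(archive)
                    break
        if jobs:
            launch_jobs(jobs)  # e.g. one CliMulti process per job
```

Because the claim step is what removes an archive from the shared queue, several `core:archive` processes could run this loop concurrently without archiving the same period twice.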
thoughts @tsteur ?
@diosmosis Wondering... when we create a `done` archive, could we at that moment directly delete any previously existing archive, during the archive process itself?
When we start an archive
When a tracking request comes in:
Ideally, when an archive from 50 days ago is invalidated, the archiver should be smart enough not to launch with e.g. `last52` days but instead use `&date=$invalidatedDate&period=$periodToArchive`. This might bring a real performance boost, since we would only archive the data that is actually needed.
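A minimal sketch of that idea, assuming a hypothetical helper that maps one invalidated archive to targeted request parameters (following Matomo's `&date=...&period=...` query style):

```python
from datetime import date

def build_archive_query(invalidated_date: date, period: str) -> str:
    """Build query parameters that target a single invalidated period,
    rather than a broad lastN range. Illustrative helper, not Matomo API."""
    return f"&date={invalidated_date.isoformat()}&period={period}"

# Archiving only the invalidated day, instead of e.g. last52:
build_archive_query(date(2020, 1, 15), "day")  # "&date=2020-01-15&period=day"
```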
It's probably not quite related to what you mentioned, though. The logic you suggested sounds good: it's basically a "queue" of what needs to be archived, and it's all based on invalidation. 👍
I think I understand what you're saying: basically, right after `ArchiveWriter::finalizeArchive()`, delete old archives, right? This could work, and I think it is related to this issue, since it would make it harder for stale `getInvalidatedArchives()` cache data to cause problems. But I suppose we wouldn't be able to get rid of `ArchivePurger`, since we'd still need to support the command to purge...
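A rough sketch of that purge-on-finalize idea. Everything here is hypothetical (the record shapes and the function name are illustrative, not Matomo's schema): after a new archive is finalized, older archives covering the same site/period/date are dropped immediately.

```python
def finalize_and_purge(archives, new_archive):
    """Append the newly finalized archive and drop any older archives
    covering the same (site, period, date) -- the ones it supersedes.
    Illustrative sketch only; real code would delete rows in the DB."""
    key = (new_archive["site"], new_archive["period"], new_archive["date"])
    kept = [a for a in archives
            if (a["site"], a["period"], a["date"]) != key]
    kept.append(new_archive)
    return kept
```

The design question this leaves open is the one raised above: an immediate purge covers the common path, but a standalone `ArchivePurger` would still be needed for the explicit purge command.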
> Ideally, when an archive from 50 days ago is invalidated, the archiver should be smart enough not to launch with e.g. `last52` days but instead use `&date=$invalidatedDate&period=$periodToArchive`. This might bring a real performance boost, since we would only archive the data that is actually needed.
This should be easily doable, and I think we have to do it this way for this refactor; `lastN` would just be wasteful.