@diosmosis opened this Pull Request on February 1st 2020 Member

New workflow

The current workflow for core:archive is to check whether there are visits for a site, then launch archiving w/ lastN days/weeks/months/years. The new workflow is: invalidate archives, run core:archive, and let it process the queue.

Relevant changes

  • Archive invalidation now creates archive rows w/ NULL ts_archived if there isn't an existing archive to set the done value for. This allows core:archive to notice and process the archive, even though one never existed beforehand.
  • core:archive now selects the next invalidated archive from the archive tables and marks it as in progress, so we don't have to look for already-running commands to know whether an archive is being processed (see the sketch below this list).
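
Roughly, the claiming step works like the sketch below (simplified, not the actual code; the DONE_* values, table name, and function shape are illustrative):

```php
<?php
// Hypothetical sketch of how core:archive claims the next invalidated
// archive. In the archive schema the done flag lives in the `value` column
// of rows whose `name` starts with "done"; the numeric values here are
// placeholders, not the real constants.
const DONE_INVALIDATED = 4;  // archive exists but is stale
const DONE_IN_PROGRESS = 6;  // claimed by a core:archive process

function claimNextInvalidatedArchive(PDO $db, string $table): ?array
{
    // Find one invalidated archive. Rows with NULL ts_archived are the
    // placeholder rows created at invalidation time for archives that
    // never existed before.
    $row = $db->query(
        "SELECT idarchive, idsite, period, date1, date2
           FROM $table
          WHERE name LIKE 'done%' AND value = " . DONE_INVALIDATED . "
          ORDER BY date1 ASC
          LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);

    if ($row === false) {
        return null; // nothing left to process
    }

    // Mark it in progress. Repeating the value check in the WHERE clause
    // makes the claim atomic, so two processes can't take the same archive.
    $update = $db->prepare(
        "UPDATE $table SET value = " . DONE_IN_PROGRESS . "
          WHERE idarchive = ? AND name LIKE 'done%' AND value = " . DONE_INVALIDATED
    );
    $update->execute([$row['idarchive']]);

    return $update->rowCount() === 1 ? $row : null; // null = lost the race
}
```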

Notes

  • there are some bug fixes to the archive invalidation code here: some archives that should have been invalidated were not, and some archives were invalidated that did not need to be.
  • there is an optimization in Loader.php that vastly improves archiving performance for sites with no visits (as a demonstration, all the system test jobs are much faster now; the ArchiveCronTest job, e.g., improved from 21 minutes to 6; see the sketch after this list).
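
To sketch the idea (this is not the actual Loader.php change; the function shape is made up, though log_visit and its columns are real):

```php
<?php
// Illustrative only: before launching the full archiving pipeline for a
// site/period, check whether any visit exists in the range and bail out
// early if not, instead of running every plugin archiver just to produce
// zeros.
function siteHasVisits(PDO $db, int $idSite, string $date1, string $date2): bool
{
    $stmt = $db->prepare(
        'SELECT 1 FROM log_visit
          WHERE idsite = ?
            AND visit_last_action_time BETWEEN ? AND ?
          LIMIT 1'
    );
    $stmt->execute([$idSite, $date1 . ' 00:00:00', $date2 . ' 23:59:59']);

    return $stmt->fetchColumn() !== false;
}

// Usage sketch: if (!siteHasVisits($db, $idSite, '2020-01-01', '2020-01-31')) skip the site.
```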

Fixes #15117

@diosmosis commented on March 30th 2020 Member

@tsteur this PR is ready for an initial review. Tests might still need some fixing (I haven't checked the last build since I dealt w/ some TODOs). There are some TODOs in the code; those are things I wasn't sure how to solve completely myself, so I would appreciate your opinion.

@tsteur commented on April 1st 2020 Member

Regarding the locks: If they affect browser archiving, it would be great to know what that means for it. If there are two requests around the same time, will they basically wait for each other to finish, etc.? Can a request run into a lock wait timeout (like it only waits 30s for a lock but the other request needs longer) and then return no result or something?

@diosmosis commented on April 1st 2020 Member

@tsteur

I started looking through the code but it's a bit hard to understand everything. Would it be possible in the PR itself to provide a bit more insight into how things work now (eg what happens when a tracking request is recorded for today or a previous day, how are things invalidated, and how does the archiver then know this will be archived)?

I can try, but it's a fairly complex PR... I'll number the comments in the order they should be read and walk you through the whole process as it exists in this PR, hopefully it will help?

It would also be interesting to get some insights on the locking (since locks are prone to many issues/problems), on what it does and how it works.

The locks are mainly a secondary source of thread safety: if an archive gets set to DONE_IN_PROGRESS but the job fails, the lock's expiry lets us know and we can try to re-archive it.
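
As a rough sketch of that safety net (assuming a lock table shaped like (`key`, expiry_time); the names are illustrative, not the exact schema):

```php
<?php
// If this returns false for the lock belonging to a DONE_IN_PROGRESS
// archive, the job that claimed it has stopped refreshing its lock, so we
// can treat the archive as failed and re-archive it.
function isArchiveStillRunning(PDO $db, string $lockKey): bool
{
    $stmt = $db->prepare('SELECT expiry_time FROM locks WHERE `key` = ?');
    $stmt->execute([$lockKey]);
    $expiry = $stmt->fetchColumn();

    // No lock row at all, or the expiry has passed: the job is dead.
    return $expiry !== false && (int) $expiry > time();
}
```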

Also curious to know if this PR already aims to fix #11974 as part of this refactoring, and/or whether this new logic will make it possible, or whether there would maybe be issues with that?

It doesn't solve that issue; we don't re-archive plugin-specific archives in core:archive, just trigger whole archives. We'd have to do the following to make that work:

  • Allow invalidating plugin-specific archives.
  • Pick up plugin-specific archives and be able to trigger them generically through a climulti:request rather than just API.get (and somehow know which ones are dependent archives that don't need to be re-archived, e.g., the VisitorInterest/Goals segments).

@diosmosis commented on April 2nd 2020 Member

Regarding the locks: If they affect browser archiving, it would be great to know what that means for it. If there are two requests around the same time, will they basically wait for each other to finish, etc.? Can a request run into a lock wait timeout (like it only waits 30s for a lock but the other request needs longer) and then return no result or something?

Locks are only there for core:archive to check if a DONE_IN_PROGRESS archive has probably failed or something. This PR shouldn't change browser archiving at all.

@diosmosis commented on April 2nd 2020 Member

@tsteur left some comments that should be read in order, hopefully that will be enough to give you some understanding.

@diosmosis commented on April 3rd 2020 Member

BTW do we need to check whether the segment is configured to be archived by the cron vs in real time? And if it is supposed to be archived in real time, do we need to skip it and look for the next one?

This is done in SegmentArchivingRequestUrlProvider.

@diosmosis commented on April 3rd 2020 Member

Would this be avoided through the lock knowing archiving is still in progress?

It should, yes. Though in this case the invalidation wouldn't have an effect; not sure if we want it to be scheduled for re-archiving again? Probably not... We could also not invalidate archives set to DONE_IN_PROGRESS.

@diosmosis commented on April 3rd 2020 Member

I haven't checked the history but the idea of this flag is to bypass the queue and force this core:archive process to always go through all sites one by one and archive them. How would we do this if the option is removed?

That's the default behavior. It will look for archives no matter the site.

@mattab commented on April 7th 2020 Member

I haven't checked the history but the idea of this flag is to bypass the queue and force this core:archive process to always go through all sites one by one and archive them. How would we do this if the option is removed?
That's the default behavior. It will look for archives no matter the site.

Does that mean that if I run the core:archive script, I'm always guaranteed it will archive all sites in this one run? (When using force-all-sites we want this process to archive all sites even if multiple other processes may be running and going through the archives already, or if some of the sites were going to be skipped because they have no data recently...)

@diosmosis commented on April 15th 2020 Member

Does that mean that if I run the core:archive script, I'm always guaranteed it will archive all sites in this one run?

Yes, unless you run w/ --force-idsites (or whatever the option is).

@diosmosis commented on April 15th 2020 Member

@tsteur should be ready for another review.

@diosmosis commented on April 21st 2020 Member

One of the things I'm most worried about is the many SELECT xyz from archive_... queries we do.

This is still an issue. The conclusion you came to (IIRC) was to just see what happens on demo. I made a suggestion to add an invalidation_time column so we could index on it; otherwise we could index on the first 4 chars of name. Not sure which would be better?
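
The two options, as DDL against one archive table (sketch only; invalidation_time is the proposed column, not something that exists today, and the connection details are placeholders):

```php
<?php
// Sketch of the two index options.
$db = new PDO('mysql:host=localhost;dbname=matomo', 'user', 'pass');

// Option 1: a dedicated column we can index and query directly.
$db->exec('ALTER TABLE archive_numeric_2020_04
             ADD COLUMN invalidation_time DATETIME NULL,
             ADD INDEX index_invalidation_time (invalidation_time)');

// Option 2: a prefix index on the first 4 chars of `name`, enough for the
// "done%" filter the invalidation queries use.
$db->exec('ALTER TABLE archive_numeric_2020_04
             ADD INDEX index_name_prefix (name(4))');
```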

It might also be nice to have two tables, one for archive status and another for archive data, rather than one for blob data and one for numeric data plus archive status data. But that would be more work.

do I understand this correctly that we now first finish archiving all invalidated archives of one site before moving on to the next site?

Yes. We could force processes to not process an in-progress site by using a Lock that is re-expired, say, once every hour or something.
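
Something like this, roughly (illustrative table and names; a real version would also track an owner token so the holder can refresh its own lock, which is omitted here):

```php
<?php
// Sketch of a re-expiring per-site lock: other core:archive processes skip
// a site while its lock is live; a crashed process's lock simply expires.
// Assumes a site_archive_locks table with a UNIQUE key on idsite.
function tryAcquireSiteLock(PDO $db, int $idSite, int $ttlSeconds = 3600): bool
{
    $stmt = $db->prepare(
        'INSERT INTO site_archive_locks (idsite, expiry) VALUES (?, ?)
         ON DUPLICATE KEY UPDATE
             expiry = IF(expiry < UNIX_TIMESTAMP(), VALUES(expiry), expiry)'
    );
    $stmt->execute([$idSite, time() + $ttlSeconds]);

    // rowCount: 1 = fresh insert, 2 = took over an expired lock,
    // 0 = someone else's lock is still live.
    return $stmt->rowCount() > 0;
}
```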

@diosmosis commented on April 22nd 2020 Member

@tsteur fyi left a comment ^

@tsteur commented on April 23rd 2020 Member

This is still an issue. The conclusion you came to (IIRC) was to just see what happens on demo. I made a suggestion to add an invalidation_time column so we could index on it; otherwise we could index on the first 4 chars of name. Not sure which would be better?

The only problem is that on demo we can't really know how it impacts IO (which we can on AWS)... We'd need to see how to test this. Maybe we'll find an alternative solution as part of the plugin-specific archives. I suppose an index, or a new column with an index, is otherwise quickly added.

do I understand this correctly that we now first finish archiving all invalidated archives of one site before moving on to the next site?

Yes. We could force processes to not process an in-progress site by using a Lock that is re-expired, say, once every hour or something.

Ideally we wouldn't want to use a lock if at all possible, just because it might create random issues. If we could use the original SharedSiteIds behaviour, I think that would be awesome. It worked quite well, and it also makes sure that if some other archiver was working on a given idSite in the last hour, a concurrently running archiver won't do the same work; it won't look at any of these archives and will completely skip the site.

@tsteur commented on April 23rd 2020 Member

BTW @diosmosis, with this solution do we still need the concurrent lock which we re-expire from time to time, etc.? I see these queries to get the lock quite often when debugging, and if we didn't need it, that would be awesome (I think last time you mentioned it would be needed for this refactoring, but I don't remember why or what it does).

@diosmosis commented on April 23rd 2020 Member

@tsteur it's needed to know when an archiving job was killed or stopped for some other reason. In that case the archive row is set to DONE_IN_PROGRESS but never gets finalized, so we use the lock plus its expiration time to know if we can restart it.

Edit: it might be possible to remove the need if we have a start or expiration time in the archive and update that instead. Though that might just move the I/O from the lock table to the archive table.

Edit 2: actually, maybe we can fix this by adding the invalidation table, can you see my comment in the other issue?

@diosmosis commented on April 28th 2020 Member

@tsteur ready for another review; there is one test w/ a random failure (fails due to day change) and a couple of minor TODOs, but it can be reviewed.

@diosmosis commented on May 6th 2020 Member

@tsteur applied review feedback + answered questions + created another PR for the error you encountered: https://github.com/matomo-org/matomo/pull/15919

@tsteur commented on May 6th 2020 Member

Works for me now @diosmosis, and I reckon we can merge if tests pass (currently they are all failing).

BTW... I was thinking with this new table if we should change the logic in https://github.com/matomo-org/matomo/blob/3.13.5/core/Archive/ArchiveInvalidator.php#L76-L91 ?

In Tracker/Visit.php we mark an archive to be invalidated if the tracking request is for a previous day. In the tracking request it then checks whether the tracker cache has already registered an invalidation, by querying the tracker cache. This tracker cache is populated from the getAllRememberToInvalidateArchivedReportsLater method, and I'm wondering if this can all be replaced now to directly query the invalidation table, so there's only one "source of truth" and we don't have a similar queue again in the options table. rememberToInvalidateArchivedReportsLater could then also directly insert an entry into the archive invalidation table instead of the options table?
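
Roughly what I mean, sketched (the table/column names are guesses, not the final archive_invalidations schema):

```php
<?php
// Both the "remember" write and the tracker-cache read hit the same
// invalidations table, so there is a single source of truth. Assumes a
// UNIQUE key on (idsite, date1, period) so repeated tracking requests for
// the same day don't pile up duplicate entries.
function rememberToInvalidateArchivedReportsLater(PDO $db, int $idSite, string $date): void
{
    $stmt = $db->prepare(
        'INSERT IGNORE INTO archive_invalidations (idsite, date1, period, status)
         VALUES (?, ?, 1, 0)' // period 1 = day, status 0 = queued
    );
    $stmt->execute([$idSite, $date]);
}

function getAllRememberToInvalidateArchivedReportsLater(PDO $db): array
{
    // Used when populating the tracker cache, instead of reading options.
    return $db->query(
        'SELECT idsite, date1 FROM archive_invalidations WHERE status = 0'
    )->fetchAll(PDO::FETCH_ASSOC);
}
```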

The only thing I can think of is that when many archive invalidations are queued, eg because of a changed segment, the tracker cache might get a bit big, but this shouldn't normally be the case.

Not sure it's clear what I mean? Any thoughts?

@diosmosis commented on May 6th 2020 Member

Not sure it's clear what I mean? Any thoughts?

If you mean querying archive_invalidations in the tracker for rememberToInvalidateArchivedReportsLater, then it might make archive_invalidations a bottleneck? I guess it's hard to say how it would behave... the invalidations table will generally be smaller though (except when core:archive isn't called), so maybe it would be better... could try it in another PR?

@tsteur commented on May 7th 2020 Member

If you mean querying archive_invalidations in the tracker for rememberToInvalidateArchivedReportsLater, then it might make archive_invalidations a bottleneck

I meant calling it when populating the tracker cache... and in tracking mode we would check the tracker cache (not the DB directly). Pretty much like it works now, just that we don't look at the options table anymore when populating the cache or when remembering to invalidate an archive. Bottleneck-wise it should be the same in the end, whether we set the options table or add an entry to the archive invalidations table.

@tsteur commented on May 7th 2020 Member

Actually @diosmosis, I realise this might not be best to do, since we'd also need to directly invalidate the week, month, and year, creating several invalidation entries per tracking request... so maybe we keep the options table for now... and the PR should be good to merge once tests pass.

This Pull Request was closed on May 7th 2020