Add extra check for queue consumer to prevent duplicates and add duplicate check in tests. #16406
Conversation
@@ -201,6 +201,13 @@ public function getNextArchivesToProcess()
        continue;
    }

    if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
@diosmosis I now wonder if things need to be more complicated... like do we need to know the "Archive start time" of that archive that is already in progress and only cancel invalidations that were added to the invalidations table before the start time?
Eg
- add invalidation 1 for Month Sep 2020 at 7AM
- add invalidation 2 for Month Sep 2020 at 8AM
- add invalidation 3 for Month Sep 2020 at 9AM
- start archiving Month Sep 2020 at 09:30 AM
- add invalidation 4 for Month Sep 2020 at 10AM
- add invalidation 5 for Month Sep 2020 at 11AM
- archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
- Meaning 1,2,3 would be removed, and 4 and 5 would be just skipped here?
Not sure if that's possible or if I'm missing something.
This way we make sure to start an archive again after that one finishes and include all data.
This is what would currently happen:
- add invalidation 1 for Month Sep 2020 at 7AM
- add invalidation 2 for Month Sep 2020 at 8AM
- add invalidation 3 for Month Sep 2020 at 9AM
- start archiving Month Sep 2020 at 09:30 AM
- core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and ignores it
- core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and ignores it
- something adds invalidation 4 for Month Sep 2020 at 10AM
- something adds invalidation 5 for Month Sep 2020 at 11AM
- archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
- core:archive pulls the next archive, sees invalidation 2. it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
- same thing happens for 2-5
- next hourly cron, invalidation 2 is picked up. and the process repeats.
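The "too soon after the last archive" skip described above can be modeled roughly like this (a simplified Python sketch, not Matomo's actual PHP; the real check against ts_archived lives in QueueConsumer::usableArchiveExists()):

```python
import datetime

def usable_archive_exists(ts_archived, now, max_age):
    """Return True when a recent enough archive already exists, in which
    case the invalidation is skipped but NOT deleted -- mirroring the
    behavior described for QueueConsumer::usableArchiveExists().
    max_age is a datetime.timedelta."""
    return ts_archived is not None and now - ts_archived < max_age

# An archive finished moments ago, so the queued invalidation is skipped.
now = datetime.datetime(2020, 9, 30, 12, 0)
finished_moments_ago = datetime.datetime(2020, 9, 30, 11, 55)
print(usable_archive_exists(finished_moments_ago, now,
                            datetime.timedelta(hours=1)))  # True
```

Because the invalidation is skipped rather than deleted, the next hourly cron sees it again, which is the repeat loop described above.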
w/ this PR:
- add invalidation 1 for Month Sep 2020 at 7AM
- add invalidation 2 for Month Sep 2020 at 8AM
- add invalidation 3 for Month Sep 2020 at 9AM
- start archiving Month Sep 2020 at 09:30 AM
- core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and discards it
- core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and discards it
- something adds invalidation 4 for Month Sep 2020 at 10AM
- something adds invalidation 5 for Month Sep 2020 at 11AM
- archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
- core:archive pulls the next archive, sees invalidation 4. there is none in progress, so it passes that check. then it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
- same thing happens to invalidation 5.
- next hourly cron, invalidation 4 is picked up. it is started.
- invalidation 5 is seen, since invalidation 4 is in progress, it is discarded.
- invalidation 4 is finished and there are no more invalidations left.
I don't think involving time is needed.
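The PR's behavior in the flow above can be sketched as a simplified Python model (illustrative only; names like `process_queue` and `in_progress` are not Matomo's actual API):

```python
# Simplified model of the PR's queue-consumer change: while an archive for
# a period is in progress, duplicate invalidations for the same period are
# discarded instead of being skipped and retried forever.

def process_queue(invalidations, in_progress):
    """invalidations: list of (id, period); in_progress: set of periods
    currently being archived. Returns (started ids, discarded ids)."""
    started, discarded = [], []
    for inv_id, period in invalidations:
        if period in in_progress:
            # An identical archive is already running: drop the duplicate.
            discarded.append(inv_id)
            continue
        in_progress.add(period)
        started.append(inv_id)
    return started, discarded

# Invalidations 1-3 for Sep 2020: the first starts, the rest are discarded.
started, discarded = process_queue(
    [(1, "2020-09"), (2, "2020-09"), (3, "2020-09")], set())
print(started, discarded)  # [1] [2, 3]
```

Invalidations 4 and 5 added while the archive runs are simply not in the queue snapshot yet, so they survive and are handled on the next pass.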
Just to be sure I'm fully understanding it, @diosmosis:
If there are multiple archivers launching at the very same time (which we support), or if more archivers start between 9:30am (start of the archive) and 11:30am (finish of the archive): would one of the other archiving jobs maybe try to process invalidations 4 and 5 and remove them?
@tsteur if you mean multiple core:archive processes, then no, because we still use the site ID queue (ie, SharedSiteIds), so we shouldn't be archiving the same site at the same time.
I'm not sure if it's still the case but I assume it might be. What would happen is that when all sites have been processed, the queue would reset and all sites would be added again. For simplicity let's say there's only one site. That one site would be taken out of the queue, so the queue would then be empty. The next time the cron runs it would add idSite=1 again. So they might run in parallel.
@diosmosis I just double checked and the same site might be processed concurrently in 3.x, e.g. if you use SharedSiteIds or if you use force-idsites=X.
Is it complicated to add, @diosmosis? Having this check there would certainly prevent possible random issues of data not being rearchived, I would say.
Added it + a fix that the test picked up.
@diosmosis did you push the fix? Only seeing 1 commit so far
@tsteur pushed to the wrong branch, should be here now
…ertain duplicate invalidations. Also fix issue w/ done flag detection, should not process done flag types together otherwise segment can be done before day.
@diosmosis generally looks good to merge. Can you look at the failing tests before merging? There's a test failing because now the output of the archiving changes. Not sure if https://travis-ci.org/github/matomo-org/matomo/jobs/726843740#L717 is related to this change
That's random; we should probably change how we assert there, we just need to check it's different than before. Will look into it.
@@ -229,6 +227,18 @@ public function getNextArchivesToProcess()
        continue;
    }

    $alreadyInProgressId = $this->model->isArchiveAlreadyInProgress($invalidatedArchive);
    if ($alreadyInProgressId) {
btw I now wonder if this would actually work, because we order the invalidations by ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation ASC. Doesn't this mean that if there are, say, 5 queued for the same period, it would pick the one with the lowest ID? So it wouldn't delete the other ones, because they all have higher IDs?
On the other side we can't use the highest ID because it would maybe remove too many.
eg
- invalidation 3: month sep 2020
- invalidation 4: day sep 2020 14
- invalidation 5: month sep 2020
eg if it was to select invalidation 5 first (and thus ignore and delete invalidation 3), and it archives the month, and then it archives invalidation 4, the day period, then the month would maybe never be rearchived.
Not sure it's clear what I mean, but selecting the highest ID maybe wouldn't work either and could generally lead to wrong data.
Just wanted to check whether it's actually an issue that we select the lowest ID and this would then maybe not work. Not sure though, it might not be an issue.
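The ordering concern can be illustrated by approximating the query's sort in Python (a sketch; period is encoded here as a number where day < month so `period ASC` sorts days first, and the rows are illustrative):

```python
# Approximate ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC,
# idinvalidation ASC with a single reverse sort over a negated key.
invalidations = [
    {"id": 3, "date1": "2020-09-01", "period": 3, "name": "done"},  # month
    {"id": 4, "date1": "2020-09-14", "period": 1, "name": "done"},  # day
    {"id": 5, "date1": "2020-09-01", "period": 3, "name": "done"},  # month
]

ordered = sorted(
    invalidations,
    key=lambda r: (r["date1"], -r["period"], -len(r["name"]), -r["id"]),
    reverse=True,
)
# date1 DESC puts the day (Sep 14) first; between the two identical month
# rows, idinvalidation ASC picks id 3 before id 5.
print([r["id"] for r in ordered])  # [4, 3, 5]
```

So with this ordering the consumer reaches the lowest-ID duplicate first, which is exactly why the higher duplicates would never be deleted.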
@tsteur if this is about keeping different machines/processes from doing the same archiving job, then it should be fine? One process starts 3 because it's the earliest idinvalidation. Another picks up 5 and ignores it because 3 is in progress, but since 5 > 3, it's kept in the table.
Doesn't this mean if there are like 5 queued for the same period, that it would pick the one with the lowest ID? so it wouldn't delete the other ones maybe because they are all lower?
Ok, understanding this better... we could probably do idinvalidation DESC... will think a bit about whether that would be an issue...
yeah, the issue would be that when
invalidation 3: month sep 2020
invalidation 4: month sep 2020
invalidation 5: day sep 2020 14
invalidation 6: month sep 2020
one process works on 3, a parallel process would pick 4. It would notice 4 is higher than 3 and skip it but not delete it. So it then never actually deletes any of the duplicate ones. On the other side we also cannot fetch idinvalidation DESC, because we need to make sure that after invalidation 5 the month will be reprocessed again.
The only way to prevent this is not to do a bulk insert and instead, before each insert, delete any already existing entry for the same period. Not even sure if this could have edge cases/race conditions (performance wise it should be fine considering there's an index on idsite,date1,period, but it would of course result in an additional query for each insert, which isn't exactly great but maybe needed, not sure). Although I reckon it should work, because things are sorted by ORDER BY date1 DESC, period ASC and it should always archive any lower invalidated period before archiving the higher period.
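The delete-before-insert idea amounts to something like this (a hypothetical Python sketch standing in for SQL against the invalidations table; the function name and row layout are illustrative):

```python
def insert_invalidation(table, site_id, date1, period, inv_id):
    """Before inserting, delete any pending duplicate for the same
    (idsite, date1, period), so at most one queued invalidation exists
    per period. `table` stands in for the archive_invalidations rows."""
    table[:] = [r for r in table
                if not (r["idsite"] == site_id
                        and r["date1"] == date1
                        and r["period"] == period)]
    table.append({"idsite": site_id, "date1": date1,
                  "period": period, "id": inv_id})

rows = []
insert_invalidation(rows, 1, "2020-09-01", "month", 3)
insert_invalidation(rows, 1, "2020-09-01", "month", 6)  # replaces id 3
print([r["id"] for r in rows])  # [6]
```

The trade-off named above applies: one extra delete query per insert, in exchange for never accumulating duplicates in the first place.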
eg if it was to select invalidation 5 first (and thus ignore and delete invalidation 3), and it archives month, and then it archives invalidation 4 the day period, then the month would maybe never be rearchived.
so in this scenario:
- invalidation 4 (day) would be found first and archived (we order by period asc too)
- then invalidation 5, (we'd perhaps delete all invalidations below 5 just after starting it)
then we're fine. I guess it's more complicated when invalidations are added in between archiving. For example:
- invalidation 4 (day)
- start on invalidation 5
- meanwhile an invalidation for sep 14 is added back
- after invalidation 5 is done, we see sep 14 and re-archive
- the month is no longer accurate
Of course this would only be an issue if somehow sep 14 is invalidated, but no parent periods as well... which isn't the case unless there's a bug. In this case it would still work I think.
Will try w/ idinvalidation DESC since it definitely won't work right w/ idinvalidation ASC.
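With idinvalidation DESC the consumer starts the newest duplicate per period and can then safely delete the older ones; a simplified sketch of that selection (hypothetical helper, not Matomo code):

```python
def pick_and_prune(invalidations):
    """Pick the highest-id invalidation per period and drop lower
    duplicates. Sketch of the idinvalidation DESC approach: starting the
    newest duplicate means no data change recorded after the older
    duplicates can be missed, so the older ones are deletable."""
    best = {}
    for inv_id, period in invalidations:
        if period not in best or inv_id > best[period]:
            best[period] = inv_id
    picked = sorted(best.values())
    deleted = sorted(i for i, p in invalidations if i != best[p])
    return picked, deleted

picked, deleted = pick_and_prune(
    [(3, "month"), (4, "month"), (5, "day"), (6, "month")])
print(picked, deleted)  # [5, 6] [3, 4]
```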
👍 it probably works with DESC since it should archive any lower period first anyway
Another scenario:
- invalidation 5 is started
- new month invalidation comes in
- we finish 5, see new month one. it's too soon since the last archive so we don't start it, but we don't remove it
this seems ok to me too?
btw was just checking the invalidations on the demo. There seem to be heaps of invalidations at the same second. It's idSite 60, which is a roll up. Not sure if that's a different issue or not. It should be kind of fixed with this issue, but ideally it wasn't creating that many entries in the first place. That's demo, not demo2.
@tsteur that's strange... not sure what would invalidate so much... I'll debug core:archive on demo.
Can't reproduce on demo by executing core:archive manually... will try something else.
I think this has something to do w/ invalidating today, then invalidating yesterday. There are two week period duplicates w/ the same timestamp and 6 month period duplicates also w/ the same timestamp. Not sure why there are two weeks and 6 months though...
@tsteur finally found the cause of the issue, it's due to RollUpReporting handling.
@diosmosis ok thanks. I thought I had a fix for this but it didn't quite work out (in the
This shouldn't be the case? When we get the highest ID, we start it, and when successfully started, we remove all lower identical invalidations. All in the same archiving run.
There's an event in the invalidate method, so in each separate invalidate call, demo adds the same idSite, resulting in the same periods being invalidated. We could do a complex query before the batch insert to check whether every one we're looking for is already present, but that would make the method even more complex... would be easy to test though, so I can add this.
👍 The query shouldn't be needed. Was only meaning if it's in PHP in memory already.
Ah I see... we could use the transient cache w/ a TTL (since of course core:archive would just keep using the cache). It would make debugging a bit cleaner, so I can see why it would be useful.
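The in-memory dedup with a TTL could look roughly like this (a hypothetical sketch; Matomo's actual transient cache API differs, and the class and method names here are invented for illustration):

```python
import time

class TtlDedup:
    """Remember recently issued invalidations for ttl seconds, so a
    long-running core:archive process skips re-inserting the same
    (site, period, date) invalidation within the window."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self.seen = {}  # (site_id, period, date1) -> time first seen

    def should_invalidate(self, site_id, period, date1):
        key = (site_id, period, date1)
        now = time.time()
        inserted = self.seen.get(key)
        if inserted is not None and now - inserted < self.ttl:
            return False  # duplicate within the TTL window, skip insert
        self.seen[key] = now
        return True

cache = TtlDedup(ttl=3600)
print(cache.should_invalidate(60, "week", "2020-09-07"))  # True
print(cache.should_invalidate(60, "week", "2020-09-07"))  # False
```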
plugins/CoreAdminHome/Tasks.php (outdated)
@@ -112,8 +112,12 @@ public function invalidateOutdatedArchives()
        return;
    }

    $cronArchive = new CronArchive();
    $cronArchive->invalidateArchivedReportsForSitesThatNeedToBeArchivedAgain();
    $idSites = Request::processRequest('SitesManager.getAllSites');
just fyi @diosmosis, could have used the getAllSitesId method, but not too important. Won't need to change it
actually quickly changed it just fyi
@tsteur here is the extra check. I didn't add a check when inserting since we're using LOAD DATA INFILE, so removing duplicates is complicated w/o a UNIQUE index.
You can look at the queue consumer test to verify.