
Add extra check for queue consumer to prevent duplicates and add duplicate check in tests. #16406

Merged

merged 15 commits into 4.x-dev from extra-archive-invalidation-check on Sep 20, 2020

Conversation

@diosmosis (Member) commented Sep 8, 2020

@tsteur here is the extra check. I didn't add a check when inserting, since we're using LOAD DATA INFILE, so removing duplicates is complicated without a UNIQUE index.

You can look at the queue consumer test to verify.
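For context, a rough sketch of where the duplicate handling ends up (paraphrased loop body, not the exact code; isSimilarArchiveInProgress is the model method added in this PR):

```php
// Sketch of the consumer-side check (paraphrased): invalidations are inserted in bulk
// via LOAD DATA INFILE, and without a UNIQUE index there is no ON DUPLICATE KEY
// mechanism to deduplicate on insert, so the check happens when the consumer pulls work.
foreach ($invalidatedArchives as $invalidatedArchive) {
    if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
        continue; // an identical archive is already being processed, skip this one
    }

    // ... otherwise the invalidation is queued for processing as before
}
```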

@diosmosis added the not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org) and Needs Review (PRs that need a code review) labels on Sep 8, 2020
@diosmosis added this to the 4.0.0 milestone on Sep 8, 2020
@@ -201,6 +201,13 @@ public function getNextArchivesToProcess()
continue;
}

if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
Member

@diosmosis I now wonder if things need to be more complicated... like, do we need to know the "archive start time" of the archive that is already in progress, and only cancel invalidations that were added to the invalidations table before that start time?

E.g.

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • add invalidation 4 for Month Sep 2020 at 10AM
  • add invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • Meaning 1,2,3 would be removed, and 4 and 5 would be just skipped here?

Not sure if that's possible or if I'm missing something.

This way we make sure to start another archive after the current one finishes, so it includes all the data.
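A sketch of what that could look like (the helper and timestamp comparison below are illustrative, not existing code):

```php
// Hypothetical sketch of the idea above; getInProgressArchiveFor() and deleteInvalidations()
// are illustrative names, not an existing Matomo API.
$inProgress = $this->model->getInProgressArchiveFor($invalidatedArchive);
if (!empty($inProgress)) {
    $queuedAt  = strtotime($invalidatedArchive['ts_invalidated']);
    $startedAt = strtotime($inProgress['ts_started']);

    if ($queuedAt <= $startedAt) {
        // queued before the running archive started: its data will be covered, so remove it
        $this->model->deleteInvalidations([$invalidatedArchive]);
    }
    // queued after the archive started: keep it so the period is archived again later
}
```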

Member Author

This is what would currently happen:

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and ignores it
  • core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and ignores it
  • something adds invalidation 4 for Month Sep 2020 at 10AM
  • something adds invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • core:archive pulls the next archive, sees invalidation 2. it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
  • same thing happens for 2-5
  • next hourly cron, invalidation 2 is picked up. and the process repeats.

w/ this PR:

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and discards it
  • core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and discards it
  • something adds invalidation 4 for Month Sep 2020 at 10AM
  • something adds invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • core:archive pulls the next archive, sees invalidation 4. there is none in progress, so it passes that check. then it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
  • same thing happens to invalidation 5.
  • next hourly cron, invalidation 4 is picked up. it is started.
  • invalidation 5 is seen, since invalidation 4 is in progress, it is discarded.
  • invalidation 4 is finished and there are no more invalidations left.

I don't think involving time is needed.
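Roughly, the difference between "discards" and "ignores but does not delete" above is this (loop-body sketch only; deleteInvalidations is a stand-in name, and usableArchiveExists is the existing QueueConsumer method, whose real signature may differ):

```php
// Sketch of the two outcomes described above (not the literal implementation):
if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
    // with this PR: another process is already archiving the same period, so the
    // duplicate invalidation is removed from the table ("discarded"), not just skipped
    $this->model->deleteInvalidations([$invalidatedArchive]); // stand-in method name
    continue;
}

if ($this->usableArchiveExists($invalidatedArchive)) {
    // a recent enough archive already exists: skip it for now, but keep the
    // invalidation so a later run can still pick it up ("ignored, not deleted")
    continue;
}
```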

Member

Just to be sure I'm fully understanding it @diosmosis

If there are multiple archivers launching at the very same time (which we support), or more archivers starting between 9:30am (start of the archive) and 11:30am (finish of the archive), would one of the other archiving jobs maybe try to process invalidations 4 and 5 and remove them?

Member Author

@tsteur if you mean multiple core:archive processes, then no, because we still use the site ID queue (i.e. SharedSiteIds), so we shouldn't be archiving the same site at the same time.

Member

I'm not sure if it's still the case, but I assume it might be. What would happen is that when all sites have been processed, the queue resets and all sites are added again. For simplicity, let's say there's only one site. That site would be taken out of the queue, so the queue would then be empty. The next time the cron runs, it would add idSite=1 again. So they might run in parallel.

Member

@diosmosis I just double checked, and the same site might be processed concurrently in 3.x, e.g. if you use SharedSiteIds or if you use force-idsites=X.

Member

Is it complicated to add, @diosmosis? Having this there would certainly prevent possible random issues of data not being rearchived, I would say.

Member Author

Added it + a fix that the test picked up.

Member

@diosmosis did you push the fix? Only seeing 1 commit so far

Member Author

@tsteur I pushed to the wrong branch; it should be here now.

diosmosis and others added 3 commits September 11, 2020 17:47
…ertain duplicate invalidations. Also fix issue w/ done flag detection, should not process done flag types together otherwise segment can be done before day.
@tsteur (Member) commented Sep 13, 2020

@diosmosis generally looks good to merge. Can you look at the failing tests before merging?

There's a test failing because now the output of the archiving changes.

Not sure if https://travis-ci.org/github/matomo-org/matomo/jobs/726843740#L717 is related to this change

@diosmosis (Member Author)

Not sure if https://travis-ci.org/github/matomo-org/matomo/jobs/726843740#L717 is related to this change

That's a random failure; we should probably change how we assert there, since we just need to check it's different from before. Will look into it.

@@ -229,6 +227,18 @@ public function getNextArchivesToProcess()
continue;
}

$alreadyInProgressId = $this->model->isArchiveAlreadyInProgress($invalidatedArchive);
if ($alreadyInProgressId) {
Member

Btw, I now wonder if this would actually work, because we order the invalidations with ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation ASC. Doesn't this mean that if there are, say, 5 queued for the same period, it would pick the one with the lowest ID? So it wouldn't delete the other ones, because they all have higher IDs?

On the other hand, we can't use the highest ID either, because it might remove too many.

E.g.

  • invalidation 3: month sep 2020
  • invalidation 4: day sep 2020 14
  • invalidation 5: month sep 2020

E.g. if it were to select invalidation 5 first (and thus ignore and delete invalidation 3), archive the month, and then archive invalidation 4 (the day period), then the month would maybe never be rearchived.

Not sure if it's clear what I mean, but selecting the highest ID maybe wouldn't work either and could generally lead to wrong data.

Just wanted to check whether it's actually an issue that we select the lowest ID, in which case this might not work. Not sure though, it might not be an issue.

Member Author

@tsteur if this is about keeping different machines/processes from doing the same archiving job, then it should be fine? One process starts 3 because it's the earliest idinvalidation. Another picks up 5 and ignores it because 3 is in progress, but since 5 > 3, it's kept in the table.

Member Author

Doesn't this mean that if there are, say, 5 queued for the same period, it would pick the one with the lowest ID? So it wouldn't delete the other ones, because they all have higher IDs?

Ok, understanding this better... we could probably do idinvalidation DESC... will think a bit about whether that would be an issue...
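For reference, the two orderings being discussed (sketch; the DESC variant is the proposal here, not what the code currently does):

```php
// The ordering discussed above, side by side (sketch only):
$currentOrder  = 'ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation ASC';
// the lowest idinvalidation is picked first, so the duplicates all have higher IDs
// and a "delete lower IDs once started" cleanup never matches them

$proposedOrder = 'ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation DESC';
// the highest idinvalidation is picked first, so the started job can delete the lower duplicates
```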

Member

Yeah, the issue would be that when

  • invalidation 3: month sep 2020
  • invalidation 4: month sep 2020
  • invalidation 5: day sep 2020 14
  • invalidation 6: month sep 2020

it works on 3, a parallel process would pick 4. It would notice 4 is higher than 3 and skip it but not delete it. So it then never actually deletes any of the duplicates. On the other hand, we also cannot fetch idinvalidation DESC, because we need to make sure that after invalidation 5 the month will be reprocessed again.

The only way to prevent this is not to do a bulk insert, and instead delete any already existing entry for the same period before the insert. I'm not even sure if this could have edge cases/race conditions (performance-wise it should be fine considering there's an index on idsite, date1, period, but it would of course result in an additional query for each insert, which isn't exactly great, though maybe needed, not sure). Although I reckon it should work, because things are sorted by ORDER BY date1 DESC, period ASC, and it should always archive any lower invalidated period before archiving the higher period.
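A sketch of that "delete before insert" idea (illustrative only; the archive_invalidations table/column names are assumed from Matomo 4's schema, and whether to also match on name/status is an open detail):

```php
use Piwik\Common;
use Piwik\Db;

// Illustrative only: remove any already queued invalidation for the same site/period
// before inserting the new one, instead of relying on the consumer to clean up later.
Db::query(
    'DELETE FROM ' . Common::prefixTable('archive_invalidations')
    . ' WHERE idsite = ? AND period = ? AND date1 = ? AND date2 = ? AND name = ?',
    [$idSite, $period, $date1, $date2, $doneFlag]
);

// ... then perform the bulk insert (LOAD DATA INFILE) as before
```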

Member Author

E.g. if it were to select invalidation 5 first (and thus ignore and delete invalidation 3), archive the month, and then archive invalidation 4 (the day period), then the month would maybe never be rearchived.

so in this scenario:

  • invalidation 4 (day) would be found first and archived (we order by period asc too)
  • then invalidation 5, (we'd perhaps delete all invalidations below 5 just after starting it)

then we're fine. I guess it's more complicated when invalidations are added in between archiving. For example:

  • invalidation 4 (day)
  • start on invalidation 5
  • meanwhile an invalidation for sep 14 is added back
  • after invalidation 5 is done, we see sep 14 and re-archive
  • the month is no longer accurate

Of course, this would only be an issue if somehow Sep 14 is invalidated but no parent periods are... which isn't the case unless there's a bug. In that case it would still work, I think.

Will try w/ idinvalidation DESC since it definitely won't work right w/ idinvalidation ASC.

Member

👍 it probably works with DESC since it should archive any lower period first anyway

Member Author

Another scenario:

  • invalidation 5 is started
  • new month invalidation comes in
  • we finish 5 and see the new month one; it's too soon since the last archive, so we don't start it, but we don't remove it either

this seems ok to me too?

@tsteur (Member) commented Sep 14, 2020

Btw, I was just checking the invalidations on the demo. There seem to be heaps of invalidations at the same second. It's idSite 60, which is a roll-up.

This is for the year periods:
[screenshot of year-period invalidations]

Not sure if that's a different issue or not. It should be kind of fixed with this PR, but ideally it wouldn't be creating that many entries in the first place. That's demo, not demo2.

This is for the month periods:
[screenshot of month-period invalidations]

@diosmosis (Member Author)

@tsteur that's strange... not sure what would invalidate so much... I'll debug core:archive on demo.

@diosmosis (Member Author)

can't reproduce on demo by executing core:archive manually... will try something else

@diosmosis (Member Author)

I think this has something to do with invalidating today, then invalidating yesterday. There are two duplicate week-period invalidations with the same timestamp and six duplicate month-period invalidations, also with the same timestamp. Not sure why it's two weeks and six months, though...

@diosmosis (Member Author)

@tsteur I finally found the cause of the issue: it's due to RollUpReporting handling Archiving.getIdSitesToMarkArchivesAsInvalidated. The same site gets invalidated for every child site that's invalidated. Not sure what the best fix is yet.

@tsteur (Member) commented Sep 15, 2020
@diosmosis ok, thanks. I thought I had a fix for this, but it didn't quite work out (in the rollupchildreninvalidate branch in the roll-ups repo). Generally, if a site had new traffic and it is part of a roll-up, then we also have to invalidate the roll-ups. In all other cases there should be no need to invalidate the parent roll-up. The only way I can think of to fix this right now is to hope that all duplicate entries would eventually be removed with this PR, since we'd fetch the highest invalidation ID and skip and remove lower IDs. But this would only happen if another archiving job runs while this one is archiving. Before inserting the invalidations, could we maybe at least remove duplicates, if that's an issue? I don't know if these would be inserted all at once or over different bulk inserts from different invalidations.
[screenshot of duplicate invalidations]

@diosmosis (Member Author)

But this would only happen if another archiving job runs while this one is archiving.

This shouldn't be the case? When we get the highest ID, we start it, and once it has successfully started, we remove all lower identical invalidations, all in the same archiving run.
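In other words, roughly (sketch; method names below are stand-ins for what the PR does, not exact signatures):

```php
// Sketch of the behaviour described above: once the highest-ID invalidation for a period
// is successfully started, the same run deletes the older duplicates for that period.
$started = $this->model->startArchive($invalidatedArchive);               // stand-in name
if ($started) {
    $this->model->deleteOlderDuplicateInvalidations($invalidatedArchive); // stand-in name
}
```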

@diosmosis (Member Author)

Don't know if these would be inserted all at once or over different bulk inserts from different invalidations

There's an event in the invalidate method, so in each separate invalidate call, demo adds the same idSite, resulting in the same periods being invalidated again. We could do a complex query before the batch insert to check whether every invalidation we're about to insert is already present, but that would make the method even more complex... It would be easy to test though, so I can add this.
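Something like this is what that pre-insert check could look like (sketch; getExistingInvalidations is hypothetical and the dedup key is simplified):

```php
// Hypothetical sketch: filter out invalidations that are already queued before the
// bulk insert, so repeated invalidate calls for the same roll-up don't pile up.
$existing = $model->getExistingInvalidations($idSite, $dates, $periods); // hypothetical helper

$toInsert = array_filter($invalidations, function ($inv) use ($existing) {
    $key = $inv['date1'] . '|' . $inv['period'] . '|' . $inv['name'];
    return !isset($existing[$key]); // keep only rows that are not already queued
});

// only $toInsert goes through the LOAD DATA INFILE bulk insert
```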

@tsteur (Member) commented Sep 16, 2020
👍 The query shouldn't be needed. I only meant if it's already in memory in PHP.

@diosmosis (Member Author)

I only meant if it's already in memory in PHP.

Ah, I see... we could use the transient cache with a TTL (since of course core:archive would otherwise just keep using the cache). It would make debugging a bit cleaner, so I can see why it would be useful.
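A minimal sketch of the in-memory TTL idea (plain static array here; whether to use Matomo's transient cache class instead is exactly the open question):

```php
// Minimal sketch of skipping repeat invalidations in memory with a TTL; a plain static
// array is used for illustration, the transient cache would be the real candidate.
private static $recentlyInvalidated = [];

private function wasRecentlyInvalidated($key, $ttlInSeconds = 3600)
{
    $now = time();

    if (isset(self::$recentlyInvalidated[$key])
        && ($now - self::$recentlyInvalidated[$key]) < $ttlInSeconds
    ) {
        return true; // same site/period was invalidated within the TTL, skip the duplicate
    }

    self::$recentlyInvalidated[$key] = $now;
    return false;
}
```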

@@ -112,8 +112,12 @@ public function invalidateOutdatedArchives()
return;
}

$cronArchive = new CronArchive();
$cronArchive->invalidateArchivedReportsForSitesThatNeedToBeArchivedAgain();
$idSites = Request::processRequest('SitesManager.getAllSites');
Member

Just FYI @diosmosis, you could have used the getAllSitesId method, but it's not too important. No need to change it.
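I.e. (for reference; getAllSitesId returns just the site IDs):

```php
// The slightly leaner call mentioned above:
$idSites = Request::processRequest('SitesManager.getAllSitesId');
```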

Member

Actually, I quickly changed it, just FYI.

@tsteur merged commit 1cb41a7 into 4.x-dev on Sep 20, 2020
@tsteur deleted the extra-archive-invalidation-check branch on September 20, 2020 at 22:30