
Add extra check for queue consumer to prevent duplicates and add duplicate check in tests. #16406

Merged

merged 15 commits into 4.x-dev from extra-archive-invalidation-check on Sep 20, 2020

Conversation

@diosmosis (Member) commented Sep 8, 2020

@tsteur here is the extra check. I didn't add a check when inserting, since we're using LOAD DATA INFILE, so removing duplicates is complicated without a UNIQUE index.

You can look at the queue consumer test to verify.
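For context, a rough sketch of where the duplicate handling ends up (paraphrased loop body, not the exact code; isSimilarArchiveInProgress is the model method added in this PR):

```php
// Sketch of the consumer-side check (paraphrased): invalidations are inserted in bulk
// via LOAD DATA INFILE, and without a UNIQUE index there is no ON DUPLICATE KEY
// mechanism to deduplicate on insert, so the check happens when the consumer pulls work.
foreach ($invalidatedArchives as $invalidatedArchive) {
    if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
        continue; // an identical archive is already being processed, skip this one
    }

    // ... otherwise the invalidation is queued for processing as before
}
```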

@diosmosis added the not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org) and Needs Review (PRs that need a code review) labels on Sep 8, 2020
@diosmosis added this to the 4.0.0 milestone on Sep 8, 2020
@@ -201,6 +201,13 @@ public function getNextArchivesToProcess()
continue;
}

if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
Member

@diosmosis I now wonder if things need to be more complicated... like, do we need to know the "archive start time" of the archive that is already in progress, and only cancel invalidations that were added to the invalidations table before that start time?

E.g.

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • add invalidation 4 for Month Sep 2020 at 10AM
  • add invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • Meaning 1,2,3 would be removed, and 4 and 5 would be just skipped here?

Not sure if that's possible or if I'm missing something.

This way we make sure to start another archive after the current one finishes, so it includes all the data.
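A sketch of what that could look like (the helper and timestamp comparison below are illustrative, not existing code):

```php
// Hypothetical sketch of the idea above; getInProgressArchiveFor() and deleteInvalidations()
// are illustrative names, not an existing Matomo API.
$inProgress = $this->model->getInProgressArchiveFor($invalidatedArchive);
if (!empty($inProgress)) {
    $queuedAt  = strtotime($invalidatedArchive['ts_invalidated']);
    $startedAt = strtotime($inProgress['ts_started']);

    if ($queuedAt <= $startedAt) {
        // queued before the running archive started: its data will be covered, so remove it
        $this->model->deleteInvalidations([$invalidatedArchive]);
    }
    // queued after the archive started: keep it so the period is archived again later
}
```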

Member Author

This is what would currently happen:

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and ignores it
  • core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and ignores it
  • something adds invalidation 4 for Month Sep 2020 at 10AM
  • something adds invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • core:archive pulls the next archive, sees invalidation 2. it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
  • same thing happens for 2-5
  • next hourly cron, invalidation 2 is picked up. and the process repeats.

w/ this PR:

  • add invalidation 1 for Month Sep 2020 at 7AM
  • add invalidation 2 for Month Sep 2020 at 8AM
  • add invalidation 3 for Month Sep 2020 at 9AM
  • start archiving Month Sep 2020 at 09:30 AM
  • core:archive picks up invalidation 2, tries to lock or sees that there is an existing one and discards it
  • core:archive picks up invalidation 3, tries to lock or sees that there is an existing one and discards it
  • something adds invalidation 4 for Month Sep 2020 at 10AM
  • something adds invalidation 5 for Month Sep 2020 at 11AM
  • archive finishes... we should now remove the IDs 1,2,3 but maybe keep 4 and 5 or at least only 5?
  • core:archive pulls the next archive, sees invalidation 4. there is none in progress, so it passes that check. then it checks the archive's ts_archived, sees that it was finished moments ago and ignores it but does not delete it. (see QueueConsumer::usableArchiveExists() and where it is called).
  • same thing happens to invalidation 5.
  • next hourly cron, invalidation 4 is picked up. it is started.
  • invalidation 5 is seen, since invalidation 4 is in progress, it is discarded.
  • invalidation 4 is finished and there are no more invalidations left.

I don't think involving time is needed.
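Roughly, the difference between "discards" and "ignores but does not delete" above is this (loop-body sketch only; deleteInvalidations is a stand-in name, and usableArchiveExists is the existing QueueConsumer method, whose real signature may differ):

```php
// Sketch of the two outcomes described above (not the literal implementation):
if ($this->model->isSimilarArchiveInProgress($invalidatedArchive)) {
    // with this PR: another process is already archiving the same period, so the
    // duplicate invalidation is removed from the table ("discarded"), not just skipped
    $this->model->deleteInvalidations([$invalidatedArchive]); // stand-in method name
    continue;
}

if ($this->usableArchiveExists($invalidatedArchive)) {
    // a recent enough archive already exists: skip it for now, but keep the
    // invalidation so a later run can still pick it up ("ignored, not deleted")
    continue;
}
```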

Member

Just to be sure I'm fully understanding it @diosmosis

If there are multiple archivers launching at the very same time (which we support), or more archivers starting between 9:30am (start of the archive) and 11:30am (finish of the archive), would one of the other archiving jobs maybe try to process invalidations 4 and 5 and remove them?

Member Author

@tsteur if you mean multiple core:archive processes, then no, because we still use the site ID queue (i.e. SharedSiteIds), so we shouldn't be archiving the same site at the same time.

Member

I'm not sure if it's still the case, but I assume it might be. What would happen is that when all sites have been processed, the queue resets and all sites are added again. For simplicity, let's say there's only one site. That site would be taken out of the queue, so the queue would then be empty. The next time the cron runs, it would add idSite=1 again. So they might run in parallel.

Member

@diosmosis I just double checked, and the same site might be processed concurrently in 3.x, e.g. if you use SharedSiteIds or if you use force-idsites=X.

Member

Is it complicated to add, @diosmosis? Having this there would certainly prevent possible random issues of data not being rearchived, I would say.

Member Author

Added it + a fix that the test picked up.

Member

@diosmosis did you push the fix? Only seeing 1 commit so far

Member Author

@tsteur I pushed to the wrong branch; it should be here now.

diosmosis and others added 3 commits September 11, 2020 17:47
…ertain duplicate invalidations. Also fix issue w/ done flag detection, should not process done flag types together otherwise segment can be done before day.
@tsteur (Member) commented Sep 13, 2020

@diosmosis generally looks good to merge. Can you look at the failing tests before merging?

There's a test failing because now the output of the archiving changes.

Not sure if https://travis-ci.org/github/matomo-org/matomo/jobs/726843740#L717 is related to this change

@diosmosis (Member Author)

Not sure if https://travis-ci.org/github/matomo-org/matomo/jobs/726843740#L717 is related to this change

That's a random failure; we should probably change how we assert there, since we just need to check it's different from before. Will look into it.

@@ -229,6 +227,18 @@ public function getNextArchivesToProcess()
continue;
}

$alreadyInProgressId = $this->model->isArchiveAlreadyInProgress($invalidatedArchive);
if ($alreadyInProgressId) {
Member

Btw, I now wonder if this would actually work, because we order the invalidations with ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation ASC. Doesn't this mean that if there are, say, 5 queued for the same period, it would pick the one with the lowest ID? So it wouldn't delete the other ones, because they all have higher IDs?

On the other hand, we can't use the highest ID either, because it might remove too many.

E.g.

  • invalidation 3: month sep 2020
  • invalidation 4: day sep 2020 14
  • invalidation 5: month sep 2020

E.g. if it were to select invalidation 5 first (and thus ignore and delete invalidation 3), archive the month, and then archive invalidation 4 (the day period), then the month would maybe never be rearchived.

Not sure if it's clear what I mean, but selecting the highest ID maybe wouldn't work either and could generally lead to wrong data.

Just wanted to check whether it's actually an issue that we select the lowest ID, in which case this might not work. Not sure though, it might not be an issue.

Member Author

@tsteur if this is about keeping different machines/processes from doing the same archiving job, then it should be fine? One process starts 3 because it's the earliest idinvalidation. Another picks up 5 and ignores it because 3 is in progress, but since 5 > 3, it's kept in the table.

Member Author

Doesn't this mean that if there are, say, 5 queued for the same period, it would pick the one with the lowest ID? So it wouldn't delete the other ones, because they all have higher IDs?

Ok, understanding this better... we could probably do idinvalidation DESC... will think a bit about whether that would be an issue...
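For reference, the two orderings being discussed (sketch; the DESC variant is the proposal here, not what the code currently does):

```php
// The ordering discussed above, side by side (sketch only):
$currentOrder  = 'ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation ASC';
// the lowest idinvalidation is picked first, so the duplicates all have higher IDs
// and a "delete lower IDs once started" cleanup never matches them

$proposedOrder = 'ORDER BY date1 DESC, period ASC, CHAR_LENGTH(name) ASC, idinvalidation DESC';
// the highest idinvalidation is picked first, so the started job can delete the lower duplicates
```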

Member

Yeah, the issue would be that when

  • invalidation 3: month sep 2020
  • invalidation 4: month sep 2020
  • invalidation 5: day sep 2020 14
  • invalidation 6: month sep 2020

it works on 3, a parallel process would pick 4. It would notice 4 is higher than 3 and skip it but not delete it. So it then never actually deletes any of the duplicates. On the other hand, we also cannot fetch idinvalidation DESC, because we need to make sure that after invalidation 5 the month will be reprocessed again.

The only way to prevent this is not to do a bulk insert, and instead delete any already existing entry for the same period before the insert. I'm not even sure if this could have edge cases/race conditions (performance-wise it should be fine considering there's an index on idsite, date1, period, but it would of course result in an additional query for each insert, which isn't exactly great, though maybe needed, not sure). Although I reckon it should work, because things are sorted by ORDER BY date1 DESC, period ASC, and it should always archive any lower invalidated period before archiving the higher period.
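A sketch of that "delete before insert" idea (illustrative only; the archive_invalidations table/column names are assumed from Matomo 4's schema, and whether to also match on name/status is an open detail):

```php
use Piwik\Common;
use Piwik\Db;

// Illustrative only: remove any already queued invalidation for the same site/period
// before inserting the new one, instead of relying on the consumer to clean up later.
Db::query(
    'DELETE FROM ' . Common::prefixTable('archive_invalidations')
    . ' WHERE idsite = ? AND period = ? AND date1 = ? AND date2 = ? AND name = ?',
    [$idSite, $period, $date1, $date2, $doneFlag]
);

// ... then perform the bulk insert (LOAD DATA INFILE) as before
```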

Member Author

E.g. if it were to select invalidation 5 first (and thus ignore and delete invalidation 3), archive the month, and then archive invalidation 4 (the day period), then the month would maybe never be rearchived.

so in this scenario:

  • invalidation 4 (day) would be found first and archived (we order by period asc too)
  • then invalidation 5, (we'd perhaps delete all invalidations below 5 just after starting it)

then we're fine. I guess it's more complicated when invalidations are added in between archiving. For example:

  • invalidation 4 (day)
  • start on invalidation 5
  • meanwhile an invalidation for sep 14 is added back
  • after invalidation 5 is done, we see sep 14 and re-archive
  • the month is no longer accurate

Of course, this would only be an issue if somehow Sep 14 is invalidated but no parent periods are... which isn't the case unless there's a bug. In that case it would still work, I think.

Will try w/ idinvalidation DESC since it definitely won't work right w/ idinvalidation ASC.

Member

👍 it probably works with DESC since it should archive any lower period first anyway

Member Author

Another scenario:

  • invalidation 5 is started
  • new month invalidation comes in
  • we finish 5 and see the new month one; it's too soon since the last archive, so we don't start it, but we don't remove it either

this seems ok to me too?

@tsteur (Member) commented Sep 14, 2020

Btw, I was just checking the invalidations on the demo. There seem to be heaps of invalidations at the same second. It's idSite 60, which is a roll-up.

This is for the year periods:
[screenshot of year-period invalidations]

Not sure if that's a different issue or not. It should be kind of fixed with this PR, but ideally it wouldn't be creating that many entries in the first place. That's demo, not demo2.

This is for the month periods:
[screenshot of month-period invalidations]

@diosmosis (Member Author)

@tsteur that's strange... not sure what would invalidate so much... I'll debug core:archive on demo.

@diosmosis (Member Author)

can't reproduce on demo by executing core:archive manually... will try something else

@diosmosis (Member Author)

I think this has something to do with invalidating today, then invalidating yesterday. There are two duplicate week-period invalidations with the same timestamp and six duplicate month-period invalidations, also with the same timestamp. Not sure why it's two weeks and six months, though...

@diosmosis (Member Author)

@tsteur I finally found the cause of the issue: it's due to RollUpReporting handling Archiving.getIdSitesToMarkArchivesAsInvalidated. The same site gets invalidated for every child site that's invalidated. Not sure what the best fix is yet.

@tsteur (Member) commented Sep 15, 2020
@diosmosis ok, thanks. I thought I had a fix for this, but it didn't quite work out (in the rollupchildreninvalidate branch in the roll-ups repo). Generally, if a site had new traffic and it is part of a roll-up, then we also have to invalidate the roll-ups. In all other cases there should be no need to invalidate the parent roll-up. The only way I can think of to fix this right now is to hope that all duplicate entries would eventually be removed with this PR, since we'd fetch the highest invalidation ID and skip and remove lower IDs. But this would only happen if another archiving job runs while this one is archiving. Before inserting the invalidations, could we maybe at least remove duplicates, if that's an issue? I don't know if these would be inserted all at once or over different bulk inserts from different invalidations.
[screenshot of duplicate invalidations]

@diosmosis (Member Author)

But this would only happen if another archiving job runs while this one is archiving.

This shouldn't be the case? When we get the highest ID, we start it, and once it has successfully started, we remove all lower identical invalidations, all in the same archiving run.
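In other words, roughly (sketch; method names below are stand-ins for what the PR does, not exact signatures):

```php
// Sketch of the behaviour described above: once the highest-ID invalidation for a period
// is successfully started, the same run deletes the older duplicates for that period.
$started = $this->model->startArchive($invalidatedArchive);               // stand-in name
if ($started) {
    $this->model->deleteOlderDuplicateInvalidations($invalidatedArchive); // stand-in name
}
```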

@diosmosis (Member Author)

Don't know if these would be inserted all at once or over different bulk inserts from different invalidations

There's an event in the invalidate method, so in each separate invalidate call, demo adds the same idSite, resulting in the same periods being invalidated again. We could do a complex query before the batch insert to check whether every invalidation we're about to insert is already present, but that would make the method even more complex... It would be easy to test though, so I can add this.
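Something like this is what that pre-insert check could look like (sketch; getExistingInvalidations is hypothetical and the dedup key is simplified):

```php
// Hypothetical sketch: filter out invalidations that are already queued before the
// bulk insert, so repeated invalidate calls for the same roll-up don't pile up.
$existing = $model->getExistingInvalidations($idSite, $dates, $periods); // hypothetical helper

$toInsert = array_filter($invalidations, function ($inv) use ($existing) {
    $key = $inv['date1'] . '|' . $inv['period'] . '|' . $inv['name'];
    return !isset($existing[$key]); // keep only rows that are not already queued
});

// only $toInsert goes through the LOAD DATA INFILE bulk insert
```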

@tsteur (Member) commented Sep 16, 2020
👍 The query shouldn't be needed. I only meant if it's already in memory in PHP.

@diosmosis (Member Author)

I only meant if it's already in memory in PHP.

Ah, I see... we could use the transient cache with a TTL (since of course core:archive would otherwise just keep using the cache). It would make debugging a bit cleaner, so I can see why it would be useful.
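A minimal sketch of the in-memory TTL idea (plain static array here; whether to use Matomo's transient cache class instead is exactly the open question):

```php
// Minimal sketch of skipping repeat invalidations in memory with a TTL; a plain static
// array is used for illustration, the transient cache would be the real candidate.
private static $recentlyInvalidated = [];

private function wasRecentlyInvalidated($key, $ttlInSeconds = 3600)
{
    $now = time();

    if (isset(self::$recentlyInvalidated[$key])
        && ($now - self::$recentlyInvalidated[$key]) < $ttlInSeconds
    ) {
        return true; // same site/period was invalidated within the TTL, skip the duplicate
    }

    self::$recentlyInvalidated[$key] = $now;
    return false;
}
```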

@@ -112,8 +112,12 @@ public function invalidateOutdatedArchives()
return;
}

$cronArchive = new CronArchive();
$cronArchive->invalidateArchivedReportsForSitesThatNeedToBeArchivedAgain();
$idSites = Request::processRequest('SitesManager.getAllSites');
Member

Just FYI @diosmosis, you could have used the getAllSitesId method, but it's not too important. No need to change it.
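I.e. (for reference; getAllSitesId returns just the site IDs):

```php
// The slightly leaner call mentioned above:
$idSites = Request::processRequest('SitesManager.getAllSitesId');
```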

Member

Actually, I quickly changed it, just FYI.

@tsteur merged commit 1cb41a7 into 4.x-dev on Sep 20, 2020
@tsteur deleted the extra-archive-invalidation-check branch on September 20, 2020 at 22:30