@diosmosis opened this Pull Request on December 1st 2020 Member

Description:

As per the title. Race conditions could still cause duplicates, but they should be much rarer. In this case I don't think we need to delete duplicate invalidations after finalizing an archive.
FYI @tsteur

Review

  • [ ] Functional review done
  • [ ] Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • [ ] Security review done (see checklist)
  • [ ] Code review done
  • [ ] Tests were added if useful/possible
  • [ ] Reviewed for breaking changes
  • [ ] Developer changelog updated if needed
  • [ ] Documentation added if needed
  • [ ] Existing documentation updated if needed
@sgiehl commented on December 1st 2020 Member

@diosmosis the code looks fine, but some tests are failing.
Also, I'm not sure how to test this properly. Are there any specific test cases that should be fixed now?

@diosmosis commented on December 1st 2020 Member

@sgiehl there are changes to tests for this. If an invalidation already exists in the table, we don't insert another.

@tsteur commented on December 2nd 2020 Member

@diosmosis is this purely to reduce IO? I'm thinking the query in getExistingInvalidations might in the end cause more IO than just inserting the entry and deleting it later (not 100% sure though). We could hold off on this PR and see in production whether IO is actually an issue.

@diosmosis commented on December 2nd 2020 Member

@tsteur

is this purely to reduce IO?

Yes, mostly. Alternatively, instead of doing the GROUP BY, we could run a separate check for each invalidation we want to insert (after determining whether it is a problem or not). The most common case will be finding at most one existing row per invalidation to insert.
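
For illustration, here's a minimal sketch of the GROUP BY approach. This is an assumption-laden illustration, not the actual getExistingInvalidations() query; `$idSite` is assumed to be provided by the caller, and the table/column names follow the archive_invalidations schema:

```php
<?php
use Piwik\Common;
use Piwik\Db;

// Fetch existing not-yet-started invalidations for one site, grouped so each
// (name, period, date range) combination comes back once with its count.
$table = Common::prefixTable('archive_invalidations');
$existing = Db::fetchAll(
    "SELECT name, period, date1, date2, COUNT(*) AS `count`
       FROM `$table`
      WHERE idsite = ? AND status = 0
   GROUP BY name, period, date1, date2",
    [$idSite]
);
// Any pending invalidation that already appears in $existing can be skipped.
```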

It's also there to make sure we don't end up with duplicates in the table (so we don't have to worry about them at all). Either this PR or deleting duplicates after writing archives is needed; I think we have to implement at least one of them to avoid too many dupes.

@tsteur commented on December 2nd 2020 Member

It's just that it basically fetches all invalidations for one site, and if I see this correctly, it does that for each date that gets invalidated, etc. We could merge it, though, and try to monitor the impact.

I wonder if it will cause an issue when we have these invalidations:

1) month Feb 2020 already exists

2) Then we invalidate day for 20-02-20 and Feb 2020

3) It would not add "Feb 2020" because it already exists

4) Then an existing archive run starts archiving "Feb 2020" (because there is no more day archive to pick up, because of ts_invalidated < queueConsumerIdSiteTime). This finishes

5) Then another archiver starts the day, but there is no Feb 2020 invalidation left that would need to be done after the day invalidation?

I'm thinking there could be issues.

A similar issue could happen when removing duplicates on archive finalize, I think.
And so far the only reliable way I found to avoid these kinds of things was isInvalidationsScheduledForSite, basically invalidating less often, indirectly :(

@diosmosis commented on December 2nd 2020 Member

@tsteur

It's just that it basically fetches all invalidations for one site, and if I see this correctly, it does that for each date that gets invalidated, etc.

It does a GROUP BY, so it shouldn't select duplicates, and we could also restrict it to the periods requested.
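
For example, the lookup could be narrowed to only the rows we are about to insert. A sketch continuing the one above; `$invalidationsToInsert` and its array keys are hypothetical names, not the actual code:

```php
<?php
// Build one (name, period, date1, date2) condition per pending invalidation
// so the SELECT only touches rows that could actually be duplicates.
// Assumes $table and $idSite from the earlier sketch.
$conditions = [];
$bind = [$idSite];
foreach ($invalidationsToInsert as $inv) { // hypothetical pending rows
    $conditions[] = '(name = ? AND period = ? AND date1 = ? AND date2 = ?)';
    array_push($bind, $inv['name'], $inv['period'], $inv['date1'], $inv['date2']);
}

$sql = "SELECT name, period, date1, date2, COUNT(*) AS `count`
          FROM `$table`
         WHERE idsite = ? AND status = 0
           AND (" . implode(' OR ', $conditions) . ")
      GROUP BY name, period, date1, date2";
$existing = \Piwik\Db::fetchAll($sql, $bind);
```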

Then another archiver starts the day, but there is no Feb 2020 invalidation left that would need to be done after the day invalidation?

This seems like an issue too... but I think it would be solved if we kept the latest duplicate?

@diosmosis commented on December 2nd 2020 Member

@tsteur wait, is the scenario you described actually an issue? (trying to understand)

1) month Feb 2020 already exists (because an archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
2) Then we invalidate day for 20-02-20 and Feb 2020; now we have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
3) an archiver that is already running, or a new one, starts on this idsite (with now = 2020-12-01 03:05:30, i.e. after the invalidation in step 2 in this scenario)
4) the archiver processes day + month just fine

In this case it looks like it's fine, but if we're in the middle of an archive, that might be an issue, perhaps? Like:

1) month Feb 2020 already exists (because an archiver was terminated or something) [ts_invalidated = 2020-12-01 03:04:00]
2) archiver starts processing at 2020-12-01 03:04:30; Feb 2020 is worked on but does not finish
3) Then we invalidate day for 20-02-20 and Feb 2020; now we have Feb 2020 (original) [status = 1, ts_invalidated = 2020-12-01 03:04:00], 20-02-20 [ts_invalidated = 2020-12-01 03:05:00], Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
4) the archiver finishes and we just have 20-02-20 [ts_invalidated = 2020-12-01 03:05:00], Feb 2020 [ts_invalidated = 2020-12-01 03:05:00]
5) the archiver runs again and everything is fine

OK, another one:

1) 02-20-2020 + Feb 2020 already exist (because an archiver was terminated or something): 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
2) archiver starts and picks up the day, so 02-20-2020 [ts_invalidated = 2020-12-01 03:04:00, status = 1], Feb 2020 [ts_invalidated = 2020-12-01 03:04:00]
3) for some reason, Feb 2020 by itself is invalidated (it's possible to invalidate just a month, so this can happen), but nothing is inserted
4) the archive for the day finishes, then Feb 2020 starts and finishes; this is all fine, since the month was processed after the invalidation was requested. Still no issue.

Seems like the main possible issues are tiny race conditions (something getting checked and inserted before status is set to 1), or the archive_invalidations table being left in an inconsistent state (with some periods invalidated while related periods are not). I don't think it's an issue for larger periods without smaller periods: in that case we just use the smaller periods as they are, and if they are invalidated later, we end up processing the larger periods with them no matter what. It's only an issue if there is a smaller period and we don't handle the larger periods (like a day put down, but month/week/year are not), causing inconsistent data. Am I right in thinking this?

If this is right, we could maybe attempt to fix the table as we go along. I.e., when we see a day, we make sure there's a week/month/year added (hopefully a quick SELECT for the periods, then an insert if we have to, though it shouldn't happen that often). Or we see a week and check for month/year. The only thing that wouldn't necessarily work is ranges, but I think that's normal unless they are totally blocked from being archived through the browser.
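
A hypothetical sketch of that repair step (the function name, and the assumption that inserting here is safe, are mine, not the actual Matomo code):

```php
<?php
use Piwik\Common;
use Piwik\Db;
use Piwik\Period;
use Piwik\Period\Factory;

// When a day invalidation is picked up, make sure the enclosing
// week/month/year invalidations exist too, inserting any that are missing.
function ensureLargerPeriodsInvalidated(int $idSite, string $name, Period $day): void
{
    $table = Common::prefixTable('archive_invalidations');
    foreach (['week', 'month', 'year'] as $label) {
        $period = Factory::build($label, $day->getDateStart());
        $date1  = $period->getDateStart()->toString();
        $date2  = $period->getDateEnd()->toString();

        $count = Db::fetchOne(
            "SELECT COUNT(*) FROM `$table`
              WHERE idsite = ? AND name = ? AND period = ?
                AND date1 = ? AND date2 = ? AND status = 0",
            [$idSite, $name, $period->getId(), $date1, $date2]
        );
        if ($count == 0) { // missing larger period: repair the table
            Db::query(
                "INSERT INTO `$table` (idarchive, name, idsite, date1, date2, period, ts_invalidated)
                 VALUES (NULL, ?, ?, ?, ?, ?, NOW())",
                [$name, $idSite, $date1, $date2, $period->getId()]
            );
        }
    }
}
```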

Does any of this seem legit?

@diosmosis commented on December 2nd 2020 Member

One other issue is users using FixedSiteIds while SharedSiteIds is in progress. It would be good to not allow multiple sites to be archived together, at least through core:archive, I think...

@tsteur commented on December 2nd 2020 Member

If this is right, we could maybe attempt to fix the table as we go along. I.e., when we see a day, we make sure there's a week/month/year added (hopefully a quick SELECT for the periods, then an insert if we have to, though it shouldn't happen that often). Or we see a week and check for month/year. The only thing that wouldn't necessarily work is ranges, but I think that's normal unless they are totally blocked from being archived through the browser.

Not sure. It seems things are getting more and more complicated, and it's slowly getting hard to tell what side effects each change causes.

Generally I think we can proceed for now with this duplicate check, but we should maybe only skip the insertion if there's more than one existing entry. We'd basically want to avoid having 3 or more duplicates. There might still be issues and race conditions then, but it would lower the risk. Low risk might not be good enough, though I think it would be fine as a temporary workaround until we find something better.
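
In code, the relaxed rule might look like this (a sketch; `$existingCounts` is assumed to map each pending invalidation to the count returned by the duplicate lookup, and the variable names are hypothetical):

```php
<?php
// Tolerate one existing duplicate so a fresh invalidation can still land
// while an identical older one is being worked on, but never create a third.
foreach ($invalidationsToInsert as $key => $inv) { // hypothetical names
    if (($existingCounts[$key] ?? 0) > 1) {
        unset($invalidationsToInsert[$key]); // would be a 3rd+ duplicate, skip
    }
}
```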

Generally I'm mostly mentioning this because of this recent change https://github.com/matomo-org/matomo/pull/16844/files#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R718 which could potentially cause such race conditions quite easily.

I'm not sure if this is what you meant in the previous comment, but maybe we simply need to create the week/month/year invalidations on demand. Meaning when a day archive finishes, we insert the week invalidation; when the week finishes, we insert the month invalidation, etc. Not sure how feasible this is.
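
A hypothetical sketch of that cascade (names are illustrative, not an actual Matomo hook; the insert itself would look like the repair sketch above):

```php
<?php
use Piwik\Period;
use Piwik\Period\Factory;

// Map each period label to the next-larger one to invalidate on demand.
const NEXT_PERIOD = ['day' => 'week', 'week' => 'month', 'month' => 'year'];

// Called when an archive for $period finishes: instead of inserting all the
// larger-period invalidations up front, queue only the enclosing one.
function onArchiveFinished(int $idSite, string $name, Period $period): void
{
    $label = $period->getLabel();
    if (!array_key_exists($label, NEXT_PERIOD)) {
        return; // year (or range) has no larger period to cascade to
    }
    $larger = Factory::build(NEXT_PERIOD[$label], $period->getDateStart());
    // insert an invalidation row for $larger here (see the earlier sketch)
}
```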

@diosmosis commented on December 2nd 2020 Member

Generally I think we can proceed for now with this duplicate check, but we should maybe only skip the insertion if there's more than one existing entry.

This is what I was thinking as well.

I'm not sure if this is what you meant in the previous comment, but maybe we simply need to create the week/month/year invalidations on demand. Meaning when a day archive finishes, we insert the week invalidation.

Yes this is what I meant in my last comment.

@tsteur commented on December 2nd 2020 Member

Yes this is what I meant in my last comment.

Great, yes, I think that would be a good approach.

@diosmosis commented on December 8th 2020 Member

@tsteur made some changes (applied the status = 0 condition and added more conditions to select less data)

@tsteur commented on December 8th 2020 Member

OK, that should work re the periods and name in the query. I'm wondering though if the WHERE clause could get quite large? Wonder if we should ignore the periods if the list gets longer than, say, 1000 characters or so. We could ignore this for now and do something when it actually causes an issue; it might be good to have some upper limit, but then we might fetch a lot of data, so I guess either way is fine.

@diosmosis commented on December 8th 2020 Member

@tsteur we do one query per archive table in ArchiveInvalidator, so there should already be an effective upper limit on the query length, based on the max number of periods in a month (31 + 4 + 1 + 1, I think).
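
Spelling out that bound (my reading of the numbers above, assuming one query per monthly archive table):

```php
<?php
// Periods that can fall within a single month, which bounds the number of
// (name, period, date) tuples a single per-table query has to cover:
$maxPeriodsPerMonth = 31  // day periods
                    + 4   // week periods
                    + 1   // the month itself
                    + 1;  // the year
echo $maxPeriodsPerMonth; // 37
```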

@tsteur commented on December 8th 2020 Member

👍 great, feel free to merge once tests pass

This Pull Request was closed on December 8th 2020