@diosmosis opened this Pull Request on March 23rd 2021 Member

Description:

Currently, if a range archive is invalid, and the user views it, it will not re-archive. This is because ranges are generally archived well after the end date, so the ts_archived will always be later than the minimum processed time: https://github.com/matomo-org/matomo/blob/3.x-dev/core/ArchiveProcessor/Loader.php#L223-L236

I think, previously, if going through the "trigger archiving" code path (https://github.com/matomo-org/matomo/blob/4.x-dev/core/Archive.php#L571), we never looked for invalidated archives, so we'd always initiate archiving. This change happened in https://github.com/matomo-org/matomo/commit/3c2b5f714f85457796e3e190781d3e592822ebf7 to make sure we don't initiate archiving of invalidated archives under certain circumstances if we don't have to.

Also for some reason a test was removed in a commit (but the test data was still there)... I put it back.

FYI @tsteur

Review

  • [ ] Functional review done
  • [ ] Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • [ ] Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • [ ] Security review done see checklist
  • [ ] Code review done
  • [ ] Tests were added if useful/possible
  • [ ] Reviewed for breaking changes
  • [ ] Developer changelog updated if needed
  • [ ] Documentation added if needed
  • [ ] Existing documentation updated if needed
@tsteur commented on March 23rd 2021 Member

makes sense 👍

@diosmosis commented on March 24th 2021 Member

@tsteur added two more tests: one where we track a visit in the past, run core:archive and make sure it incremented the data, and another where we track a visit in the past after archiving a range, then rearchive the range via a direct API request, and check that it updated.

Caught a bug in the first test. In https://github.com/matomo-org/matomo/blob/2b35b46a0988f8c3b6f848c59be8d7d918965a52/core/CronArchive.php#L907 we check if there is an existing archive that we can use, and if so, we don't invalidate. This was for today archives since they would be invalidated often and we didn't want to initiate archiving if there was already one that was created within the TTL value.

Unfortunately, we default to null for other periods, so if we find any archive regardless of when it was created, we don't invalidate, and don't insert anything into the archive_invalidations table. The bug manifests like this:

  • data is archived (time 2020-03-23 12:00:00)
  • user tracks data in the past for a day (2012-08-09) (time 2020-03-23 22:30:00)
  • next core:archive, we see the day in the list to invalidate
  • we check whether there is currently an archive there. there is w/ DONE_OK, created 2020-03-23 12:00:00.
  • we don't invalidate because the archive is there, even though there's been a new visit
  • the data is not rearchived
@diosmosis commented on March 24th 2021 Member

@tsteur actually just decided it was better to not perform the check if the date didn't include today. I'm thinking this would avoid edge cases when rearchiving data in the past (eg, report specific or segment archives). For periods that include today, they will eventually get archived again, so for (eg) a Funnel that is rearchived, we'd archive the months and days not including today, then a little while later the 'today' included archives (ie, day, week, month year) will get archived, including the old data.

This makes me think of some other edge cases w/ periods that include today as well, will think about them.

@tsteur commented on March 26th 2021 Member

@diosmosis there are some tests failing. Is it maybe due to this PR?

@diosmosis commented on March 29th 2021 Member

@tsteur fixed the test that was failing, the ArchiveInvalidationTest was kind of broken, fixed a bunch of things there and found an issue in Loader.php. W/ browser archiving, the invalidate doesn't happen early enough, so moved it. Will merge if tests pass on travis.

@diosmosis commented on March 30th 2021 Member

@tsteur fixed the tests and modified invalidation in CronArchive.php again to do so. Here is how the invalidation uses the ttl:

  • if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.
  • if doing scheduled rearchiving, we don't perform the check and just make sure the archives are invalidated so they will be processed
  • if invalidating 'today' (or the period includes today), we perform the TTL check to make sure it is archived again
  • if invalidating 'yesterday', we don't perform the TTL check and look for any archive (this is ONLY for the invalidateRecentDate function which is there to check if the date has changed since yesterday)

Does this sound right to you?

@tsteur commented on March 30th 2021 Member

@diosmosis sounds right. I checked the documentation in https://github.com/matomo-org/matomo/blob/4.2.0/config/global.ini.php#L323-L340 again and it is indeed only applied to periods including today.

if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

@diosmosis commented on March 30th 2021 Member

@tsteur

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

It makes sense. I left it, since in this case, it should be fine to have the TTL check (we will always try and invalidate again). It might also be useful if a user tracks lots of data in the past continuously? Like the cloud user that tracks data to yesterday instead of today. Then if the ttl is 6hrs, but we run core:archive every hour, maybe we'd still want to wait until the archives are out of date before invalidating?

@tsteur commented on March 30th 2021 Member

Yes that's good 👍

@diosmosis commented on April 1st 2021 Member

@tsteur last commit will make sure ranges are only force archived in the following situations:

  • range is invalid and we are allowed to archive day archives in the same request
  • range is invalid and we are not allowed to archive day archives in same request, but every child archive is usable (not invalid)

If the range contains today, then we rely only on the archive becoming too old (so we respect the TTL).

I also made sure the same logic worked when browser triggered archiving of segments was allowed.

I tried to make the logic as clear as possible and the added query is only executed if all other checks fail and only for range archives. Added several new tests as well. Need to add a couple more to ArchiveSelectorTest if this all makes sense to you.

When reviewing the pr I noticed the new test I wrote previously in CronArchiveTest was incorrect, so found these edge cases this way.

@diosmosis commented on April 7th 2021 Member

@tsteur made some changes ready for another review

@diosmosis commented on April 10th 2021 Member

@tsteur applied review feedback

This Pull Request was closed on April 12th 2021
Powered by GitHub Issue Mirror