Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initiate range archiving if an archive is invalidated, the request is from the browser, and browser archiving is authorized #17379

Merged
merged 26 commits into from Apr 12, 2021

Conversation

diosmosis
Copy link
Member

@diosmosis diosmosis commented Mar 23, 2021

Description:

Currently, if a range archive is invalid, and the user views it, it will not re-archive. This is because ranges are generally archived well after the end date, so the ts_archived will always be later than the minimum processed time: https://github.com/matomo-org/matomo/blob/3.x-dev/core/ArchiveProcessor/Loader.php#L223-L236

I think, previously, if going through the "trigger archiving" code path (https://github.com/matomo-org/matomo/blob/4.x-dev/core/Archive.php#L571), we never looked for invalidated archives, so we'd always initiate archiving. This change happened in 3c2b5f7 to make sure we don't initiate archiving of invalidated archives under certain circumstances if we don't have to.

Also for some reason a test was removed in a commit (but the test data was still there)... I put it back.

FYI @tsteur

Review

  • Functional review done
  • Potential edge cases thought about (behavior of the code with strange input, with strange internal state or possible interactions with other Matomo subsystems)
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

@diosmosis diosmosis added Needs Review PRs that need a code review Regression Indicates a feature used to work in a certain way but it no longer does even though it should. labels Mar 23, 2021
@diosmosis diosmosis added this to the 4.3.0 milestone Mar 23, 2021
@tsteur
Copy link
Member

tsteur commented Mar 23, 2021

makes sense 👍

@diosmosis
Copy link
Member Author

@tsteur added two more tests: one where we track a visit in the past, run core:archive and make sure it incremented the data, and another where we track a visit in the past after archiving a range, then rearchive the range via a direct API request, and check that it updated.

Caught a bug in the first test. In

$minArchiveProcessedTime = $isPeriodIncludesToday ? Date::now()->subSeconds(Rules::getPeriodArchiveTimeToLiveDefault($params->getPeriod()->getLabel())) : null;
we check if there is an existing archive that we can use, and if so, we don't invalidate. This was for today archives since they would be invalidated often and we didn't want to initiate archiving if there was already one that was created within the TTL value.

Unfortunately, we default to null for other periods, so if we find any archive regardless of when it was created, we don't invalidate, and don't insert anything into the archive_invalidations table. The bug manifests like this:

  • data is archived (time 2020-03-23 12:00:00)
  • user tracks data in the past for a day (2012-08-09) (time 2020-03-23 22:30:00)
  • next core:archive, we see the day in the list to invalidate
  • we check whether there is currently an archive there. there is w/ DONE_OK, created 2020-03-23 12:00:00.
  • we don't invalidate because the archive is there, even though there's been a new visit
  • the data is not rearchived

@diosmosis
Copy link
Member Author

@tsteur actually just decided it was better to not perform the check if the date didn't include today. I'm thinking this would avoid edge cases when rearchiving data in the past (eg, report specific or segment archives). For periods that include today, they will eventually get archived again, so for (eg) a Funnel that is rearchived, we'd archive the months and days not including today, then a little while later the 'today' included archives (ie, day, week, month year) will get archived, including the old data.

This makes me think of some other edge cases w/ periods that include today as well, will think about them.

@@ -904,7 +904,11 @@ public function isThereExistingValidPeriod(Parameters $params)
$isYesterday = $params->getPeriod()->getLabel() == 'day' && $params->getPeriod()->getDateStart()->toString() == Date::factory('yesterday')->toString();

$isPeriodIncludesToday = $params->getPeriod()->isDateInPeriod($today);
$minArchiveProcessedTime = $isPeriodIncludesToday ? Date::now()->subSeconds(Rules::getPeriodArchiveTimeToLiveDefault($params->getPeriod()->getLabel())) : null;
if (!$isPeriodIncludesToday) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis can we maybe add a comment why we skip here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically we could still keep the logic and it would just work the next time it runs after minArchiveProcessedTime has passed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsteur I left a comment w/ my reasoning here: #17379 (comment)

I was thinking there could be weird edge cases where we rearchive say a month of data in the past, but only some of it gets archived. Eg, let's say we try to archive a specific funnel for the last month:

  • user invalidates a day from a week ago (or a visit is tracked in the past for that day)
  • archiver runs, the day archive from a week ago is rearchived (ts_archived = '2020-03-24 19:00:00')
  • user changes a funnel, last month is scheduled to rearchive
  • archiver runs, every day is invalidated except the day from a week ago. the data is archived but incorrect.

This isn't an issue I think for tracking in the past so we could keep it in that case, because if we don't invalidate, it's not removed from the "remember to invalidate" option values. So we'll do it again. But other types of invalidation it probably shouldn't be done.

Though only issue i can think of w/ this change for tracking in the past is if visits keep coming in, we keep invalidating, except we only do this once per site before archiving, so I don't think that will happen...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user changes a funnel, last month is scheduled to rearchive

do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran? Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran?

In this specific example, yes. I suspect there are many strange edge cases like this that could happen.

Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?

I guess it's better to use an example of last2 months, where there's no period that contains today, and the new visit is tracked in the month before last.

It would invalidate the adjacent days, and the week. When the week gets archived, it would see there is a usable day archive and just use it since value = DONE_OK and ts_archived is recent and date does not include today (if my reasoning is correct).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsteur took another look, and my reasoning for this was off, so removed the code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could merge the remaining changes then once tests pass?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 will do, working on test fixes now

@tsteur
Copy link
Member

tsteur commented Mar 26, 2021

@diosmosis there are some tests failing. Is it maybe due to this PR?

@diosmosis
Copy link
Member Author

@tsteur fixed the test that was failing, the ArchiveInvalidationTest was kind of broken, fixed a bunch of things there and found an issue in Loader.php. W/ browser archiving, the invalidate doesn't happen early enough, so moved it. Will merge if tests pass on travis.

@diosmosis
Copy link
Member Author

@tsteur fixed the tests and modified invalidation in CronArchive.php again to do so. Here is how the invalidation uses the ttl:

  • if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.
  • if doing scheduled rearchiving, we don't perform the check and just make sure the archives are invalidated so they will be processed
  • if invalidating 'today' (or the period includes today), we perform the TTL check to make sure it is archived again
  • if invalidating 'yesterday', we don't perform the TTL check and look for any archive (this is ONLY for the invalidateRecentDate function which is there to check if the date has changed since yesterday)

Does this sound right to you?

@tsteur
Copy link
Member

tsteur commented Mar 30, 2021

@diosmosis sounds right. I checked the documentation in https://github.com/matomo-org/matomo/blob/4.2.0/config/global.ini.php#L323-L340 again and it is indeed only applied to periods including today.

if data was tracked to a past date, then we check that if there was a recent archive within the TTL, if so we use it. next core:archive run we'd check again.

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

@diosmosis
Copy link
Member Author

@tsteur

so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly.

does that make sense?

It makes sense. I left it, since in this case, it should be fine to have the TTL check (we will always try and invalidate again). It might also be useful if a user tracks lots of data in the past continuously? Like the cloud user that tracks data to yesterday instead of today. Then if the ttl is 6hrs, but we run core:archive every hour, maybe we'd still want to wait until the archives are out of date before invalidating?

@tsteur
Copy link
Member

tsteur commented Mar 30, 2021

Yes that's good 👍

…child archive or when all child archives are usable while still respecting ttl for periods that include today
@diosmosis
Copy link
Member Author

diosmosis commented Apr 1, 2021

@tsteur last commit will make sure ranges are only force archived in the following situations:

  • range is invalid and we are allowed to archive day archives in the same request
  • range is invalid and we are not allowed to archive day archives in same request, but every child archive is usable (not invalid)

If the range contains today, then we rely only on the archive becoming too old (so we respect the TTL).

I also made sure the same logic worked when browser triggered archiving of segments was allowed.

I tried to make the logic as clear as possible and the added query is only executed if all other checks fail and only for range archives. Added several new tests as well. Need to add a couple more to ArchiveSelectorTest if this all makes sense to you.

When reviewing the pr I noticed the new test I wrote previously in CronArchiveTest was incorrect, so found these edge cases this way.

// made invalid, we will correctly re-archive below.
if ($this->invalidateBeforeArchiving) {
$this->invalidatedReportsIfNeeded();
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Invalidating before we look for an archive, otherwise, the invalidation won't happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports sets this to true when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsteur I think it's required for browser archiving to support the "tracking data in the past" feature. Otherwise, the invalidations that get queued in the option table never get applied. We could add isBrowserArchivingEnabled as a safety. I don't think we can restrict to period because it's meant to execute on every query (basically it just does this: "are there queued invalidations? if so, let's invalidate before we launch archiving"). Since it's just for browser archiving I don't think we have to care too much about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis this could be still executed though when browser archiving is disabled right? And then because archive php is not triggered it might invalidate them? Might be good to test this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I think in the old code we didn't invalidate in core:archive, so maybe it was all done here... I'll add the check for browser archiving being enabled.

if ($endDateTimestamp) {
// past archive
return $endDateTimestamp;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if we see a range period and we are allowed to archive and we're in a browser request, we check the ttl no matter what.

So if range archiving via browser is forced, we always archive in the browser every time the ttl is no longer valid.

If range archiving is not allowed via browser, we use the default logic (if the range does not include today, use latest archive that was archived after end date). If it includes today, we check the TTL using the existing code.

@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite)
'date' => $date->getDatetime(),
]);

$this->invalidateWithSegments([$idSite], $date->toString(), 'day');
$isYesterday = $dateStr == 'yesterday';
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this and below code skips a particular check if we are invalidating yesterday (not a date including yesterday, just yesterday; we do this like we're invalidating today in case the day has changed and we want to initiate archiving for yesterday).

Since we only care if the day changed since the archive was created, we want to check if there's any archive available that we can use, so we don't want to do a ttl check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we maybe also add inline a comment for this explaining this behaviour?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

if (!empty($idArchives)
&& !$this->params->getArchiveOnlyReport()
&& !Rules::isForceArchivingSinglePlugin()
&& !$this->shouldForceInvalidatedArchive($value, $tsArchived)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: if an archive is found to be invalidated, we check if we should force archiving. This check was originally in ArchiveSelector, but I moved it since ArchiveSelector is used in multiple places now.

@diosmosis
Copy link
Member Author

@tsteur made some changes ready for another review

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis looks good so far. Added a few comments that may need looking at.

Also I'm thinking it may be useful to add a new method isArchivingEnabledFor like below since there are a lot of ! and it makes the code maybe more easily understandable. I'm sure we could use that method eventually in a lot of places and change a lot of isArchivingDisabledFor calls

diff --git a/core/ArchiveProcessor/Loader.php b/core/ArchiveProcessor/Loader.php
index 4e792338cc..f79714e882 100644
--- a/core/ArchiveProcessor/Loader.php
+++ b/core/ArchiveProcessor/Loader.php
@@ -275,7 +275,7 @@ class Loader
     {
         // for range periods we can archive in a browser request request, make sure to check for the ttl no matter what
         $isRangeArchiveAndArchivingEnabled = $this->params->getPeriod()->getLabel() == 'range'
-            && !Rules::isArchivingDisabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
+            && Rules::isArchivingEnabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
 
         if (!$isRangeArchiveAndArchivingEnabled) {
             $endDateTimestamp = self::determineIfArchivePermanent($this->params->getDateEnd());
@@ -444,7 +444,7 @@ class Loader
 
         // the archive is invalidated and we are in a browser request that is allowed archive it
         if ($value == ArchiveWriter::DONE_INVALIDATED
-            && !Rules::isArchivingDisabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
+            && Rules::isArchivingEnabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
         ) {
             // if coming from core:archive, force rearchiving, since if we don't the entry will be removed from archive_invalidations
             // w/o being rearchived
diff --git a/core/ArchiveProcessor/Rules.php b/core/ArchiveProcessor/Rules.php
index 947611e5c5..27d505a7ea 100644
--- a/core/ArchiveProcessor/Rules.php
+++ b/core/ArchiveProcessor/Rules.php
@@ -197,6 +197,11 @@ class Rules
         return !$generalConfig['browser_archiving_disabled_enforce'];
     }
 
+    public static function isArchivingEnabledFor(array $idSites, Segment $segment, $periodLabel)
+    {
+        return !self::isArchivingDisabledFor($idSites, $segment, $periodLabel);
+    }
+
     public static function isArchivingDisabledFor(array $idSites, Segment $segment, $periodLabel)
     {
         $generalConfig = Config::getInstance()->General;

// made invalid, we will correctly re-archive below.
if ($this->invalidateBeforeArchiving) {
$this->invalidatedReportsIfNeeded();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports sets this to true when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled).

}

// if coming from a browser request, and period does not contain today, force rearchiving
if (!$params->getPeriod()->isDateInPeriod(Date::factory('today'))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add the timezone to today depending on the idsite maybe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite)
'date' => $date->getDatetime(),
]);

$this->invalidateWithSegments([$idSite], $date->toString(), 'day');
$isYesterday = $dateStr == 'yesterday';
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we maybe also add inline a comment for this explaining this behaviour?

@diosmosis
Copy link
Member Author

@tsteur applied review feedback

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if tests pass 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review Regression Indicates a feature used to work in a certain way but it no longer does even though it should.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants