New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initiate range archiving if an archive is invalidated, the request is from the browser, and browser archiving is authorized #17379
Conversation
…the browser, and browser archiving for the current request is authorized.
makes sense 👍 |
@tsteur added two more tests: one where we track a visit in the past, run core:archive and make sure it incremented the data, and another where we track a visit in the past after archiving a range, then rearchive the range via a direct API request, and check that it updated. Caught a bug in the first test. In Line 907 in 2b35b46
Unfortunately, we default to
|
@tsteur actually just decided it was better to not perform the check if the date didn't include today. I'm thinking this would avoid edge cases when rearchiving data in the past (eg, report specific or segment archives). For periods that include today, they will eventually get archived again, so for (eg) a Funnel that is rearchived, we'd archive the months and days not including today, then a little while later the 'today' included archives (ie, day, week, month year) will get archived, including the old data. This makes me think of some other edge cases w/ periods that include today as well, will think about them. |
core/CronArchive.php
Outdated
@@ -904,7 +904,11 @@ public function isThereExistingValidPeriod(Parameters $params) | |||
$isYesterday = $params->getPeriod()->getLabel() == 'day' && $params->getPeriod()->getDateStart()->toString() == Date::factory('yesterday')->toString(); | |||
|
|||
$isPeriodIncludesToday = $params->getPeriod()->isDateInPeriod($today); | |||
$minArchiveProcessedTime = $isPeriodIncludesToday ? Date::now()->subSeconds(Rules::getPeriodArchiveTimeToLiveDefault($params->getPeriod()->getLabel())) : null; | |||
if (!$isPeriodIncludesToday) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis can we maybe add a comment why we skip here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically we could still keep the logic and it would just work the next time it runs after minArchiveProcessedTime
has passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsteur I left a comment w/ my reasoning here: #17379 (comment)
I was thinking there could be weird edge cases where we rearchive say a month of data in the past, but only some of it gets archived. Eg, let's say we try to archive a specific funnel for the last month:
- user invalidates a day from a week ago (or a visit is tracked in the past for that day)
- archiver runs, the day archive from a week ago is rearchived (ts_archived = '2020-03-24 19:00:00')
- user changes a funnel, last month is scheduled to rearchive
- archiver runs, every day is invalidated except the day from a week ago. the data is archived but incorrect.
This isn't an issue I think for tracking in the past so we could keep it in that case, because if we don't invalidate, it's not removed from the "remember to invalidate" option values. So we'll do it again. But other types of invalidation it probably shouldn't be done.
Though only issue i can think of w/ this change for tracking in the past is if visits keep coming in, we keep invalidating, except we only do this once per site before archiving, so I don't think that will happen...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
user changes a funnel, last month is scheduled to rearchive
do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran? Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I get it right it wouldn't invalidate the day from a week ago because of a race condition if this happens shortly after the archiver ran?
In this specific example, yes. I suspect there are many strange edge cases like this that could happen.
Technically, when scheduling to rearchive last month then it would also invalidate each day and week so they would be archived again?
I guess it's better to use an example of last2 months, where there's no period that contains today, and the new visit is tracked in the month before last.
It would invalidate the adjacent days, and the week. When the week gets archived, it would see there is a usable day archive and just use it since value = DONE_OK and ts_archived is recent and date does not include today (if my reasoning is correct).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK thanks 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsteur took another look, and my reasoning for this was off, so removed the code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could merge the remaining changes then once tests pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 will do, working on test fixes now
@diosmosis there are some tests failing. Is it maybe due to this PR? |
@tsteur fixed the test that was failing, the ArchiveInvalidationTest was kind of broken, fixed a bunch of things there and found an issue in Loader.php. W/ browser archiving, the invalidate doesn't happen early enough, so moved it. Will merge if tests pass on travis. |
@tsteur fixed the tests and modified invalidation in CronArchive.php again to do so. Here is how the invalidation uses the ttl:
Does this sound right to you? |
@diosmosis sounds right. I checked the documentation in https://github.com/matomo-org/matomo/blob/4.2.0/config/global.ini.php#L323-L340 again and it is indeed only applied to periods including today.
so maybe here we would actually indeed not check the TTL which I think you had in the code initially and I was asking for more information if I remember correctly. does that make sense? |
It makes sense. I left it, since in this case, it should be fine to have the TTL check (we will always try and invalidate again). It might also be useful if a user tracks lots of data in the past continuously? Like the cloud user that tracks data to yesterday instead of today. Then if the ttl is 6hrs, but we run core:archive every hour, maybe we'd still want to wait until the archives are out of date before invalidating? |
Yes that's good 👍 |
…child archive or when all child archives are usable while still respecting ttl for periods that include today
@tsteur last commit will make sure ranges are only force archived in the following situations:
If the range contains today, then we rely only on the archive becoming too old (so we respect the TTL). I also made sure the same logic worked when browser triggered archiving of segments was allowed. I tried to make the logic as clear as possible and the added query is only executed if all other checks fail and only for range archives. Added several new tests as well. Need to add a couple more to ArchiveSelectorTest if this all makes sense to you. When reviewing the pr I noticed the new test I wrote previously in CronArchiveTest was incorrect, so found these edge cases this way. |
…iod is used and archiving is enabled for the current request/period
// made invalid, we will correctly re-archive below. | ||
if ($this->invalidateBeforeArchiving) { | ||
$this->invalidatedReportsIfNeeded(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Invalidating before we look for an archive, otherwise, the invalidation won't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports
sets this to true
when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled
or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tsteur I think it's required for browser archiving to support the "tracking data in the past" feature. Otherwise, the invalidations that get queued in the option table never get applied. We could add isBrowserArchivingEnabled
as a safety. I don't think we can restrict to period because it's meant to execute on every query (basically it just does this: "are there queued invalidations? if so, let's invalidate before we launch archiving"). Since it's just for browser archiving I don't think we have to care too much about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis this could be still executed though when browser archiving is disabled right? And then because archive php is not triggered it might invalidate them? Might be good to test this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, I think in the old code we didn't invalidate in core:archive, so maybe it was all done here... I'll add the check for browser archiving being enabled.
if ($endDateTimestamp) { | ||
// past archive | ||
return $endDateTimestamp; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if we see a range period and we are allowed to archive and we're in a browser request, we check the ttl no matter what.
So if range archiving via browser is forced, we always archive in the browser every time the ttl is no longer valid.
If range archiving is not allowed via browser, we use the default logic (if the range does not include today, use latest archive that was archived after end date). If it includes today, we check the TTL using the existing code.
@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite) | |||
'date' => $date->getDatetime(), | |||
]); | |||
|
|||
$this->invalidateWithSegments([$idSite], $date->toString(), 'day'); | |||
$isYesterday = $dateStr == 'yesterday'; | |||
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this and below code skips a particular check if we are invalidating yesterday
(not a date including yesterday
, just yesterday
; we do this like we're invalidating today
in case the day has changed and we want to initiate archiving for yesterday).
Since we only care if the day changed since the archive was created, we want to check if there's any archive available that we can use, so we don't want to do a ttl check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we maybe also add inline a comment for this explaining this behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
…rchiveIdAndVisits is used in multiple code paths now
if (!empty($idArchives) | ||
&& !$this->params->getArchiveOnlyReport() | ||
&& !Rules::isForceArchivingSinglePlugin() | ||
&& !$this->shouldForceInvalidatedArchive($value, $tsArchived) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: if an archive is found to be invalidated, we check if we should force archiving. This check was originally in ArchiveSelector, but I moved it since ArchiveSelector is used in multiple places now.
@tsteur made some changes ready for another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis looks good so far. Added a few comments that may need looking at.
Also I'm thinking it may be useful to add a new method isArchivingEnabledFor
like below since there are a lot of !
and it makes the code maybe more easily understandable. I'm sure we could use that method eventually in a lot of places and change a lot of isArchivingDisabledFor
calls
diff --git a/core/ArchiveProcessor/Loader.php b/core/ArchiveProcessor/Loader.php
index 4e792338cc..f79714e882 100644
--- a/core/ArchiveProcessor/Loader.php
+++ b/core/ArchiveProcessor/Loader.php
@@ -275,7 +275,7 @@ class Loader
{
// for range periods we can archive in a browser request request, make sure to check for the ttl no matter what
$isRangeArchiveAndArchivingEnabled = $this->params->getPeriod()->getLabel() == 'range'
- && !Rules::isArchivingDisabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
+ && Rules::isArchivingEnabledFor([$this->params->getSite()->getId()], $this->params->getSegment(), $this->params->getPeriod()->getLabel());
if (!$isRangeArchiveAndArchivingEnabled) {
$endDateTimestamp = self::determineIfArchivePermanent($this->params->getDateEnd());
@@ -444,7 +444,7 @@ class Loader
// the archive is invalidated and we are in a browser request that is allowed archive it
if ($value == ArchiveWriter::DONE_INVALIDATED
- && !Rules::isArchivingDisabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
+ && Rules::isArchivingEnabledFor([$params->getSite()->getId()], $params->getSegment(), $params->getPeriod()->getLabel())
) {
// if coming from core:archive, force rearchiving, since if we don't the entry will be removed from archive_invalidations
// w/o being rearchived
diff --git a/core/ArchiveProcessor/Rules.php b/core/ArchiveProcessor/Rules.php
index 947611e5c5..27d505a7ea 100644
--- a/core/ArchiveProcessor/Rules.php
+++ b/core/ArchiveProcessor/Rules.php
@@ -197,6 +197,11 @@ class Rules
return !$generalConfig['browser_archiving_disabled_enforce'];
}
+ public static function isArchivingEnabledFor(array $idSites, Segment $segment, $periodLabel)
+ {
+ return !self::isArchivingDisabledFor($idSites, $segment, $periodLabel);
+ }
+
public static function isArchivingDisabledFor(array $idSites, Segment $segment, $periodLabel)
{
$generalConfig = Config::getInstance()->General;
// made invalid, we will correctly re-archive below. | ||
if ($this->invalidateBeforeArchiving) { | ||
$this->invalidatedReportsIfNeeded(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@diosmosis is this maybe only needed for certain periods or so? I see CoreAdminHome.archiveReports
sets this to true
when archive php is not triggered meaning this is the case for most requests even when browser archiving is disabled. It then shouldn't invalidate any reports. Maybe need to use something like $this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled
or maybe only add it for some periods like ($this->invalidateBeforeArchiving && Rules::isBrowserArchivingEnabled) || ($this->invalidateBeforeArchiving && period === range && Rules::isRangeArchivingEnabled)
.
core/ArchiveProcessor/Loader.php
Outdated
} | ||
|
||
// if coming from a browser request, and period does not contain today, force rearchiving | ||
if (!$params->getPeriod()->isDateInPeriod(Date::factory('today'))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to add the timezone to today
depending on the idsite maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -857,10 +857,11 @@ public function invalidateRecentDate($dateStr, $idSite) | |||
'date' => $date->getDatetime(), | |||
]); | |||
|
|||
$this->invalidateWithSegments([$idSite], $date->toString(), 'day'); | |||
$isYesterday = $dateStr == 'yesterday'; | |||
$this->invalidateWithSegments([$idSite], $date->toString(), 'day', false, $doNotIncludeTtlInExistingArchiveCheck = $isYesterday); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we maybe also add inline a comment for this explaining this behaviour?
@tsteur applied review feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if tests pass 👍
…ile methods used in next test were loaded)
Description:
Currently, if a range archive is invalid, and the user views it, it will not re-archive. This is because ranges are generally archived well after the end date, so the ts_archived will always be later than the minimum processed time: https://github.com/matomo-org/matomo/blob/3.x-dev/core/ArchiveProcessor/Loader.php#L223-L236
I think, previously, if going through the "trigger archiving" code path (https://github.com/matomo-org/matomo/blob/4.x-dev/core/Archive.php#L571), we never looked for invalidated archives, so we'd always initiate archiving. This change happened in 3c2b5f7 to make sure we don't initiate archiving of invalidated archives under certain circumstances if we don't have to.
Also for some reason a test was removed in a commit (but the test data was still there)... I put it back.
FYI @tsteur
Review