New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make archiving process to respect process_new_segments_from settings #17351
Conversation
Hi @diosmosis, @tsteur and I had a quick chat this morning, and this is what figured in the end. I tested locally and I see it does return Can you please check it? Thanks |
@flamisz I see two possible issues w/ this solution (though it's on the right track):
@tsteur there might need to be some discussion on a good solution as it seems like it may be complicated to solve this. |
It looks like we need more discussions on this issue. I pause working on it until you'll have more time to dig deeper into it. |
@diosmosis great point. This makes me think it can't have worked in Matomo 3 either and we might need to lower the priority for this. Out of the box do you otherwise have any quick idea of how to fix this without the need of a big refactoring? Like ideally we'd know whether it was maybe forced to archive this segment for a range before segment creation but we can't really know (although maybe we could check the |
The only way I can think of is to have a new column on archive_invalidations like I can't think of a way to do it w/o a new column. I guess we could also just stick it on the name column value, but that would make the code a bit weirder imo. |
thanks @diosmosis just double checking so something like this wouldn't work?
Assuming that when someone invalidates eg a week, we also add the entries for each day to |
@tsteur I'm not sure that approach will work (if I understand it correctly). When checking a year, for example, the archiver should have already handled months, weeks days in the year, so they would have been removed from the invalidations table. |
Which be good that they are removed meaning it wouldn't launch archiving anyway in that case? as the days etc are already archived and it would only aggregate the existing data? |
@tsteur thinking out loud: so there are three cases, one where we archive correctly where date range > than process_new_segments_from's date:
this case works. next case, we archive correctly where some of date range < process_new_segments_from's date:
This is what we want. third case, user invalidates date range before process_new_segments_from's date, because they want that data rearchived (not automated use case):
This is not what we want, because the user specifically invalidated this date range, it wasn't automated. Checking the invalidation table isn't a way to differentiate between user initiated invalidations and scheduled rearchiving. there are probably some edge cases when archives fail for some reason, that might be an entirely different issue. |
I suppose you mean an actual custom date range like
Because of the setting there should be no invalidations from scheduled rearchiving so I thought it is maybe not needed to differentiate between user initiated? Basically, my thoughts (and hope) was that we would have entries in invalidations table before the segment creation time only when it was triggered by a user and then we could force the archiving maybe? |
No, sorry I didn't mean that. I just meant something invalidated by the user and not rearchived.
These invalidations are added right before archiving for a site, so they will always be there before archiving (and are removed after archiving). Sorry, I'm not entirely sure what you're getting at. The day archives will get removed after the day is archived, then the week will start archiving, but this is true of rearchiving AND of user initiated invalidation. There's no difference at the moment, which sort of the problem. It would be fairly trivial to signify it though w/ an extra db column on the invalidation table. Or we could hack something into the name column. |
Quickly wanted to check in on how this PR is progressing and what would be the next step(s)? (sorry haven't read the whole discussion 🙈 ) (asking because this PR is the first "stage" of hoping to fix https://innocraft.atlassian.net/jira/software/projects/L3/boards/20?selectedIssue=L3-46 which is not currently a major issue it seems (as customer hasn't complained) but is nevertheless quite important issue (fyi they are an enterprise customer). also it generates many emails). |
@diosmosis can you maybe explain what you mean by @mattab this one is quite complex and it's not a regression AFAIK and it'll take a while longer. Maybe there's another workaround meanwhile |
@tsteur can you maybe explain what you mean by this is true of rearchiving? I'm not quite sure what you mean with rearchiving here? I mean there's no difference between "user invalidates archive, it gets processed by core:archive" and "system schedules rearchiving, system invalidates archives, they get processed by core:archive". There's no way to tell those invalidations apart. BUT in this issue, we want to apply the archive limiting ONLY to rearchiving segments. If a user invalidates in the past before So we need some way to tell the invalidations apart, whether they were created via user invalidation or via system invalidation. If a user tracks data in the past before |
@tsteur afaik the other workaround might be to implement rankingQuery in custom dimensions archiving (what @diosmosis had suggested in the L3) |
@mattab I can work on that as part of the L3 |
FYI for segments we actually wouldn't want to archive it if a user tracks data for a period that is eg before segment creation date. We'd only want to archive it if user specifically invalidates it. If we could have a check to not create the invalidations for such a segment when data is tracked in the past, then it would probably work maybe? |
@tsteur does that apply to all possible values of process_new_segments_from or just creation date? ie, if it is set to last6, then two months pass, and the user tracks data from 7 months ago (which was after when the segment was created), we archive or not? And since we can sometimes archive in the past when creating a segment (ie, if set to last6, and we create it in Nov, then we rearchive from May - Nov), should we still allow tracking data in May and rearchiving? And would it not apply if process_new_segments_from is set to beginning_of_time?\
Even with this, we'd still need to differentiate between user created invalidation and system created invalidation. |
We would archive the last 6 months before creation date (according to the documentation in the global ini config). But we wouldn't archive the reports 7 months before creation date if someone was to track data into this.
We would still allow tracking because it's required for All Visits but we wouldn't rearchive the data for this segment if it is before the 6 months. We would only do it if someone was to manually invalidate the data maybe to trigger the archiving this way. |
This is the problem I'm trying to describe: there is no way currently to tell between an invalidation that was manually invalidated and an invalidation that was created through |
I guess my thinking is that |
Well the issue isn't that it will invalidate more than is necessary, it already only invalidates what it needs to (eg, last6, invaldates the last 6 months, but ALSO the year 2021 and 2020). This ends up triggering archiving of missing child archives, so reArchiveReport can't really stop child archives for those invalidations from being initiated, unless we add an extra column to the table and mark the invalidations in some way. |
@tsteur another option if not adding a column is having new statuses, instead of just |
Technically we could add a new column but trying to avoid making anything possibly more complicated etc then it needs to be.
I'm really thinking that be generally find if it triggers year 2021 and 2020 as by then it would have already triggered the days, weeks, and months that are allowed to be archived. I'll try to show what I mean below otherwise might need to just try and see if it works or go on a call to chat so we're sure we're meaning the same thing. I'm thinking eg of these flows:
When it then tries to archive any day before 29th Jan 2021 we wouldn't ever archive that period since no entry in archive_invalidations exist and it is before When we then trigger the archive of day 29th Jan 2021, it would allow this since that entry exist. If it was to trigger the archive of year 2021 before any of the other months or weeks or days were processed for some reasons, then it would still allow the archiving of these subperiods because they exist in the archive_invalidations table. If it was to trigger the archive of year 2021 and all the other archive_invalidations were already processed and all the entries don't exist in the table anymore then it would correctly not launch the archiving of these subperiods (since it can't find a matching entry in the table) which be fine because would have been already archived anyway and there is no need to archive them again. If then later a tracking request comes for 22nd Jan 2021, then we wouldn't invalidate anything for this segment because it is before If then later a tracking request comes for 31st Jan 2021, we would invalidate the following entries:
and the same flow would work again because it would only process these days. If a user was to invalidate manually any date, that be fine too and we'd allow to invalidate any date (even before Jan 29th) and the system would correctly archive only the ones specifically that were invalidated as the I might be missing some things but not sure. Hoping my thinking might be bit more clear maybe? Sorry for the late response. |
@tsteur haven't read the comment, but just fyi, |
This is where I think it doesn't work. If the user invalidates before Jan 29th, say they invalidate the day Dec 1, 2020. The day period is added to archive_invalidations, as well as the week, month and year. When we archive them we would archive the day first. I think in your solution we check for whether there are any child period invalidations in the table before last91? There aren't for the day, so we archive it. Then we remove the entry from archive_invalidations. Then we try to archive the week. But there are no day periods before last91 for the segment, so it never archives. Instead we assume we don't have to rearchive, and remove the archive_invalidation row. Same thing happens for month and year. Does this make sense or would this happen a different way? |
@diosmosis I added a commit that roughly shows what I was thinking. I didn't actually test it and think it would maybe need to be adjusted slightly to get maybe correct date in timezone and the right done flags (might have to test multiple done flags not sure).
We would correctly archive the day first. We wouldn't check if there are any child period invalidations in the table. Only if the same period/date exists in the invalidations table then we'd allow the archiving. We would only archive the days and weeks and months that were requested and wouldn't archive any other ones. Maybe check the PR to see what I'm thinking. |
You're right, I was thinking of the rearchive lastN INI config.
Ok, I see, so if year is invalidated, but not a month in the year, we can tell the month shouldn't be invalidated, at the time of invalidating the month (rather than checking whether we should launch archiving for child periods when archiving the year). It makes sense to me now. |
77d2e53
to
3b83de0
Compare
Hi @diosmosis, can you please have the last look at this PR? Tests are looking good, I check those failing UI tests. EDIT: those UI tests are solved with the latest commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments
core/ArchiveProcessor/Loader.php
Outdated
); | ||
|
||
$table = Common::prefixTable('archive_invalidations'); | ||
$sql = "SELECT idinvalidation FROM `$table` WHERE idsite = ? AND date1 = ? AND date2 = ? AND `period` = ? AND `name` = ?"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query could have a LIMIT 1 in case there are invalidations w/ the same params, but different $report
s.
EDIT: actually, if we get a request to do a whole archive, but there is only an invalidation w/ a report
value set, we should probably still skip. It might be needed to take into account any requested report (cc @tsteur).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we turn off the browser triggered archiving, when we load a report it will archive for that report, only for that one, am I right? Could it be happen somehow this gets into the invalidation table? Because if it can, we don't want to skip in this case, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we turn off the browser triggered archiving, when we load a report it will archive for that report, only for that one, am I right?
I'm not sure what you mean. The 'archive only specific report' functionality is only triggered through core:archive and if there is a 'report' value set in archive_invalidations. This only happens if we reArchiveReport for a specific report. It never gets triggered via browser triggered archiving.
The issue I'm thinking of is if there is a request the archive, eg, period = day, date = 2020-01-02, idSite = 2, segment = blahblahblah, plugin = Plugin
and the only item in the invalidation table is:
idsite = 2, period = day, date1 = 2020-01-02, date2 = 2020-01-02, name = done . md5(blahblahblah).Plugin, report = specificreport
In this case if everything else about the period/segment is true (it's outside of the process_new_segment_from range), we don't want to archive, because it would archive more then the specific report (it would do the whole plugin).
If we receive another request where report = specificreport as well, the we do want to archive because it's in the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I was confused. I thought about segment archiving in browser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't modified the reports
part but added the limit 1
.
I'm going to add the specific report case and see how it goes with tests.
// automatically, only a user can specifically invalidate | ||
continue; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code (the function calls and ifs) looks very similar to the above code. Can it be moved to a method to reduce duplication? Maybe a method in the SegmentArchiving
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find any useful way to introduce a new function, but refactored both files, mostly get rid off temporary variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@flamisz I was thinking something like:
SegmentArchiving.php
public function getSegmentReArchiveDateIfPastEndOfArchivingPeriod($params)
{
$segmentInfo = $this->findSegmentForHash($params->getSegment()->getHash(), $params->getSite()->getId());
if (!$segmentInfo) {
return false;
}
$segmentArchiveStartDate = $this->getReArchiveSegmentStartDate($segmentInfo);
if ($segmentArchiveStartDate !==null && $segmentArchiveStartDate->isLater($params->getPeriod()->getDateEnd()->getEndOfDay())) {
return $segmentArchiveDate;
}
return false;
}
then in Loader.php:
public function canSkipArchiveForSegment()
{
/** @var SegmentArchiving */
$segmentArchiving = StaticContainer::get(SegmentArchiving::class);
$segmentReArchiveDate = $segmentArchiving->getSegmentReArchiveDateIfPastEndOfArchivingPeriod($this->params);
if (!empty($segmentReArchiveDate)) {
$doneFlag = Rules::getDoneStringFlagFor(
[$params->getSite()->getId()],
$params->getSegment(),
$params->getPeriod()->getLabel(),
$params->getRequestedPlugin()
);
// if there is no invalidation where the report is null, we can skip
// if we have invalidations for the period and name, but only for a specific reports, we can skip
// if the report is not null we only want to rearchive if we have invalidation for that report
// if we don't find invalidation for that report, we can skip
return !$this->dataAccessModel->hasInvalidationForPeriodAndName($idSite, $params->getPeriod(), $doneFlag, $params->getArchiveOnlyReport());
}
return false;
}
and above in CronArchive.php:
$segmentReArchiveDate = $segmentArchiving->getSegmentReArchiveDateIfPastEndOfArchivingPeriod($this->params);
if (!empty($segmentReArchiveDate)) {
// the system is not allowed to invalidate reports for this period
// automatically, only a user can specifically invalidate
continue;
}
Maybe this doesn't work or it's not a good idea for another reason, I'll just take your word for it since it's not a huge deal.
My latest commit contains the code is bringing into account requested report. |
@diosmosis this is the last PR we need to get merged for 4.3 release. When you have time be great to have another look |
core/ArchiveProcessor/Loader.php
Outdated
// if we don't find invalidation for that report, we can skip | ||
if ($params->getArchiveOnlyReport() && !$this->dataAccessModel->hasInvalidationForPeriodAndName($idSite, $params->getPeriod(), $doneFlag, $params->getArchiveOnlyReport())) { | ||
return true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the two if
s above be simplified to just:
return !$this->dataAccessModel->hasInvalidationForPeriodAndName($idSite, $params->getPeriod(), $doneFlag, $params->getArchiveOnlyReport());
?
Since archive only report will be empty when it's not set, we can just pass it as is, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. To be honest, I'm not sure, can't remember exactly, but probably you are right. Needs to be tested to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought and walked through it, it should work the way you recommended, I updated the PR.
Hi @diosmosis, I can see only one, I updated the pr based on that recommendation |
@flamisz this is the other one: #17351 (comment) |
thanks @diosmosis I see. Yeah I didn't address that, totally forget. I look into it today. |
@flamisz some tests are failing now, looks like a quick fix, can you make an update? I'll merge afterwards. |
@diosmosis it's been fixed now |
Description:
fixes #17336
Review