Setting process_new_segments_from not respected when archiving bigger periods #17336
Comments
@tsteur should this be done just for Cloud or for matomo core? (ie, the pluginOnly change was just done for cloud)
@diosmosis should be done in both
I've spent a couple of hours on this issue already (mostly learning about the archiving process itself) but I'm still not getting closer. I'm happy to spend more, but it would maybe be more useful if I could do pair programming on this with someone, or if someone could look at it and give some explanation :).
Any help appreciated (I just don't want to throw away extra hours if someone can solve/explain it in less than 1 hour). Thanks
@flamisz If I understood that correctly, the config for process_new_segments_from …
@flamisz not sure which config setting you used for the segments? We can have a quick chat about it tmrw. Generally, when using eg process_new_segments_from …
For the solution it may help to look at Cloud#250, as the solution here should be fairly similar.
@flamisz if you need more context I can explain the entire problem, too. It's a bug that involves four separate systems (core archiving/aggregation logic, archive invalidation, archive processing and segment creation/updating), so it makes sense it would be hard to understand w/o context.
@diosmosis that would be great, thanks. I see it's more complex than I expected, and most of this is new to me.
@flamisz here is my best attempt, I'll explain how the bug emerges by explaining each system. The explanations are not completely in depth because there's still a lot of detail that isn't super important to know. Also I don't know how much you already know, so feel free to ignore anything I'm explaining to you twice.

core archiving/aggregation

Core archiving is the logic that starts in Archive.php. The Archive class lets you query archive data. If the data is not found, and the current request is allowed to initiate aggregation (eg, …), archiving is initiated.

The way aggregation is done is different based on period. Day periods result in LogAggregator being used to query data over log tables. This is too much data to do over weeks and months and years, so for periods that are greater than a day, we sum up the reports within the period. For example, for a week period we sum the reports of the days within it.

archive invalidation/archive processing

When an archive needs to be reprocessed for some reason, it is marked as invalidated. This happens differently for browser-triggered archiving and for core:archive.

For browser-triggered archiving, we mark existing archives in the table as DONE_INVALIDATED. When a user requests a report from one of those archives, we notice it is DONE_INVALIDATED in Loader.php and make sure not to use it, and instead initiate archiving as if there were no data. This works for browser-triggered archiving because it happens for every user request; we only need to process an archive if a user requests it.

For core:archive, however, we can't do this: we want to process all invalidated archives. And it is too expensive to query every archive table for invalidated archives each time, so we use the archive_invalidations table. When processing an invalidation, we basically create a new process (either an HTTP request or an instance of the climulti:request command) that initiates the same core archiving process via Loader.php (starting in an API method in CoreAdminHome).

Invalidating a lower period necessitates invalidating higher periods.
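The roll-up for periods greater than a day, described above, can be sketched like this (a minimal Python illustration; the dict-of-metrics data model and the function name are assumptions, not Matomo's actual API):

```python
# Minimal sketch of period aggregation: a week/month/year archive is built
# by summing its already-processed child-period archives rather than
# re-querying raw log tables. The metric-dict model is an assumption.
def aggregate_period(child_archives):
    totals = {}
    for archive in child_archives:
        for metric, value in archive.items():
            totals[metric] = totals.get(metric, 0) + value
    return totals

# Eg, a week built from its day archives:
days = [{"nb_visits": 10, "nb_actions": 30}, {"nb_visits": 5, "nb_actions": 12}]
print(aggregate_period(days))  # {'nb_visits': 15, 'nb_actions': 42}
```

This is why higher periods are cheap when the child archives already exist, and expensive when they don't: each missing child day falls back to log aggregation first.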
For example, invalidating a day because it has more visits means the week, month and year it is in are no longer accurate.

segment creation/updating

When a segment is created/updated, we want to reprocess data for past dates. The amount of data to process is controlled by the INI config mentioned above. We do this by invalidating past data.

This is only really an issue for core:archive. For browser-triggered archiving, users will request the new/edited segment, and since the hash will be new/different there will be no archive data to serve, so we initiate archiving. For core:archive, however, we invalidate dates in the past so the command will see them in the archive_invalidations table and process them.

If process_new_segments_from is set to, eg, last6, we would archive the last 6 months (there are other values we can use, but I'll just focus on this one). So let's say today, on March 15th, we modify a segment. Now we want to rearchive the last 6 months. So we invalidate the days from Oct 15th => Mar 15th. This means the weeks and months also get invalidated. And also, crucially, the years, 2020 & 2021.

the bug

So the bug is this: …
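The cascade from a day invalidation up to its containing periods can be illustrated with a small sketch (hypothetical helper, not Matomo code; weeks are identified by their Monday here as a simplifying assumption):

```python
from datetime import date, timedelta

# Hypothetical illustration: invalidating a day implies the containing
# week, month and year archives are stale and must be invalidated too.
def periods_to_invalidate(day: date) -> dict:
    week_start = day - timedelta(days=day.weekday())  # assume weeks start Monday
    return {
        "day": day.isoformat(),
        "week": week_start.isoformat(),   # identify the week by its first day
        "month": day.strftime("%Y-%m"),
        "year": str(day.year),
    }

# A segment edited on March 15th invalidates every day from Oct 15th to
# Mar 15th; collecting the years touched shows why both 2020 and 2021
# end up invalidated:
start, end = date(2020, 10, 15), date(2021, 3, 15)
years = {periods_to_invalidate(start + timedelta(days=i))["year"]
         for i in range((end - start).days + 1)}
print(sorted(years))  # ['2020', '2021']
```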
This is all done in a single request, so not only does it create a lot of load for a lot of unnecessary work, memory leaks can add up and cause process crashes.

how it was fixed for plugin archives

This problem also occurs for plugin archives. When certain plugins are activated/deactivated or certain entities are updated/created (eg, a funnel or custom report), we want historical data to be processed for them, so we invalidate in the past just as we do for segments. And since no data will exist in the same way, we end up doing way too much archiving. This was fixed by introducing pluginOnly=1.

A similar fix would work here; however, it's a bit more complicated, because there isn't a way to differentiate between whether an archiving request was made because a segment was created/updated or because the user actually invalidated data in the past. For plugin archives, pluginOnly=1 is only set in this case, so we know we can do this.

I hope that helped, feel free to ask questions if something's not clear.
Thanks @diosmosis for the detailed explanation. I'll go through it and let you know if I still have questions.
When configuring eg
process_new_segments_from = "segment_creation_time"
and you create a segment on December 12th 2021, then we're not supposed to archive any data before that. The cron archive logic respects that. However, it seems that when it starts processing the week, month and year, it will archive for example all the days from Dec 1st - Dec 12th as part of the month archive and all the previous days from Jan 1st to end of November as part of the year archive. This means there's more archiving happening than there should be, and because it would be trying to archive hundreds of days as part of the year archive it would likely eventually also run into memory issues etc.
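For reference, a related value for this setting mentioned elsewhere in this thread is last6; assuming the standard placement in the [General] section of the INI config, it would look like:

```ini
; Assumed location: the [General] section of the Matomo INI config.
[General]
; re-archive only the last 6 months of data for new/edited segments
process_new_segments_from = "last6"
```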
The goal would be to not launch the archiver, for example here: https://github.com/matomo-org/matomo/blob/4.2.1/core/Archive.php#L560-L563 and/or here: https://github.com/matomo-org/matomo/blob/4.2.1/plugins/CoreAdminHome/API.php#L257 or somewhere else, when it is trying to archive a day that is not supposed to be archived based on the above config setting.
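In spirit, the guard being asked for could look like the following sketch (hypothetical names and a simplified cutoff model; the real check would live in the Archive.php / CoreAdminHome code paths linked above, not in a standalone helper):

```python
from calendar import monthrange
from datetime import date

# Hypothetical sketch: when aggregating a higher period for a new segment,
# only launch day archiving for child days at or after the cutoff implied
# by process_new_segments_from (here modeled as the segment creation date).
def archivable_days(year: int, month: int, cutoff: date) -> list:
    _, last_day = monthrange(year, month)
    return [date(year, month, d)
            for d in range(1, last_day + 1)
            if date(year, month, d) >= cutoff]

# Segment created Dec 12th 2021: the December month archive should only
# trigger day archiving for Dec 12th-31st, not Dec 1st-11th.
days = archivable_days(2021, 12, date(2021, 12, 12))
print(days[0].isoformat(), len(days))  # 2021-12-12 20
```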