When configuring e.g. `process_new_segments_from = "segment_creation_time"` and you create a segment on December 12th 2021, then we're not supposed to archive any data before that date.
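For reference, the setting goes into the `[General]` section of `config/config.ini.php` (the comment is mine, not from the shipped config):

```ini
[General]
; only process segment archives from the date the segment was created
process_new_segments_from = "segment_creation_time"
```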
The cron archive logic respects that. However, when it starts processing the week, month and year periods, it will archive, for example, all the days from Dec 1st to Dec 12th as part of the month archive and all the days from Jan 1st to the end of November as part of the year archive. This means more archiving happens than should, and because it would be trying to archive hundreds of days as part of the year archive, it would likely also eventually run into memory issues etc.
The goal would be to not launch the archiver, for example here: https://github.com/matomo-org/matomo/blob/4.2.1/core/Archive.php#L560-L563 and/or here: https://github.com/matomo-org/matomo/blob/4.2.1/plugins/CoreAdminHome/API.php#L257 or somewhere else, when it is trying to archive a day that is not supposed to be archived based on the config setting above.
@tsteur should this be done just for Cloud or for Matomo core? (i.e., the `pluginOnly` change was just done for Cloud)
I've spent a couple of hours on this issue already (mostly learning about the archiving process itself) but I'm still not getting closer. I'm happy to spend more time, but it would maybe be more useful if I could pair program on this with someone, or if someone could look at it and give some explanation :).
I'm not even sure the logic respects that setting for the days themselves. I created a brand new Matomo with an empty DB, and when I ran the archiver it found events in the past (2021-03-11 is before the date the segment was created):
```
INFO [2021-03-15 05:14:33] 22140 Starting Matomo reports archiving...
INFO [2021-03-15 05:14:33] 22140 Start processing archives for site 1.
INFO [2021-03-15 05:14:33] 22140 Will invalidate archived reports for today in site ID = 1's timezone (2021-03-15 00:00:00).
INFO [2021-03-15 05:14:35] 22140 Archived website id 1, period = day, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 0.998s
INFO [2021-03-15 05:14:35] 22140 Archived website id 1, period = day, date = 2021-03-11, segment = 'actions>=2', 3 visits found. Time elapsed: 1.288s
INFO [2021-03-15 05:14:37] 22140 Archived website id 1, period = week, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 1.269s
INFO [2021-03-15 05:14:37] 22140 Archived website id 1, period = week, date = 2021-03-08, segment = 'actions>=2', 11 visits found. Time elapsed: 2.143s
INFO [2021-03-15 05:14:38] 22140 Archived website id 1, period = month, date = 2021-03-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.909s
INFO [2021-03-15 05:14:39] 22140 Archived website id 1, period = year, date = 2021-01-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.903s
INFO [2021-03-15 05:14:39] 22140 Finished archiving for site 1, 6 API requests, Time elapsed: 6.263s [1 / 1 done]
INFO [2021-03-15 05:14:40] 22140 Done archiving!
```
@flamisz If I understood that correctly, the config `process_new_segments_from` is actually only used when a segment is created/updated. In that case the segment archives are invalidated for the past (depending on the setting).
The normal archiving process doesn't use the setting. So if you hadn't run the archiver before, the archiver will also archive newly created segments for periods that hadn't been processed at all before.
@flamisz I'm not sure which config setting you used for the segments?
We can have a quick chat about it tomorrow. Generally, when using e.g. `process_new_segments_from = "segment_last_edit_time"` (or any other setting that is not the default value `beginning_of_time`), the goal would be to not archive any day before that day. If you had `segment_creation_time`, then the archiver should not have launched that archive for the day (this would be one bug), and another bug would be that the day may be archived as part of a month or year archive.
For the solution it may help to look at Cloud#250 as the solution here should be fairly similar.
@flamisz if you need more context I can explain the entire problem, too. It's a bug that involves four separate systems (core archiving/aggregation logic, archive invalidation, archive processing and segment creation/updating), so it makes sense it would be hard to understand w/o context.
@diosmosis that would be great thanks. I see it's more complex than I expected and most of these are new for me.
@flamisz here is my best attempt; I'll explain how the bug emerges by explaining each system. The explanations are not completely in depth because there's still a lot of detail that isn't super important to know. Also, I don't know how much you already know, so feel free to ignore anything I'm explaining to you twice.
core archiving is the logic that starts in Archive.php. The `Archive` class lets you query archive data. If the data is not found, and the current request is allowed to initiate aggregation (e.g., `$_GET['trigger'] = 'archivephp'` or browser-initiated archiving is enabled in the UI), the process is started and individual plugin Archivers get invoked. This is handled by the Loader.php/PluginsArchiver.php classes.
The way aggregation is done differs by period. Day periods result in `LogAggregator` being used to query data over the log tables. This is too much data to do over weeks, months and years, so for periods greater than a day, we sum up the reports within the period. For example, for `week` periods, we sum up the days within the week. For `month` periods we sum up the weeks inside the month plus the days at the edges. For `year` periods we use the months in the year. This is handled by the `ArchiveProcessor` class. Specifically, the class creates its own `Archive` instance to query each subperiod: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L116-L118 . This instance is used to query the data to aggregate, for example: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L346 . The recursive use of `Archive` means a year archive can eventually initiate archiving for a day period, if the archive doesn't already exist.
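The recursion described above can be sketched roughly like this (a Python sketch, not Matomo's PHP; `DAY_VISITS` and `archive_store` are illustrative stand-ins for the log tables and archive tables, and the month rule is simplified to days instead of weeks plus edge days):

```python
from datetime import date, timedelta

# Stand-in for the log tables: visits recorded per day.
DAY_VISITS = {date(2021, 3, 11): 3, date(2021, 3, 15): 1}
archive_store = {}  # (period, start_date) -> visits; stands in for archive tables

def archive(period, start):
    """Fetch an archive, launching subperiod aggregation when it is missing."""
    key = (period, start)
    if key in archive_store:
        return archive_store[key]
    if period == "day":
        # In Matomo, LogAggregator would query the log tables here.
        visits = DAY_VISITS.get(start, 0)
    elif period == "week":
        # A week sums the archives of its 7 days; a miss recurses into day archiving.
        visits = sum(archive("day", start + timedelta(days=i)) for i in range(7))
    elif period == "month":
        # Simplified to summing days; Matomo really sums the weeks inside the
        # month plus the days at the edges.
        d, visits = start, 0
        while d.month == start.month:
            visits += archive("day", d)
            d += timedelta(days=1)
    else:  # "year": sum the 12 month archives
        visits = sum(archive("month", date(start.year, m, 1)) for m in range(1, 13))
    archive_store[key] = visits
    return visits
```

Requesting the 2021 year archive against an empty store ends up aggregating every single day of the year, which is why an invalidated year whose subperiod archives don't exist creates so much work.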
archive invalidation/archive processing
When an archive needs to be reprocessed for some reason, it is marked as invalidated. This happens differently for browser-triggered archiving and for core:archive. For browser-triggered archiving, we mark existing archives in the table as DONE_INVALIDATED. When a user requests a report from one of those archives, we notice it is DONE_INVALIDATED in Loader.php and make sure not to use it, and instead initiate archiving as if there was no data. This works for browser-triggered archiving because it will happen for every user request; we only need to process an archive if a user requests it.
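A minimal sketch of that browser-triggered path (Python, not Matomo's PHP; the status codes and the dict-based store are illustrative):

```python
DONE_OK, DONE_INVALIDATED = 1, 4  # illustrative status codes

def aggregate_from_logs(key):
    # Stand-in for real aggregation via LogAggregator.
    return 42

def get_report(archives, key):
    entry = archives.get(key)
    if entry is not None and entry[0] == DONE_OK:
        return entry[1]
    # Missing or DONE_INVALIDATED: never serve it; re-archive instead.
    value = aggregate_from_logs(key)
    archives[key] = (DONE_OK, value)
    return value
```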
For core:archive, however, we can't do this; we want to process all invalidated archives. And it is too expensive to query every archive table for invalidated archives each time, so we use the `archive_invalidations` table. We insert invalidations there and pull them out in the proper order to process them (the proper order being lower periods first, so days, weeks, months, years, and segment archives after the "all visits" archive). We process lower periods first so higher periods will not initiate archiving for lower periods within a single request.
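That ordering can be expressed as a simple sort key (a Python sketch; the dict layout is illustrative, not the `archive_invalidations` schema):

```python
# Lower periods sort first; within a period, the "all visits" archive
# (empty segment) sorts before segment archives.
PERIOD_ORDER = {"day": 0, "week": 1, "month": 2, "year": 3}

def processing_order(invalidations):
    return sorted(invalidations,
                  key=lambda inv: (PERIOD_ORDER[inv["period"]], inv["segment"] != ""))

pending = [
    {"period": "year", "date": "2021-01-01", "segment": ""},
    {"period": "day", "date": "2021-03-15", "segment": "actions>=2"},
    {"period": "day", "date": "2021-03-15", "segment": ""},
]
```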
When processing an invalidation, we basically create a new process (either an HTTP request or an instance of the climulti:request command) that initiates the same core archiving process via Loader.php (starting in an API method in CoreAdminHome).
Invalidating a lower period necessitates invalidating higher periods. For example, invalidating a day because it has more visits means the week, month and year it is in are no longer accurate.
When a segment is created/updated, we want to reprocess data for past dates. The amount of data to process is controlled by the INI config mentioned above. We do this by invalidating past data. This is only really an issue for core:archive. For browser triggered archiving, users will request the new/edited segment, and since the hash will be new/different there will be no archive data to serve. So we initiate archiving.
For core:archive, however, we invalidate dates in the past so the command will see them in the `archive_invalidations` table and process them. If `process_new_segments_from` is set to, e.g., `last6`, we would archive the last 6 months (there are other values we can use, but I'll just focus on this one).
So let's say today, on March 15th, we modify a segment. Now we want to rearchive the last 6 months. So we invalidate the days from Oct 15th => Mar 15th. This means the weeks and months also get invalidated. And also, crucially, the years, 2020 & 2021.
Note: we don't end up invalidating Jan-Oct 15th, just October 15th onwards. So the year 2020 is invalidated, but many of the months inside are not invalidated, and also archive data for it does not exist (because the segment hash changed when the segment was edited).
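The invalidation pattern in this example can be sketched like so (a Python sketch of the behaviour described above, not Matomo's code; the function name and tuple shapes are made up for illustration, and weeks are taken as Monday-based):

```python
from datetime import date, timedelta

def invalidations_for_range(first_day, last_day):
    """Invalidate every day in the range plus the containing weeks,
    months and years."""
    out = set()
    d = first_day
    while d <= last_day:
        out.add(("day", d.isoformat()))
        out.add(("week", (d - timedelta(days=d.weekday())).isoformat()))
        out.add(("month", d.replace(day=1).isoformat()))
        out.add(("year", date(d.year, 1, 1).isoformat()))
        d += timedelta(days=1)
    return out

inv = invalidations_for_range(date(2020, 10, 15), date(2021, 3, 15))
# Both years end up invalidated, yet the months before October 2020 do not,
# leaving the 2020 year archive with many subperiods that have no data.
```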
So the bug is this: when core:archive processes the invalidation for a higher period like the year 2020, the recursive use of `Archive` finds no data for the months, weeks and days inside it and launches archiving for all of them, including the many periods that were never invalidated and lie before the range the setting asked for. This is all done in a single request, so not only does it create a lot of load for a lot of unnecessary work, memory leaks can add up and cause process crashes.
how it was fixed for plugin archives
This problem also occurs for plugin archives. When certain plugins are activated/deactivated or certain entities are updated/created (eg, a funnel or custom report), we want historical data to be processed for them, so we invalidate in the past just as we do for segments. And since no data will exist in the same way, we end up doing way too much archiving.
This was fixed by introducing `Archive::forceFetchingWithoutLaunchingArchiving()`. When we detect one of these kinds of requests (because `pluginOnly=1` was set in the request), we set it, which makes sure child archives aren't re-processed if there's no data.
A similar fix would work here; however, it's a bit more complicated, because there isn't a way to differentiate between whether an archiving request was made because a segment was created/updated or because the user actually invalidated data in the past. For plugin archives, `pluginOnly=1` is only set in this case, so we know we can do this.
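The idea behind that guard can be sketched as a flag that turns "fetch, launching archiving on a miss" into "fetch existing data only" (a Python sketch of the concept behind `Archive::forceFetchingWithoutLaunchingArchiving()`; the class layout and names are illustrative):

```python
class Archive:
    # When set, a missing archive is treated as "no data" rather than
    # triggering (re)processing of the subperiod.
    force_fetch_without_launching = False

    def __init__(self, store):
        self.store = store      # existing archive data
        self.launched = []      # records which archives we (re)processed

    def get(self, key):
        if key in self.store:
            return self.store[key]
        if Archive.force_fetch_without_launching:
            return 0  # behave as if the subperiod simply has no data
        self.launched.append(key)
        self.store[key] = 0  # stand-in for real aggregation
        return self.store[key]
```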
I hope that helped, feel free to ask questions if something's not clear.