@tsteur opened this Issue on March 11th 2021 Member

When configuring eg process_new_segments_from = "segment_creation_time" and you create a segment on December 12th 2021, then we're not supposed to archive any data before that.

The cron archive logic respects that. However, it seems that when it starts processing the week, month and year then it will archive for example all the days from Dec 1st - Dec 12th as part of the month archive and all the previous days from Jan 1st to end November as part of the year archive. This means there's more archiving than it should be happening and because it would be trying to archive hundreds of days as part of the year archive it would likely eventually also run into memory issues etc.

Goal would be to not launch the archiver for example here: https://github.com/matomo-org/matomo/blob/4.2.1/core/Archive.php#L560-L563 and/or here https://github.com/matomo-org/matomo/blob/4.2.1/plugins/CoreAdminHome/API.php#L257 or somewhere else when it is trying to archive a day that is not supposed to be archived based on above config setting.

@diosmosis commented on March 14th 2021 Member

@tsteur should this be done just for Cloud or for matomo core? (ie, the pluginOnly change was just done for cloud)

@tsteur commented on March 14th 2021 Member

@diosmosis should be done in both

@flamisz commented on March 15th 2021 Contributor

I spend a couple of hours on this issue already (mostly learned about the archiving process itself) but still not getting closer. I happy to spend more, but would be maybe more useful if I can do pair programming on this with someone, or if someone can look at it and give some explanation :).
I'm not even sure the logic respects that setting for the days itself. I created a brand new matomo, with empty db, and when I run the archive it found events in the past (2021-03-11 was before the date of the creation of the segment)

INFO [2021-03-15 05:14:33] 22140  Starting Matomo reports archiving...
INFO [2021-03-15 05:14:33] 22140  Start processing archives for site 1.
INFO [2021-03-15 05:14:33] 22140    Will invalidate archived reports for today in site ID = 1's timezone (2021-03-15 00:00:00).
INFO [2021-03-15 05:14:35] 22140  Archived website id 1, period = day, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 0.998s
INFO [2021-03-15 05:14:35] 22140  Archived website id 1, period = day, date = 2021-03-11, segment = 'actions>=2', 3 visits found. Time elapsed: 1.288s
INFO [2021-03-15 05:14:37] 22140  Archived website id 1, period = week, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 1.269s
INFO [2021-03-15 05:14:37] 22140  Archived website id 1, period = week, date = 2021-03-08, segment = 'actions>=2', 11 visits found. Time elapsed: 2.143s
INFO [2021-03-15 05:14:38] 22140  Archived website id 1, period = month, date = 2021-03-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.909s
INFO [2021-03-15 05:14:39] 22140  Archived website id 1, period = year, date = 2021-01-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.903s
INFO [2021-03-15 05:14:39] 22140  Finished archiving for site 1, 6 API requests, Time elapsed: 6.263s [1 / 1 done]
INFO [2021-03-15 05:14:40] 22140  Done archiving!

Any help appreciated (I just don't want to throw away extra hours, if someone can solve/explain it in less than 1 hour).
ping @tsteur @diosmosis @sgiehl


@sgiehl commented on March 15th 2021 Member

@flamisz If I understood that correctly the config for process_new_segments_from is actually only used when a segment is created/updated. In this case the segment archives are invalidated for the past (depending on the setting).
The normal archiving process doesn't use the setting. So if you hadn't run the archiver before, the archiver will also archive newly created segments for periods that hadn't been processed at all before.

@tsteur commented on March 15th 2021 Member

@flamisz not sure which config setting you used for the segments?

We can have a quick chat about it tmrw. Generally, when using eg process_new_segments_from = "segment_last_edit_time" (or any other setting that is not the default value beginning of time) then the goal would be to not archive any day before that day. If you had segment_creation_time then the archiver should not have launched that archive for the day (this be one bug), and another bug be that the day may be archived as part of a month or yearly archive.

@diosmosis commented on March 15th 2021 Member

For the solution it may help to look at Cloud#250 as the solution here should be fairly similar.

@diosmosis commented on March 16th 2021 Member

@flamisz if you need more context I can explain the entire problem, too. It's a bug that involves four separate systems (core archiving/aggregation logic, archive invalidation, archive processing and segment creation/updating), so it makes sense it would be hard to understand w/o context.

@flamisz commented on March 16th 2021 Contributor

@diosmosis that would be great thanks. I see it's more complex than I expected and most of these are new for me.

@diosmosis commented on March 16th 2021 Member

@flamisz here is my best attempt, I'll explain how it emerges by explaining each system. The explanations are not completely in depth because there's still a lot of difference that isn't super important to know. Also I don't know how much you already know so feel free to ignore anything I'm explaining to you twice.

core archiving/aggregation

core archiving is the logic that starts in Archive.php. The Archive class let's you query archive data. If the data is not found, and the current request is allowed to initiate aggregation (eg, $_GET['trigger'] = 'archivephp' or browser initiated archiving is enabled in the UI), the process is started and individual plugin Archivers get invoked. This is handled by the Loader.php/PluginsArchiver.php classes.

The way aggregation is done is different based on period. Day periods result in LogAggregator being used to query data over log tables. This is too much data to do over weeks and months and years, so for periods that are greater than a day, we sum up the reports within the period. For example, for week periods, we sum up the days within the week. For month periods we sum up the weeks inside the month + the days at the edges. For years we use the months in the year. This is handled by the ArchiveProcessor class. Specifically, the class creates it's own Archive instance to query each subperiod: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L116-L118 . This instance is used to query data to aggregate, for example: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L346 . The recursive use of Archive means a year archive can eventually initiate archiving for a day period, if the archive doesn't already exist.

archive invalidation/archive processing

When an archive needs to be reprocessed for some reason, it is marked as invalidated. This happens differently for browser triggered archiving and for core:archive. For browser triggered archiving, we mark existing archives in the table as DONE_INVALIDATED. When a user requests a report from one of those archives, we notice it is DONE_INVALIDATED in Loader.php and make sure not to use it, and instead initiate archiving as if there was no data. This works for browser triggered archiving because it will happen for every user request, we only need to process an archive if a user requests it.

For core:archive, however, we can't do this, we want to process all invalidated archives. And it is too expensive to query every archive table for invalidated archives each time, so we use the archive_invalidations table. We insert invalidations here and pull in the proper order to process them (the proper order being lower periods first so days, weeks, months, years & segments after the "all visits" archive). We do lower periods first so higher periods will not initiate archiving for lower periods in a single request.

When processing an invalidation, we basically create a new process (either an HTTP request or an instance of the climulti:request command) that initiates the same core archiving process via Loader.php (starting in an API method in CoreAdminHome).

Invalidating a lower period necessitates invalidating higher periods. For example, invalidating a day because it has more visits, means the week, month and year it is in, is no longer accurate.

segment creation/updating

When a segment is created/updated, we want to reprocess data for past dates. The amount of data to process is controlled by the INI config mentioned above. We do this by invalidating past data. This is only really an issue for core:archive. For browser triggered archiving, users will request the new/edited segment, and since the hash will be new/different there will be no archive data to serve. So we initiate archiving.

For core:archive, however, we invalidate dates in the past so the command will see them in the archive_invalidations table and process them. If process_new_segments_from is set to, eg, last6, we would archive the last6 months (there are other values we can use, but I'll just focus on this one).

So let's say today, on March 15th, we modify a segment. Now we want to rearchive the last 6 months. So we invalidate the days from Oct 15th => Mar 15th. This means the weeks and months also get invalidated. And also, crucially, the years, 2020 & 2021.
Note: we don't end up invalidating Jan-Oct 15th, just October 15th onwards. So the year 2020 is invalidated, but many of the months inside are not invalidated, and also archive data for it does not exist (because the segment hash changed when the segment was edited).

the bug

So the bug is this:

  • a user creates/edits a segment on today, March 15th
  • we invalidate in the past from October 15th => March 15th, the year 2020 and 2021 gets invalidated
  • core:archive archives the months and weeks and days between October 15th => March 15th
  • then core:archive archives the years 2020/2021
  • since there is no data for January for this segment, the month archive is initiated. and so are the weeks, and days in side the month. (and the same for every month until october)

This is all done in a single request so not only does it create a lot of load for a lot of unnecessary work, memory leaks can add up and cause process crashes.

how it was fixed for plugin archives

This problem also occurs for plugin archives. When certain plugins are activated/deactivated or certain entities are updated/created (eg, a funnel or custom report), we want historical data to be processed for them, so we invalidate in the past just as we do for segments. And since no data will exist in the same way, we end up doing way too much archiving.

This was fixed by introducing Archive::forceFetchingWithoutLaunchingArchiving(). When we detect one of these kind of requests (because pluginOnly=1 was set in the request), we set it, which makes sure child archives aren't re-processed if there's no data.

A similar fix would work here, however it's a bit more complicated, because there isn't a way to differentiate between whether an archiving request was made because a segment was created/updated or because the user actually invalidated in the past. For plugin archives, pluginOnly=1 is only set in this case, so we know we can do this.

I hope that helped, feel free to ask questions if something's not clear.

@flamisz commented on March 16th 2021 Contributor

Thanks @diosmosis for the detailed explanation. I'll go through it and let you know if I still have question.

This Issue was closed on May 7th 2021
Powered by GitHub Issue Mirror