
Setting process_new_segments_from not respected when archiving bigger periods #17336

Closed
tsteur opened this issue Mar 11, 2021 · 10 comments · Fixed by #17351
Labels
Bug For errors / faults / flaws / inconsistencies etc. c: Performance For when we could improve the performance / speed of Matomo. Regression Indicates a feature used to work in a certain way but it no longer does even though it should.

tsteur (Member) commented Mar 11, 2021

When configuring e.g. process_new_segments_from = "segment_creation_time" and a segment is created on December 12th 2021, then we're not supposed to archive any data from before that date.

The cron archive logic respects that. However, it seems that when it starts processing the week, month and year, it will archive for example all the days from Dec 1st - Dec 12th as part of the month archive and all the previous days from Jan 1st to end of November as part of the year archive. This means more archiving happens than should, and because it would be trying to archive hundreds of days as part of the year archive, it would likely eventually also run into memory issues etc.

The goal would be to not launch the archiver, for example here: https://github.com/matomo-org/matomo/blob/4.2.1/core/Archive.php#L560-L563 and/or here https://github.com/matomo-org/matomo/blob/4.2.1/plugins/CoreAdminHome/API.php#L257 or somewhere else, when it is trying to archive a day that is not supposed to be archived based on the above config setting.
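A minimal sketch of the kind of guard described above (in Python with hypothetical names; the real check would live in Matomo's PHP code, and only the segment_creation_time mode is modeled here):

```python
from datetime import date

# Hypothetical guard mirroring process_new_segments_from = "segment_creation_time":
# a day should only be archived for a segment if it is not older than the
# segment itself. (Names here are illustrative, not Matomo's actual API.)
def should_archive_day_for_segment(day, segment_created_on, setting):
    if setting != "segment_creation_time":
        return True  # other modes are not modeled in this sketch
    return day >= segment_created_on
```

With a segment created on December 12th 2021, December 11th would be skipped and December 13th would be archived.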

@tsteur tsteur added Bug For errors / faults / flaws / inconsistencies etc. c: Performance For when we could improve the performance / speed of Matomo. labels Mar 11, 2021
@tsteur tsteur added this to the 4.4.0 milestone Mar 11, 2021
@tsteur tsteur added the Regression Indicates a feature used to work in a certain way but it no longer does even though it should. label Mar 11, 2021
@tsteur tsteur modified the milestones: 4.4.0, 4.3.0 Mar 11, 2021
@flamisz flamisz self-assigned this Mar 12, 2021
diosmosis (Member)

@tsteur should this be done just for Cloud or for Matomo core? (i.e., the pluginOnly change was only done for Cloud)

tsteur (Member, Author) commented Mar 14, 2021

@diosmosis should be done in both

flamisz (Contributor) commented Mar 15, 2021

I've spent a couple of hours on this issue already (mostly learning about the archiving process itself) but I'm still not getting closer. I'm happy to spend more, but it would maybe be more useful if I could pair program on this with someone, or if someone could look at it and give some explanation :).
I'm not even sure the logic respects that setting for the days themselves. I created a brand new Matomo with an empty db, and when I ran the archiver it found events in the past (2021-03-11 was before the date the segment was created):

START
INFO [2021-03-15 05:14:33] 22140  Starting Matomo reports archiving...
INFO [2021-03-15 05:14:33] 22140  Start processing archives for site 1.
INFO [2021-03-15 05:14:33] 22140    Will invalidate archived reports for today in site ID = 1's timezone (2021-03-15 00:00:00).
INFO [2021-03-15 05:14:35] 22140  Archived website id 1, period = day, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 0.998s
INFO [2021-03-15 05:14:35] 22140  Archived website id 1, period = day, date = 2021-03-11, segment = 'actions>=2', 3 visits found. Time elapsed: 1.288s
INFO [2021-03-15 05:14:37] 22140  Archived website id 1, period = week, date = 2021-03-15, segment = 'actions>=2', 1 visits found. Time elapsed: 1.269s
INFO [2021-03-15 05:14:37] 22140  Archived website id 1, period = week, date = 2021-03-08, segment = 'actions>=2', 11 visits found. Time elapsed: 2.143s
INFO [2021-03-15 05:14:38] 22140  Archived website id 1, period = month, date = 2021-03-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.909s
INFO [2021-03-15 05:14:39] 22140  Archived website id 1, period = year, date = 2021-01-01, segment = 'actions>=2', 12 visits found. Time elapsed: 0.903s
INFO [2021-03-15 05:14:39] 22140  Finished archiving for site 1, 6 API requests, Time elapsed: 6.263s [1 / 1 done]
INFO [2021-03-15 05:14:40] 22140  Done archiving!

Any help appreciated (I just don't want to throw away extra hours, if someone can solve/explain it in less than 1 hour).
ping @tsteur @diosmosis @sgiehl

Thanks

sgiehl (Member) commented Mar 15, 2021

@flamisz If I understood that correctly, the config process_new_segments_from is actually only used when a segment is created/updated. In this case the segment's archives are invalidated for the past (depending on the setting).
The normal archiving process doesn't use the setting. So if you hadn't run the archiver before, it will also archive newly created segments for periods that had never been processed at all.

tsteur (Member, Author) commented Mar 15, 2021

@flamisz not sure which config setting you used for the segments?

We can have a quick chat about it tomorrow. Generally, when using e.g. process_new_segments_from = "segment_last_edit_time" (or any other setting that is not the default value, beginning_of_time), the goal would be to not archive any day before that day. If you had segment_creation_time, then the archiver should not have launched that archive for the day (this would be one bug), and another bug would be that the day may be archived as part of a month or yearly archive.

diosmosis (Member)

For the solution it may help to look at Cloud#250 as the solution here should be fairly similar.

diosmosis (Member)

@flamisz if you need more context I can explain the entire problem, too. It's a bug that involves four separate systems (core archiving/aggregation logic, archive invalidation, archive processing and segment creation/updating), so it makes sense it would be hard to understand w/o context.

flamisz (Contributor) commented Mar 16, 2021

@diosmosis that would be great thanks. I see it's more complex than I expected and most of these are new for me.

diosmosis (Member) commented Mar 16, 2021

@flamisz here is my best attempt; I'll explain how the bug emerges by explaining each system. The explanations are not completely in depth because there's still a lot of detail that isn't super important to know. Also, I don't know how much you already know, so feel free to ignore anything I'm explaining to you twice.

core archiving/aggregation

core archiving is the logic that starts in Archive.php. The Archive class lets you query archive data. If the data is not found, and the current request is allowed to initiate aggregation (e.g., $_GET['trigger'] = 'archivephp' or browser initiated archiving is enabled in the UI), the process is started and individual plugin Archivers get invoked. This is handled by the Loader.php/PluginsArchiver.php classes.

The way aggregation is done is different based on the period. Day periods result in LogAggregator being used to query data over log tables. This is too much data to do over weeks, months and years, so for periods greater than a day, we sum up the reports within the period. For example, for week periods, we sum up the days within the week. For month periods we sum up the weeks inside the month + the days at the edges. For years we use the months in the year. This is handled by the ArchiveProcessor class. Specifically, the class creates its own Archive instance to query each subperiod: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L116-L118 . This instance is used to query data to aggregate, for example: https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor.php#L346 . The recursive use of Archive means a year archive can eventually initiate archiving for a day period, if the archive doesn't already exist.
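The subperiod decomposition can be sketched like this (a simplified Python model; the real logic in ArchiveProcessor also mixes weeks and edge days when aggregating months, which is not modeled here):

```python
from datetime import date, timedelta

def days_in_week(week_start):
    """A week archive sums the 7 day archives inside it."""
    return [week_start + timedelta(days=i) for i in range(7)]

def months_in_year(year):
    """A year archive sums the 12 month archives inside it (simplified)."""
    return [date(year, m, 1) for m in range(1, 13)]

# If any of these subperiod archives is missing, Archive may recursively
# initiate archiving for it; this is how a year archive can end up
# launching day archiving.
```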

archive invalidation/archive processing

When an archive needs to be reprocessed for some reason, it is marked as invalidated. This happens differently for browser triggered archiving and for core:archive. For browser triggered archiving, we mark existing archives in the table as DONE_INVALIDATED. When a user requests a report from one of those archives, we notice it is DONE_INVALIDATED in Loader.php and make sure not to use it, and instead initiate archiving as if there was no data. This works for browser triggered archiving because it happens on every user request; we only need to process an archive if a user requests it.

For core:archive, however, we can't do this; we want to process all invalidated archives. And it is too expensive to query every archive table for invalidated archives each time, so we use the archive_invalidations table. We insert invalidations there and pull them in the proper order to process them (the proper order being lower periods first: days, then weeks, months, years; and segment archives after the "all visits" archive). We do lower periods first so higher periods will not initiate archiving for lower periods in a single request.
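The ordering rule could be modeled like this (Python sketch; the period ranks and queue shape are assumptions for illustration, not Matomo's actual schema):

```python
# Assumed period ranks: lower periods are processed first so that a higher
# period never has to launch archiving for a lower one mid-request.
PERIOD_RANK = {"day": 1, "week": 2, "month": 3, "year": 4}

def processing_order(invalidations):
    # Within the same period, the "all visits" archive (empty segment)
    # comes before segment archives.
    return sorted(invalidations,
                  key=lambda inv: (PERIOD_RANK[inv["period"]], inv["segment"] != ""))

queue = [
    {"period": "year", "segment": ""},
    {"period": "day", "segment": "actions>=2"},
    {"period": "day", "segment": ""},
    {"period": "month", "segment": ""},
]
```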

When processing an invalidation, we basically create a new process (either an HTTP request or an instance of the climulti:request command) that initiates the same core archiving process via Loader.php (starting in an API method in CoreAdminHome).

Invalidating a lower period necessitates invalidating higher periods. For example, invalidating a day because it has more visits means the week, month and year it is in are no longer accurate.
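This cascade can be sketched in Python (assuming weeks start on Monday, as in the log output above):

```python
from datetime import date, timedelta

def cascaded_invalidations(day):
    """Invalidating a day also invalidates the week, month and year
    that contain it (simplified model)."""
    week_start = day - timedelta(days=day.weekday())  # Monday of that week
    return {
        "week": week_start,
        "month": date(day.year, day.month, 1),
        "year": date(day.year, 1, 1),
    }
```

For example, invalidating 2021-03-11 also invalidates the week starting 2021-03-08, the month 2021-03, and the year 2021, matching the week/month/year dates in the earlier log output.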

segment creation/updating

When a segment is created/updated, we want to reprocess data for past dates. The amount of data to process is controlled by the INI config mentioned above. We do this by invalidating past data. This is only really an issue for core:archive. For browser triggered archiving, users will request the new/edited segment, and since the hash will be new/different there will be no archive data to serve. So we initiate archiving.

For core:archive, however, we invalidate dates in the past so the command will see them in the archive_invalidations table and process them. If process_new_segments_from is set to, eg, last6, we would archive the last6 months (there are other values we can use, but I'll just focus on this one).

So let's say today, on March 15th, we modify a segment. Now we want to rearchive the last 6 months. So we invalidate the days from Oct 15th => Mar 15th. This means the weeks and months also get invalidated. And also, crucially, the years, 2020 & 2021.
Note: we don't end up invalidating Jan-Oct 15th, just October 15th onwards. So the year 2020 is invalidated, but many of the months inside are not invalidated, and also archive data for it does not exist (because the segment hash changed when the segment was edited).
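The lastN window can be sketched like this (Python; I'm assuming the current month counts as one of the N, which matches the example above, and ignoring day-of-month overflow such as the 31st mapping into February):

```python
from datetime import date

def last_n_invalidation_start(today, n_months):
    """Earliest day to invalidate for process_new_segments_from = "lastN".
    Assumption: the current month counts as one of the N, so last6 on
    March 15th 2021 reaches back to October 15th 2020."""
    month = today.month - (n_months - 1)
    year = today.year
    while month <= 0:
        month += 12
        year -= 1
    return date(year, month, today.day)
```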

the bug

So the bug is this:

  • a user creates/edits a segment today, March 15th
  • we invalidate the past from October 15th => March 15th; the years 2020 and 2021 get invalidated
  • core:archive archives the months, weeks and days between October 15th => March 15th
  • then core:archive archives the years 2020/2021
  • since there is no data for January for this segment, the month archive is initiated, and so are the weeks and days inside the month (and the same for every month until October)

This is all done in a single request so not only does it create a lot of load for a lot of unnecessary work, memory leaks can add up and cause process crashes.
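The blow-up in the last step can be illustrated with a toy count (Python; the month numbers are illustrative only):

```python
def months_rearchived_for_year(months_with_existing_archives):
    """When a year archive is aggregated, every month without existing
    archive data triggers a fresh month archive (which in turn launches
    its weeks and days). Simplified model of the recursive Archive use."""
    return [m for m in range(1, 13) if m not in months_with_existing_archives]

# Segment edited March 15th with last6: for the year 2020, only roughly
# Oct-Dec were invalidated and archived, so aggregating 2020 launches
# month archiving for Jan-Sep as a side effect.
```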

how it was fixed for plugin archives

This problem also occurs for plugin archives. When certain plugins are activated/deactivated or certain entities are updated/created (eg, a funnel or custom report), we want historical data to be processed for them, so we invalidate in the past just as we do for segments. And since no data will exist in the same way, we end up doing way too much archiving.

This was fixed by introducing Archive::forceFetchingWithoutLaunchingArchiving(). When we detect one of these kinds of requests (because pluginOnly=1 was set in the request), we set it, which makes sure child archives aren't re-processed if there's no data.

A similar fix would work here; however, it's a bit more complicated, because there isn't a way to differentiate whether an archiving request was made because a segment was created/updated or because the user actually invalidated data in the past. For plugin archives, pluginOnly=1 is only set in this case, so we know we can do this.
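A toy model of the flag-based fix (Python; Archive here is a stand-in for Matomo's PHP class, and the storage shape is invented for illustration):

```python
class Archive:
    """When force_fetch_only is set (mirroring the effect of
    Archive::forceFetchingWithoutLaunchingArchiving()), a missing child
    archive is returned as empty instead of launching child archiving."""
    force_fetch_only = False

    @classmethod
    def forceFetchingWithoutLaunchingArchiving(cls):
        cls.force_fetch_only = True

    def __init__(self, existing_archives):
        self.archives = dict(existing_archives)  # {"period:date": data}
        self.launched = []                       # child archiving we triggered

    def get(self, key):
        if key in self.archives:
            return self.archives[key]
        if Archive.force_fetch_only:
            return None  # fetch-only mode: treat missing data as empty
        self.launched.append(key)  # normal mode: launch child archiving
        self.archives[key] = "archived:" + key
        return self.archives[key]
```

In normal mode a miss launches archiving; with the flag set, the miss is simply returned as empty, so a year archive never fans out into month/week/day archiving.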

I hope that helped, feel free to ask questions if something's not clear.

flamisz (Contributor) commented Mar 16, 2021

Thanks @diosmosis for the detailed explanation. I'll go through it and let you know if I still have questions.
