Add option to specify the max number of websites to process by archiver. #18326

Merged
merged 4 commits into matomo-org:4.x-dev on Nov 25, 2021

Conversation

EreMaijala
Contributor

Description:

This change makes it possible to limit the process lifetime to avoid excessive memory usage (possibly due to PHP bugs such as 79519).

Rationale:

We have an issue with the archiver and a large number (~3900) of sites where the archiver process runs out of memory. We have tried to increase memory_limit up to 12 GB to no avail, so this is an attempt to limit the lifetime of the archiver to avoid this excessive memory usage.

As a side note, there seem to be some static or long-lived arrays that consume memory during archiving, such as the transient cache, but they don't seem to explain everything. I suspect PHP issues such as 79519 contribute to this.
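For context, the new option is meant to be passed to the archiver invocation, for example (the limit value here is only an example):

    ./console core:archive --max-websites-to-process=500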

@EreMaijala
Contributor Author

Looks like I can't add the "Needs Review" label.

@Findus23 Findus23 added the "Needs Review" label Nov 16, 2021
@bx80 bx80 added this to the 4.7.0 milestone Nov 17, 2021
Contributor

@bx80 bx80 left a comment

@EreMaijala Thank you for this contribution, it looks like a very useful option for installations with a large number of websites. It works well in the tests I've done and I can't see any issues. @sgiehl could you take a quick look in case I've overlooked anything?

@tsteur
Member

tsteur commented Nov 17, 2021

@EreMaijala @bx80 I wonder if it would generally make more sense to stop after a certain number of archives has been processed? That way it may work better (for example, if one site has heaps of invalidations and therefore lots to archive, which can happen e.g. when creating or changing a segment), and it would potentially also work better for other people with fewer sites but a similar issue. Not sure if that would help or make any sense?

@EreMaijala
Contributor Author

@tsteur Stopping after a certain number of archives would probably be even better, but as the process currently works, the next execution always seems to pick the next site rather than the next archive to process, and making it work smoothly at the archive level would require much more extensive changes.

@EreMaijala
Contributor Author

@tsteur To elaborate on the previous: since the queue contains site IDs and each archiver takes a site ID from the queue to process its archives, it's not easy to stop in the middle of a site. Perhaps it would be possible to put the site back into the queue, but that feels like a hack to me.

@tsteur
Member

tsteur commented Nov 17, 2021

@EreMaijala not sure if it helps, I was thinking something like this might work:

diff --git a/core/CronArchive.php b/core/CronArchive.php
index bbdd8a691..a48e1098c 100644
--- a/core/CronArchive.php
+++ b/core/CronArchive.php
@@ -389,6 +389,13 @@ class CronArchive
 
             $successCount = $this->launchArchivingFor($archivesToProcess, $queueConsumer);
             $numArchivesFinished += $successCount;
+
+
+            if ($this->maxArchivesToProcess && $numArchivesFinished >= $this->maxArchivesToProcess) {
+                $this->logger->info("Maximum number of archives to process per execution has been reached.");
+                break;
+            }
+
         }

It's not the full code as maxArchivesToProcess isn't set initially. Also haven't tested it.
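For completeness, the missing wiring might look roughly like this (untested sketch; the property and option handling below are assumptions, not the actual implementation):

    // In core/CronArchive.php: limit on archives per run, 0 = unlimited (sketch).
    public $maxArchivesToProcess = 0;

    // In the core:archive console command, after defining a corresponding
    // --max-archives-to-process option, pass its value on before running the archiver:
    $archiver = new CronArchive();
    $archiver->maxArchivesToProcess = (int) $input->getOption('max-archives-to-process');
    $archiver->main();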

@EreMaijala
Contributor Author

@tsteur I believe that's what I initially tried, but that resulted in the situation I described, where the execution can stop in the middle of a site, and the next execution then starts from the next site. As far as I can see, the archiver will detect the skipped ones after a day (with the message "Found X failed jobs (ts_invalidated older than 1 day), resetting status to try them again."), but that means their processing will be delayed, and they may be skipped again if processing stops at just the wrong moment.

To illustrate this, I added some output and a hard-coded limit of 10 archives:

            $successCount = $this->launchArchivingFor($archivesToProcess, $queueConsumer);
            $numArchivesFinished += $successCount;
            echo "Archives finished: $numArchivesFinished\nLast processed:\n";
            foreach ($archivesToProcess as $current) {
                echo "Site: {$current['idsite']}, archive: {$current['idarchive']}, invalidation: {$current['idinvalidation']}, period: {$current['period']}\n";
            }
            if ($numArchivesFinished >= 10) {
                echo "STOP\n";
                break;
            }

Results from first execution:

INFO [2021-11-18 10:27:52] 61920  START
INFO [2021-11-18 10:27:52] 61920  Starting Matomo reports archiving...
INFO [2021-11-18 10:27:52] 61920  Start processing archives for site 423.
INFO [2021-11-18 10:27:53] 61920  Archived website id 423, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.822s
Archives finished: 1
Last processed:
Site: 423, archive: , invalidation: 1765, period: 2
INFO [2021-11-18 10:27:53] 61920  Archived website id 423, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.584s
Archives finished: 2
Last processed:
Site: 423, archive: , invalidation: 1766, period: 3
INFO [2021-11-18 10:27:54] 61920  Archived website id 423, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.570s
Archives finished: 3
Last processed:
Site: 423, archive: , invalidation: 1768, period: 4
INFO [2021-11-18 10:27:54] 61920  Finished archiving for site 423, 3 API requests, Time elapsed: 2.445s [1 / 580 done]
INFO [2021-11-18 10:27:54] 61920  Start processing archives for site 424.
INFO [2021-11-18 10:27:55] 61920  Archived website id 424, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.805s
Archives finished: 4
Last processed:
Site: 424, archive: , invalidation: 1769, period: 2
INFO [2021-11-18 10:27:56] 61920  Archived website id 424, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.574s
Archives finished: 5
Last processed:
Site: 424, archive: , invalidation: 1770, period: 3
INFO [2021-11-18 10:27:57] 61920  Archived website id 424, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.586s
Archives finished: 6
Last processed:
Site: 424, archive: , invalidation: 1772, period: 4
INFO [2021-11-18 10:27:57] 61920  Finished archiving for site 424, 3 API requests, Time elapsed: 2.413s [2 / 580 done]
INFO [2021-11-18 10:27:57] 61920  Start processing archives for site 425.
INFO [2021-11-18 10:27:58] 61920  Archived website id 425, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.820s
Archives finished: 7
Last processed:
Site: 425, archive: , invalidation: 1773, period: 2
INFO [2021-11-18 10:27:58] 61920  Archived website id 425, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.574s
Archives finished: 8
Last processed:
Site: 425, archive: , invalidation: 1774, period: 3
INFO [2021-11-18 10:27:59] 61920  Archived website id 425, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.582s
Archives finished: 9
Last processed:
Site: 425, archive: , invalidation: 1776, period: 4
INFO [2021-11-18 10:27:59] 61920  Finished archiving for site 425, 3 API requests, Time elapsed: 2.421s [3 / 580 done]
INFO [2021-11-18 10:27:59] 61920  Start processing archives for site 426.
INFO [2021-11-18 10:28:00] 61920  Archived website id 426, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.821s
Archives finished: 10
Last processed:
Site: 426, archive: , invalidation: 1777, period: 2
STOP
INFO [2021-11-18 10:28:00] 61920  Done archiving!

Results from second execution:

INFO [2021-11-18 10:28:10] 61981  START
INFO [2021-11-18 10:28:10] 61981  Starting Matomo reports archiving...
INFO [2021-11-18 10:28:10] 61981  Start processing archives for site 427.
INFO [2021-11-18 10:28:11] 61981  Archived website id 427, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.832s
Archives finished: 1
Last processed:
Site: 427, archive: , invalidation: 1781, period: 2
[...]

It starts from the next site (427) instead of finishing periods 3 and 4 for site 426. This is the reason I made it stop after processing all archives for a site.

This said, it would be possible to do the check only when getNextArchivesToProcess returns an empty array, which indicates that it has completed archiving for the current website. That, however, would make the point where it stops inexact, more like 'Stop after processing at least X archives', which would make it not so different from max sites to process, but might be easier to use when large batches of invalidations occur only occasionally. I'll add that as an option, and you can then decide whether to drop either one. For us it doesn't make much difference as we can usually process something like 3000 invalidations before memory_limit is exceeded.
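In code, that variant might look roughly like this (simplified sketch; the surrounding loop is omitted):

    // Sketch: $archivesToProcess comes from $queueConsumer->getNextArchivesToProcess()
    // at the top of the archiving loop. An empty array means the current site is done.
    if (is_array($archivesToProcess) && empty($archivesToProcess)) {
        // Safe point to stop: no site is left half-archived. This makes the option a
        // soft limit, i.e. "stop after processing at least X archives".
        if ($this->maxArchivesToProcess
            && $numArchivesFinished >= $this->maxArchivesToProcess
        ) {
            $this->logger->info("Maximum number of archives to process per execution has been reached.");
            break;
        }
    }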

@EreMaijala
Contributor Author

@tsteur New option added. I think I'd prefer to keep both options, since that would allow specifying both and stopping when either limit is reached. I also considered naming the option something other than max-archives-to-process, but decided against it since it could perhaps become a hard maximum if the queue processing changes in the future.

@tsteur
Member

tsteur commented Nov 18, 2021

That, however, would make the point where it stops inexact, more like 'Stop after processing at least X archives', which would make it not so different from max sites to process, but might be easier to use when large batches of invalidations occur only occasionally.

I reckon it wouldn't be a big issue since most of the time it's only launching like 3 or so in parallel I think.

believe that's what I initially tried, but that resulted in the situation I described, where the execution can stop in the middle of a site, and the next execution then starts from the next site. ... It starts from the next site (427) instead of finishing periods 3 and 4 for site 426. This is the reason I made it stop after processing all archives for a site.

Indeed. It would continue with the next sites, then, once all the other sites have been worked on, continue again from the beginning and process the ones that haven't been processed yet. Depending on how quick the archiving is and how many sites there are, it would continue the aborted site in, say, 2 hours, or maybe later. It should usually not be an issue, but it depends on the situation.

@EreMaijala
Contributor Author

@tsteur Ah, I'd missed that part and thought it would always take 24 hours to resume. It's not that bad then, sorry for the confusion. I'm still not convinced that it's not a problem at all, as you could get somewhat wonky results when looking at analytics for a site that's only partially archived, no? I mean, it's obviously not processed if the information is not available at all, but if you have, say, day and week but no month or year for several hours, it could be confusing.

@tsteur
Member

tsteur commented Nov 21, 2021

@EreMaijala It totally depends on the users of Matomo. For us it's not an issue, and on our own Matomo we even calculate month and year, for example, less often, as having them up to date is not as important. If the data in the year is, say, 2-4 hours behind, out of up to 8800 hours, then it's not too big of an issue. If you were to sum up every week and day manually and then come to a different result than the month period, that could be confusing.

@EreMaijala
Contributor Author

@tsteur Fair enough. I've now updated this to make --max-archives-to-process a hard limit while keeping --max-websites-to-process as well. I believe this allows enough flexibility for most use cases.
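For example, both limits could then be combined in a single invocation (the values are illustrative only):

    ./console core:archive --max-websites-to-process=200 --max-archives-to-process=3000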

@bx80
Contributor

bx80 commented Nov 24, 2021

This all looks good to me and works well in testing, thanks @EreMaijala for adding both options. @tsteur Okay to merge?

@tsteur
Member

tsteur commented Nov 24, 2021

Yes, all good to merge. Technically we might want to create an FAQ for it. Maybe it needs to be mentioned at least on https://matomo.org/docs/setup-auto-archiving/

@bx80 bx80 merged commit e43d2d5 into matomo-org:4.x-dev Nov 25, 2021
@bx80
Contributor

bx80 commented Nov 25, 2021

I've updated the core:archive help output in the setup auto archiving user guide and also created a draft FAQ explaining how these options work and where they should be added.

@EreMaijala
Contributor Author

Thank you all!
