@EreMaijala opened this Pull Request on November 16th 2021 Contributor

Description:

This change makes it possible to limit the archiver process lifetime to avoid excessive memory usage (possibly due to PHP bugs such as #79519).

Rationale:

We have an issue with the archiver and a large number (~3900) of sites where the archiver process runs out of memory. We have tried to increase memory_limit up to 12 GB to no avail, so this is an attempt to limit the lifetime of the archiver to avoid this excessive memory usage.

As a side note, there seem to be some static or long-lived arrays that consume memory during archiving, such as the transient cache, but they don't seem to explain everything. I suspect PHP issues such as #79519 contribute to this.
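
To sketch the idea in isolation: a minimal, hypothetical loop with a lifetime guard. The names $maxLifetime and doUnitOfWork() are illustrative stand-ins, not the actual CronArchive code.

    <?php
    // Minimal sketch of the lifetime-limit idea (not the PR's actual code):
    // stop the main loop once the process has run for a configured number of
    // seconds, so that the next cron run resumes the remaining queue with a
    // fresh (unleaked) memory footprint.
    $maxLifetime = 3600; // seconds; 0 would mean "no limit"
    $start = time();

    function doUnitOfWork(): bool {
        static $remaining = 5;   // stand-in for "archive the next site"
        return --$remaining > 0; // false once the queue is empty
    }

    while (doUnitOfWork()) {
        if ($maxLifetime > 0 && (time() - $start) >= $maxLifetime) {
            echo "Maximum process lifetime reached, exiting.\n";
            break;
        }
    }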

Review

@EreMaijala commented on November 16th 2021 Contributor

Looks like I can't add the "Needs Review" label.

@tsteur commented on November 17th 2021 Member

@EreMaijala @bx80 I wonder if it would generally make more sense to stop after a certain number of processed archives? That way it might work better when one site has heaps of invalidations and therefore lots to archive (which can happen e.g. when creating or changing a segment), and it would potentially also help other people who have fewer sites but a similar issue. Not sure if that would help or make any sense?

@EreMaijala commented on November 17th 2021 Contributor

@tsteur Stopping after a certain number of archives would probably be even better, but as the process currently works, the next execution always seems to pick the next site rather than the next archive to process, and making it work smoothly at the archive level would need much more extensive changes.

@EreMaijala commented on November 17th 2021 Contributor

@tsteur To elaborate on the previous point: since the queue contains site IDs and each archiver takes a site ID from the queue and processes that site's archives, it's not easy to stop in the middle of a site. Perhaps it would be possible to put the site back into the queue, but that feels like a hack to me.
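
To illustrate the shape of the problem, here's a rough, hypothetical sketch (the queue, data, and names below are illustrative, not Matomo's actual classes):

    <?php
    // The work queue holds site IDs, so a worker that claims an ID is
    // committed to archiving every pending invalidation for that site
    // before returning to the queue -- which is why stopping mid-site
    // is awkward.
    $siteQueue = [423, 424, 425];
    $invalidations = [
        423 => ['week', 'month', 'year'],
        424 => ['week', 'month', 'year'],
        425 => ['week', 'month', 'year'],
    ];

    while (($idSite = array_shift($siteQueue)) !== null) {
        foreach ($invalidations[$idSite] as $period) {
            echo "Archiving site $idSite, period $period\n";
            // Breaking out here would strand the site's remaining
            // invalidations until they are detected as stale and retried.
        }
    }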

@tsteur commented on November 17th 2021 Member

@EreMaijala Not sure if it helps, but I was thinking something like this might work:

diff --git a/core/CronArchive.php b/core/CronArchive.php
index bbdd8a691..a48e1098c 100644
--- a/core/CronArchive.php
+++ b/core/CronArchive.php
@@ -389,6 +389,13 @@ class CronArchive

             $successCount = $this->launchArchivingFor($archivesToProcess, $queueConsumer);
             $numArchivesFinished += $successCount;
+
+
+            if ($this->maxArchivesToProcess && $numArchivesFinished >= $this->maxArchivesToProcess) {
+                $this->logger->info("Maximum number of archives to process per execution has been reached.");
+                break;
+            }
+
         }

It's not the full code, as maxArchivesToProcess isn't set anywhere. I also haven't tested it.
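
For illustration, a hedged sketch of that missing piece, assuming a plain public property populated from a CLI option. The property name matches the diff; the wiring shown is illustrative, not the merged code.

    <?php
    // Hypothetical sketch: give CronArchive a maxArchivesToProcess
    // property and feed it from the command line before the run.
    class CronArchive
    {
        /** @var int 0 means "no limit" */
        public $maxArchivesToProcess = 0;
    }

    $archiver = new CronArchive();
    // e.g. fed from a console option such as --max-archives-to-process:
    $archiver->maxArchivesToProcess = (int) ($argv[1] ?? 0);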

@EreMaijala commented on November 18th 2021 Contributor

@tsteur I believe that's what I initially tried, but it resulted in the situation I described, where execution can stop in the middle of a site and the next execution then starts from the next site. As far as I can see, the archiver will detect the skipped ones after a day (with the message "Found X failed jobs (ts_invalidated older than 1 day), resetting status to try them again."), but that means their processing will be delayed, and they could be skipped again if processing stops at just the wrong moment.

To illustrate this, I added some output and a hard-coded limit of 10 archives:

            $successCount = $this->launchArchivingFor($archivesToProcess, $queueConsumer);
            $numArchivesFinished += $successCount;
            echo "Archives finished: $numArchivesFinished\nLast processed:\n";
            foreach ($archivesToProcess as $current) {
                echo "Site: {$current['idsite']}, archive: {$current['idarchive']}, invalidation: {$current['idinvalidation']}, period: {$current['period']}\n";
            }
            if ($numArchivesFinished >= 10) {
                echo "STOP\n";
                break;
            }

Results from first execution:

INFO [2021-11-18 10:27:52] 61920  START
INFO [2021-11-18 10:27:52] 61920  Starting Matomo reports archiving...
INFO [2021-11-18 10:27:52] 61920  Start processing archives for site 423.
INFO [2021-11-18 10:27:53] 61920  Archived website id 423, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.822s
Archives finished: 1
Last processed:
Site: 423, archive: , invalidation: 1765, period: 2
INFO [2021-11-18 10:27:53] 61920  Archived website id 423, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.584s
Archives finished: 2
Last processed:
Site: 423, archive: , invalidation: 1766, period: 3
INFO [2021-11-18 10:27:54] 61920  Archived website id 423, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.570s
Archives finished: 3
Last processed:
Site: 423, archive: , invalidation: 1768, period: 4
INFO [2021-11-18 10:27:54] 61920  Finished archiving for site 423, 3 API requests, Time elapsed: 2.445s [1 / 580 done]
INFO [2021-11-18 10:27:54] 61920  Start processing archives for site 424.
INFO [2021-11-18 10:27:55] 61920  Archived website id 424, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.805s
Archives finished: 4
Last processed:
Site: 424, archive: , invalidation: 1769, period: 2
INFO [2021-11-18 10:27:56] 61920  Archived website id 424, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.574s
Archives finished: 5
Last processed:
Site: 424, archive: , invalidation: 1770, period: 3
INFO [2021-11-18 10:27:57] 61920  Archived website id 424, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.586s
Archives finished: 6
Last processed:
Site: 424, archive: , invalidation: 1772, period: 4
INFO [2021-11-18 10:27:57] 61920  Finished archiving for site 424, 3 API requests, Time elapsed: 2.413s [2 / 580 done]
INFO [2021-11-18 10:27:57] 61920  Start processing archives for site 425.
INFO [2021-11-18 10:27:58] 61920  Archived website id 425, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.820s
Archives finished: 7
Last processed:
Site: 425, archive: , invalidation: 1773, period: 2
INFO [2021-11-18 10:27:58] 61920  Archived website id 425, period = month, date = 2021-11-01, segment = '', 1 visits found. Time elapsed: 0.574s
Archives finished: 8
Last processed:
Site: 425, archive: , invalidation: 1774, period: 3
INFO [2021-11-18 10:27:59] 61920  Archived website id 425, period = year, date = 2021-01-01, segment = '', 1 visits found. Time elapsed: 0.582s
Archives finished: 9
Last processed:
Site: 425, archive: , invalidation: 1776, period: 4
INFO [2021-11-18 10:27:59] 61920  Finished archiving for site 425, 3 API requests, Time elapsed: 2.421s [3 / 580 done]
INFO [2021-11-18 10:27:59] 61920  Start processing archives for site 426.
INFO [2021-11-18 10:28:00] 61920  Archived website id 426, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.821s
Archives finished: 10
Last processed:
Site: 426, archive: , invalidation: 1777, period: 2
STOP
INFO [2021-11-18 10:28:00] 61920  Done archiving!

Results from second execution:

INFO [2021-11-18 10:28:10] 61981  START
INFO [2021-11-18 10:28:10] 61981  Starting Matomo reports archiving...
INFO [2021-11-18 10:28:10] 61981  Start processing archives for site 427.
INFO [2021-11-18 10:28:11] 61981  Archived website id 427, period = week, date = 2021-11-15, segment = '', 1 visits found. Time elapsed: 0.832s
Archives finished: 1
Last processed:
Site: 427, archive: , invalidation: 1781, period: 2
[...]

It starts from the next site (427) instead of finishing periods 3 and 4 for site 426. This is the reason I made it stop after processing all archives for a site.

That said, it would be possible to do the check only when getNextArchivesToProcess returns an empty array, which indicates that archiving for the current website has completed. That, however, would make the stopping point inexact, more like 'stop after processing at least X archives', which would make it not so different from the max-sites limit, but it might be easier to use when large batches of invalidations only turn up occasionally. I'll add that as an option, and you can then decide whether to drop either one. For us it doesn't make much difference, as we can usually process something like 3000 invalidations before memory_limit is exceeded.
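
As a rough sketch of that boundary-check variant, with simplified stand-ins for the real queue consumer and launchArchivingFor:

    <?php
    // The limit is only tested once a site's archives are exhausted (the
    // point where the real getNextArchivesToProcess() would return an
    // empty array), giving the "at least X archives" semantics: with a
    // limit of 2, all 3 archives of site 426 still complete before the
    // loop stops.
    $sites = [
        426 => ['week', 'month', 'year'],
        427 => ['week', 'month', 'year'],
    ];
    $maxArchivesToProcess = 2; // illustrative soft limit
    $numArchivesFinished = 0;

    foreach ($sites as $idSite => $periods) {
        foreach ($periods as $period) {
            $numArchivesFinished++; // stand-in for launchArchivingFor()
            echo "Archived site $idSite, period $period\n";
        }
        // Site boundary reached; only now is the limit checked:
        if ($maxArchivesToProcess && $numArchivesFinished >= $maxArchivesToProcess) {
            echo "Processed at least $maxArchivesToProcess archives, stopping.\n";
            break;
        }
    }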

@EreMaijala commented on November 18th 2021 Contributor

@tsteur New option added. I think I'd prefer to keep both options, since that allows specifying both and stopping when either limit is reached. I also considered naming the option something other than max-archives-to-process, but decided against it, since it could become a hard maximum if the queue processing changes in the future.

@tsteur commented on November 18th 2021 Member

That, however, would make the stopping point inexact, more like 'stop after processing at least X archives', which would make it not so different from the max-sites limit, but it might be easier to use when large batches of invalidations only turn up occasionally.

I reckon it wouldn't be a big issue, since most of the time it's only launching about 3 archives in parallel.

I believe that's what I initially tried, but it resulted in the situation I described, where execution can stop in the middle of a site and the next execution then starts from the next site. ... It starts from the next site (427) instead of finishing periods 3 and 4 for site 426. This is the reason I made it stop after processing all archives for a site.

Indeed. It would continue with the next sites, then, once all the other sites have been worked on, start again from the beginning and process the ones that haven't been processed yet. Depending on how quick the archiving is and how many sites there are, it would get back to the aborted site in, say, 2 hours or maybe later. It should usually not be an issue, but it depends on the situation.

@EreMaijala commented on November 19th 2021 Contributor

@tsteur Ah, I'd missed that part and thought it would always take 24 hours to resume. It's not that bad then, sorry for the confusion. I'm still not convinced it's no problem at all, though, as you could get somewhat wonky results when looking at analytics for a site that's only partially archived, no? Obviously the data simply hasn't been processed if it's not available at all, but if you have, say, day and week but no month or year for several hours, it could be confusing.

@tsteur commented on November 21st 2021 Member

@EreMaijala It totally depends on the users of Matomo. For us it's not an issue; on our own Matomo we even calculate the month and year periods less often, for example, as having them up to date is not as important. If the year's data is, say, 2-4 hours behind, out of up to 8800 hours, then it's not too big of an issue. But if you were to sum up every week and day manually and perhaps come to a different result than the month period shows, that could be confusing.

@EreMaijala commented on November 22nd 2021 Contributor

@tsteur Fair enough. I've now updated this to make --max-archives-to-process a hard limit while keeping --max-websites-to-process as well. I believe this allows enough flexibility for most use cases.
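
To contrast with the earlier boundary-check sketch, a minimal, hypothetical illustration of the hard-limit semantics (stand-in data and names, not the actual loop): the check now runs after every batch, so the run may stop mid-site, while --max-websites-to-process still stops at site boundaries.

    <?php
    // With a hard limit of 2, the loop stops after site 426's "month"
    // archive; the leftover periods are picked up by a later run.
    $batches = [[426, 'week'], [426, 'month'], [426, 'year'], [427, 'week']];
    $maxArchivesToProcess = 2; // hard limit
    $numArchivesFinished = 0;

    foreach ($batches as [$idSite, $period]) {
        $numArchivesFinished++; // stand-in for launchArchivingFor()
        echo "Archived site $idSite, period $period\n";
        if ($maxArchivesToProcess && $numArchivesFinished >= $maxArchivesToProcess) {
            echo "Hard limit of $maxArchivesToProcess archives reached.\n";
            break; // may stop mid-site; leftovers are retried later
        }
    }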

@bx80 commented on November 24th 2021 Contributor

This all looks good to me and works well in testing, thanks @EreMaijala for adding both options. @tsteur Okay to merge?

@tsteur commented on November 24th 2021 Member

Yes, all good to merge. Technically we might want to create an FAQ for it. Maybe it needs to be mentioned at least on https://matomo.org/docs/setup-auto-archiving/

@bx80 commented on November 25th 2021 Contributor

I've updated the core:archive help output in the setup auto archiving user guide and also created a draft FAQ explaining how these options work and where they should be added.

@EreMaijala commented on November 25th 2021 Contributor

Thank you all!

This Pull Request was closed on November 25th 2021