
Make cli archiving faster if only one concurrent request per website is configured #18157

Merged
tsteur merged 11 commits into 4.x-dev from synclci on Oct 18, 2021

Conversation

tsteur
Member

@tsteur tsteur commented Oct 14, 2021

Description:

This PR is based on #18152 which we'll want to merge first.

Previously, we executed all CLI commands asynchronously. This comes with a big overhead: we need to constantly check the list of processes, work with pid files, write output to disk, etc. On one of our systems we noticed this can add close to 40% CPU overhead to the archiving. That's probably a bit over-reported and more realistically maybe 10-20%, but it definitely had a big impact. However, because a lot of archiving time is spent waiting for the dedicated DB, the percentage of CPU spent on this overhead might be higher after all.

When core:archive runs with --concurrent-requests-per-website=1 then there's actually no need to run a command asynchronously. Instead we can simply run the command synchronously, meaning we won't need to constantly check whether the process is still running, sleep for a bit, and check again. Instead we are notified immediately when the command finishes. We also notice immediately and 100% accurately if a command is aborted, and we avoid writing the result to disk, using pid files, etc.
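The difference between the two execution modes can be sketched as follows (Python for illustration only; Matomo's CliMulti is PHP, and the function names here are mine, not the codebase's):

```python
import subprocess
import time

def run_async_polling(cmd, poll_interval=0.1):
    """The old pattern: detach the process, then repeatedly poll its
    status, sleeping between checks (the overhead described above)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    while proc.poll() is None:     # is the process still running?
        time.sleep(poll_interval)  # sleep, then check again
    return proc.returncode, proc.stdout.read()

def run_sync(cmd):
    """The new pattern for one concurrent request: block until the
    command exits; we learn the result the moment it finishes, with
    no pid files and no polling loop."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout
```

In the synchronous variant the operating system wakes us exactly when the child exits, which is why the polling, the sleep interval tuning, and the on-disk bookkeeping all disappear.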

This is not a massive improvement for a regular system, but for systems where heaps of archivers launch constantly it can make quite a difference.

Technically, in the future this could even let more systems use CLI archiving, as we wouldn't need the awk and ps commands to work anymore. However, changing that would be a bit more complicated, and the impact is limited because this only applies when using --concurrent-requests-per-website=1, while the default value is 3. So by default nobody will benefit from this improvement unless --concurrent-requests-per-website=1 is set. I wouldn't advise setting --concurrent-requests-per-website=1 just for this performance improvement if you previously ran multiple requests in parallel. It's mostly a nice performance benefit if you're already using --concurrent-requests-per-website=1 for other reasons anyway.

Added some tests for the StaticOutput class. There are already tests for CliMulti when using the CLI with only one concurrent archiver, so I didn't add any other ones.

Review

@tsteur tsteur added this to the 4.6.0 milestone Oct 14, 2021
@@ -162,48 +165,67 @@ private function start($piwikUrls)

if ($shouldStart === Request::ABORT) {
// output is needed to ensure same order of url to response
-            $output = new Output($cmdId);
+            $output = new StaticOutput($cmdId);
Member Author

it was actually not necessary to write this to disk; instead we can store it in memory

} else {
+            $output = new StaticOutput($cmdId);
Member Author

same here, there was never a need to write this to disk. Now that we have StaticOutput we don't need to use the regular Output anymore (which would ideally be named something like FileOutput, but I wasn't sure if it's used anywhere else)
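The idea behind StaticOutput can be sketched like this (Python for illustration; the method names mirror what a file-backed output would typically offer and are assumptions, not Matomo's actual API):

```python
class StaticOutput:
    """Holds a command's output in memory, keyed by command id,
    instead of writing it to a file on disk."""

    def __init__(self, cmd_id):
        self.cmd_id = cmd_id
        self._content = None

    def write(self, content):
        # Store the result in memory; no file I/O involved.
        self._content = "" if content is None else str(content)

    def exists(self):
        # Replaces a file_exists() check on a path.
        return self._content is not None

    def get(self):
        return self._content

    def destroy(self):
        # Replaces deleting the output file from disk.
        self._content = None
```

Because the object lives only for the duration of the request batch, there is also nothing left behind on disk to clean up if the archiver is aborted.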

@@ -415,7 +452,9 @@ private function requestUrls(array $piwikUrls)
$elapsed = time() - $startTime;
$timeToWait = $this->getTimeToWaitBeforeNextCheck($elapsed);

-            usleep($timeToWait);
+            if (!count($this->processes)) {
+                usleep($timeToWait);
Member Author

when we use HTTP or syncCli we no longer need to wait, as at this point we already have the result and the command is finished
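A minimal Python illustration of that change (the names are mine, not Matomo's; note PHP's usleep takes microseconds while Python's time.sleep takes seconds):

```python
import time

def maybe_wait(pending_processes, time_to_wait):
    """Sleep before the next status check only when detached processes
    still need polling; with synchronous CLI or HTTP requests the result
    is already known at this point, so waiting would be pure overhead."""
    if pending_processes:
        time.sleep(time_to_wait)  # wait, then poll the processes again
        return True               # we waited
    return False                  # result already in hand, no wait
```

This is why the guard around the sleep removes a whole sleep-and-recheck cycle per request in the synchronous case.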

@tsteur tsteur changed the base branch from cache_is_supported to 4.x-dev October 14, 2021 21:55
@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Oct 14, 2021
@tsteur tsteur added the Needs Review PRs that need a code review label Oct 15, 2021
Member
@sgiehl sgiehl left a comment

Had a look through the code changes. Merged in the latest changes from 4.x-dev. If tests are passing, I guess this should be good to merge.

@tsteur tsteur merged commit b56a767 into 4.x-dev Oct 18, 2021
@tsteur tsteur deleted the synclci branch October 18, 2021 18:53