Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Archive.php archive invalidation to Loader… #15616

Merged
merged 19 commits into from Mar 10, 2020

Conversation

diosmosis
Copy link
Member

… so we only invalidate when about to launch archiving.

Not sure if there are tests I can add for this just yet.

@diosmosis diosmosis added the Needs Review PRs that need a code review label Feb 24, 2020
@diosmosis diosmosis added this to the 3.13.4 milestone Feb 24, 2020
@sgiehl
Copy link
Member

sgiehl commented Feb 24, 2020

Seems a test is failing now. Otherwise guess looks good to merge

@tsteur
Copy link
Member

tsteur commented Feb 24, 2020

Here's some thoughts maybe that would likely improve our problem but not really 100% sure if it makes sense and would work @diosmosis

diff --git a/core/ArchiveProcessor/Loader.php b/core/ArchiveProcessor/Loader.php
index c99dd3d02e..d6f896fee6 100644
--- a/core/ArchiveProcessor/Loader.php
+++ b/core/ArchiveProcessor/Loader.php
@@ -79,15 +79,30 @@ class Loader
     {
         $this->params->setRequestedPlugin($pluginName);
 
+        // no valid archive exists, make sure to invalidate existing archives
+        if (($isBrowserArchiveEnabled || $isArchiveTriggered) && $this->invalidateBeforeArchiving) {
+            $this->invalidatedReportsIfNeeded(); // SHOULD INVALIDATE ALL REPORTS
+        }
+
         list($idArchive, $visits, $visitsConverted) = $this->loadExistingArchiveIdFromDb();
         if (!empty($idArchive)) {
+            // IF REPORT INCLUDES TODAY....
+                // IF ARCHIVE WAS GENERATED WITHIN LAST time_before_today_archive_considered_outdated SECONDS DO NOT RUN
+                // INVALIDATE LOGIC SINCE REPORT WOULD NOT BE REGENERATED ANYWAY
+
+                // IF ARCHIVE WAS GENERATED MORE THAN time_before_today_archive_considered_outdated SECONDS AGO,
+                // RUN INVALIDATION LOGIC AND CALL loadExistingArchiveIdFromDb() again in case it is no longer valid
+
+            // IF REPORT NOT INCLUDES TODAY
+                // IDEALLY WE INVALIDATE REPORTS ONLY THAT AFFECT THAT PERIOD TO NOT SLOW DOWN EG RANGE REQUESTS
+
             return $idArchive;
         }
 
-        // no valid archive exists, make sure to invalidate existing archives
-        if ($this->invalidateBeforeArchiving) {
-            $this->invalidatedReportsIfNeeded();
-        }
+        // NO NEED TO INVALIDATE IF THE ARCHIVE DOES NOT EXIST SINCE IT WOULDN'T MAKE A DIFFERENCE TO THE RESULT. MEANING
+        // IF WE DID INVALIDATE REPORTS, THEN THE OUTPUT BE STILL THE SAME. EG THIS WAY WE AVOID INVALIDATING REPORTS WHEN
+        // RANGE IS SELECTED AND THE REPORT DOES NOT EXIST YET. AS A RESULT OF THIS WE WOULD ALSO NOT INVALIDATE AND
+        // REARCHIVE ANY SUBPERIODS
 
         /** @var ArchivingStatus $archivingStatus */
         $archivingStatus = StaticContainer::get(ArchivingStatus::class);

Generally speaking eg on Cloud:

  • Browser archiving and segment archiving is disabled so only range periods are causing us trouble and need to ideally avoid invalidating reports for range dates when the archive doesn't exist or when it was created recently
  • The problem with invalidation of reports is that it's fairly slow, plus causes the tracker cache general to be emptied which we want to avoid whenever possible.

While many other users have browser archiving disabled, segment archiving is still enabled etc. For these users it be great to invalidate all reports only when isArchivePhpTriggered(). In all other cases to keep UI fast and not invalidate tracker caches too often we should only invalidate reports when needed as well see comments.

When browser archiving is enabled, it'll be a bit of a problem since we often have to invalidate reports in order to get accurate results unless some report was generated recently (say within 900s if default config is used). This might be still fine though. It'll just make Matomo even slower but maybe not possible to avoid it until we are in Matomo 4.

@diosmosis
Copy link
Member Author

@tsteur I tried making changes to handle every possible case for when archiving, can you take a look? I haven't modified tests yet, you can probably just look at the latest commit.

if (!isset($result['idarchive'])
|| !isset($result['nb_visits'])
) {
return [false, false, false, true];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw here it returned for me isAnyArchiveExists = true even though there was no archive with a done flag. Only an archive with a nb_visits. This can happen if eg the archiving fails before the done flag was written but after nb_visits was written. Not sure it's important.

core/DataAccess/ArchiveSelector.php Show resolved Hide resolved
@diosmosis diosmosis force-pushed the archive-php-invalidation-opt branch from 0117b39 to b6de30f Compare March 1, 2020 07:05
@mattab mattab added the c: Performance For when we could improve the performance / speed of Matomo. label Mar 1, 2020
@diosmosis diosmosis force-pushed the archive-php-invalidation-opt branch 2 times, most recently from 5deb1ac to 052d824 Compare March 1, 2020 12:42
@diosmosis diosmosis force-pushed the archive-php-invalidation-opt branch from 052d824 to 2dd4958 Compare March 2, 2020 12:33
@diosmosis
Copy link
Member Author

diosmosis commented Mar 3, 2020

@tsteur noticed an issue w/ this change, if we invalidate only if there is something to invalidate, we may end up archiving more than once in some situations. For example:

  • archive for date 2012-02-02 and site 1 does not exist
  • tracking request comes in for that date & site, and pair is added to option list
  • archive for date 2012-02-02 and site 1 is requested
  • there is no existing archive, so nothing is invalidated
  • archiving is initiated for the day 2012-02-02, site 1
  • another archive request comes in for 2012-02-02, site 1
  • there is now an existing archive, and the date/site is in the list of pairs to archive, so we invalidate it
  • archiving is run again for 2012-02-02, site 1

We can't remove the date/site in the list in this case since it could also invalidate other archives that are not currently being requested.

The only way to do this change would be to run the invalidate query as a SELECT first then invalidate if it returns something, but that would of course be pointless.

NOTE: this only affects browser archiving, btw. Also with the 'today' change it may end up being a bit of an edge case... only other issue is that some archives may not be invalidated if they are never requested. Which would only be an issue if the options in the list are removed. We can have a daily task in that case to just run the invalidation I guess.

@tsteur
Copy link
Member

tsteur commented Mar 3, 2020

Which would only be an issue if the options in the list are removed. We can have a daily task in that case to just run the invalidation I guess.

Daily task for this should work. 👍

@tsteur
Copy link
Member

tsteur commented Mar 3, 2020

Or maybe at the beginning of the archiving we still need to invalidate all reports like we used to for cron archiving. For browser archiving handling would need a task indeed.

@diosmosis
Copy link
Member Author

Or maybe at the beginning of the archiving we still need to invalidate all reports like we used to for cron archiving.

We still do this, I haven't changed that (I disabled the Archive.php code for invalidation if browser archiving is enabled, so it's all done by core:archive, which I think is better IMO).

@diosmosis
Copy link
Member Author

@tsteur tests should pass now, ready for a review. I may add some more tests in the meantime.

@@ -81,6 +88,17 @@ public function schedule()
$this->scheduleTrackingCodeReminderChecks();
}

public function invalidateOutdatedArchives()
{
if (false && !Rules::isBrowserTriggerEnabled()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the false is probably debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is still there @diosmosis ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, must've missed it? not sure how

@@ -242,4 +271,33 @@ private function getIdSitesToArchiveWhenNoVisits()

return $cache->fetch($cacheKey);
}

private function invalidatedReportsIfNeeded()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis might be good to add some tests for this maybe? Like it could return the invalidated date/siteIds which we then can test against?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests

$isLast = preg_match('/^last([0-9]+)/', $date, $matches);

// don't do this check for a single period that includes today
if ($isTodayIncluded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis not 100% understanding this part. today would be mostly included right? I understand that we can assume basically there is no valid archive for today because we archive each time anyway. I'm just wondering if it would still need to process below code in case

  • Cron Archive starts
  • The default date is last2
  • We invalidate all reports
  • We are about to archive a site
  • This logic runs
  • Now if previously some reports from 10 days ago are invalidated, we would need to change it to last10 basically

Not sure it's clear what I mean? Or maybe this is not needed anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here was if it was a single period like, period = week, day = today, then we don't check for an existing usable archive. If it is for a date in the past, then we check if there is a usable archive and if so we don't need to archive.

For lastN values, however, it is more complicated. Here, we want to check for every date except today. So we check for valid archives. If the date is last5 and for the 5th, 4th and 3rd day we have valid archives, then we change it to last2. The only difference is if we have valid archives for every day, we still do last1 (I think the min is last1 now but we can change it back to last2).

Maybe I can add some tests for this code too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests, maybe you can better see what it's trying to do

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fully getting it now 👍

$newDate = 'last' . min($lastNValue, $newRangePeriod->getNumberOfSubperiods());
} else if ($isTodayIncluded) {
$isThereArchiveForAllPeriods = false;
$newDate = 'last1';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw not sure but might still need to use last2 here to workaround some archiving problems where it would maybe create a range date for today and some plugins may have trouble with it. I suppose maybe shouldn't hurt to use last2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this too, but I wasn't sure... I'll change it just in case.

} else if ($isTodayIncluded) {
$isThereArchiveForAllPeriods = false;
$newDate = 'last1';
}
Copy link
Member

@tsteur tsteur Mar 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw... I was thinking... if it is last1 (aka today, this week, this month), then we could technically get the date when this archive was last generated, and if it was generated less than time_before_(today|$period)_archive_considered_outdated seconds ago, return $isThereArchiveForAllPeriods = true and skip the archive this time.

Only if it's easy to do though. Otherwise could also just launch the archiving. On cloud where we consider archives outdated sometimes only evert 6, 12, 18 hours (depending on config), then it could save us launching quite a few archive launches.

Not sure it's clear what I mean? It be probably similar to what we have in browser archiving.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already done, regardless of whether browser archiving is used or not. In Loader.php we always use the result of getMinTimeArchiveProcessed(). So it should always look for an archive that is out of date, correct? I could maybe add a debug log for the future...

@tsteur
Copy link
Member

tsteur commented Mar 9, 2020

👍 left another comment @diosmosis few tests are still failing but I suppose you still work in it. eg https://travis-ci.org/matomo-org/matomo/jobs/659112026?utm_medium=notification&utm_source=github_status

@diosmosis
Copy link
Member Author

@tsteur fixed some tests, there are some unrelated failures though

@@ -85,6 +85,10 @@ private function getStackTrace($exception)

public static function getWholeBacktrace(\Exception $exception, $shouldPrintBacktrace = true)
{
if (!$shouldPrintBacktrace) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diosmosis should this one still be needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, otherwise the backtrace is always printed. This was an unrelated bug I found.

@diosmosis diosmosis merged commit 22cde64 into 3.x-dev Mar 10, 2020
@diosmosis diosmosis deleted the archive-php-invalidation-opt branch March 10, 2020 00:21
jonasgrilleres pushed a commit to 1024pix/pix-analytics that referenced this pull request Sep 22, 2020
* Move Archive.php archive invalidation to Loader so we only invalidate when about to launch archiving.

* Attempt to handle more cases when invalidating before launching archiving.

* fix possible sql error

* fix possible error

* fixing some tests

* remove test code

* Only invalidate specific archive being requested.

* Do not invalidate on today in tracker and avoid existing valid archive check in CronArchive.

* more test fixes

* Attempt to fix more tests.

* Fixing last tests.

* another test fix

* Invalidate in scheduled task if browser triggered archiving is enabled.

* deal with TODO

* Get ArchiveSelectorTest to pass.

* applying review feedback including new tests

* apply review feedback & fix tests

* fix couple more tests

Co-authored-by: Thomas Steur <tsteur@users.noreply.github.com>
jbuget pushed a commit to 1024pix/pix-analytics that referenced this pull request Sep 26, 2020
* Move Archive.php archive invalidation to Loader so we only invalidate when about to launch archiving.

* Attempt to handle more cases when invalidating before launching archiving.

* fix possible sql error

* fix possible error

* fixing some tests

* remove test code

* Only invalidate specific archive being requested.

* Do not invalidate on today in tracker and avoid existing valid archive check in CronArchive.

* more test fixes

* Attempt to fix more tests.

* Fixing last tests.

* another test fix

* Invalidate in scheduled task if browser triggered archiving is enabled.

* deal with TODO

* Get ArchiveSelectorTest to pass.

* applying review feedback including new tests

* apply review feedback & fix tests

* fix couple more tests

Co-authored-by: Thomas Steur <tsteur@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Performance For when we could improve the performance / speed of Matomo. Needs Review PRs that need a code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants