@diosmosis opened this Pull Request on February 24th 2020 Member

… so we only invalidate when about to launch archiving.

Not sure if there are tests I can add for this just yet.

@sgiehl commented on February 24th 2020 Member

Seems a test is failing now. Otherwise guess looks good to merge

@tsteur commented on February 24th 2020 Member

Here's some thoughts maybe that would likely improve our problem but not really 100% sure if it makes sense and would work @diosmosis

diff --git a/core/ArchiveProcessor/Loader.php b/core/ArchiveProcessor/Loader.php
index c99dd3d02e..d6f896fee6 100644
--- a/core/ArchiveProcessor/Loader.php
+++ b/core/ArchiveProcessor/Loader.php
@@ -79,15 +79,30 @@ class Loader
     {
         $this->params->setRequestedPlugin($pluginName);

+        // no valid archive exists, make sure to invalidate existing archives
+        if (($isBrowserArchiveEnabled || $isArchiveTriggered) && $this->invalidateBeforeArchiving) {
+            $this->invalidatedReportsIfNeeded(); // SHOULD INVALIDATE ALL REPORTS
+        }
+
         list($idArchive, $visits, $visitsConverted) = $this->loadExistingArchiveIdFromDb();
         if (!empty($idArchive)) {
+            // IF REPORT INCLUDES TODAY....
+                // IF ARCHIVE WAS GENERATED WITHIN LAST time_before_today_archive_considered_outdated SECONDS DO NOT RUN
+                // INVALIDATE LOGIC SINCE REPORT WOULD NOT BE REGENERATED ANYWAY
+
+                // IF ARCHIVE WAS GENERATED MORE THAN time_before_today_archive_considered_outdated SECONDS AGO,
+                // RUN INVALIDATION LOGIC AND CALL loadExistingArchiveIdFromDb() again in case it is no longer valid
+
+            // IF REPORT NOT INCLUDES TODAY
+                // IDEALLY WE INVALIDATE REPORTS ONLY THAT AFFECT THAT PERIOD TO NOT SLOW DOWN EG RANGE REQUESTS
+
             return $idArchive;
         }

-        // no valid archive exists, make sure to invalidate existing archives
-        if ($this->invalidateBeforeArchiving) {
-            $this->invalidatedReportsIfNeeded();
-        }
+        // NO NEED TO INVALIDATE IF THE ARCHIVE DOES NOT EXIST SINCE IT WOULDN'T MAKE A DIFFERENCE TO THE RESULT. MEANING
+        // IF WE DID INVALIDATE REPORTS, THEN THE OUTPUT BE STILL THE SAME. EG THIS WAY WE AVOID INVALIDATING REPORTS WHEN
+        // RANGE IS SELECTED AND THE REPORT DOES NOT EXIST YET. AS A RESULT OF THIS WE WOULD ALSO NOT INVALIDATE AND
+        // REARCHIVE ANY SUBPERIODS

         /** <a class='mention' href='https://github.com/var'>@var</a> ArchivingStatus $archivingStatus */
         $archivingStatus = StaticContainer::get(ArchivingStatus::class);

Generally speaking eg on Cloud:

  • Browser archiving and segment archiving is disabled so only range periods are causing us trouble and need to ideally avoid invalidating reports for range dates when the archive doesn't exist or when it was created recently
  • The problem with invalidation of reports is that it's fairly slow, plus causes the tracker cache general to be emptied which we want to avoid whenever possible.

While many other users have browser archiving disabled, segment archiving is still enabled etc. For these users it be great to invalidate all reports only when isArchivePhpTriggered(). In all other cases to keep UI fast and not invalidate tracker caches too often we should only invalidate reports when needed as well see comments.

When browser archiving is enabled, it'll be a bit of a problem since we often have to invalidate reports in order to get accurate results unless some report was generated recently (say within 900s if default config is used). This might be still fine though. It'll just make Matomo even slower but maybe not possible to avoid it until we are in Matomo 4.

@diosmosis commented on February 25th 2020 Member

@tsteur I tried making changes to handle every possible case for when archiving, can you take a look? I haven't modified tests yet, you can probably just look at the latest commit.

@diosmosis commented on March 3rd 2020 Member

@tsteur noticed an issue w/ this change, if we invalidate only if there is something to invalidate, we may end up archiving more than once in some situations. For example:

  • archive for date 2012-02-02 and site 1 does not exist
  • tracking request comes in for that date & site, and pair is added to option list
  • archive for date 2012-02-02 and site 1 is requested
  • there is no existing archive, so nothing is invalidated
  • archiving is initiated for the day 2012-02-02, site 1
  • another archive request comes in for 2012-02-02, site 1
  • there is now an existing archive, and the date/site is in the list of pairs to archive, so we invalidate it
  • archiving is run again for 2012-02-02, site 1

We can't remove the date/site in the list in this case since it could also invalidate other archives that are not currently being requested.

The only way to do this change would be to run the invalidate query as a SELECT first then invalidate if it returns something, but that would of course be pointless.

NOTE: this only affects browser archiving, btw. Also with the 'today' change it may end up being a bit of an edge case... only other issue is that some archives may not be invalidated if they are never requested. Which would only be an issue if the options in the list are removed. We can have a daily task in that case to just run the invalidation I guess.

@tsteur commented on March 3rd 2020 Member

Which would only be an issue if the options in the list are removed. We can have a daily task in that case to just run the invalidation I guess.

Daily task for this should work. 👍

@tsteur commented on March 3rd 2020 Member

Or maybe at the beginning of the archiving we still need to invalidate all reports like we used to for cron archiving. For browser archiving handling would need a task indeed.

@diosmosis commented on March 3rd 2020 Member

Or maybe at the beginning of the archiving we still need to invalidate all reports like we used to for cron archiving.

We still do this, I haven't changed that (I disabled the Archive.php code for invalidation if browser archiving is enabled, so it's all done by core:archive, which I think is better IMO).

@diosmosis commented on March 4th 2020 Member

@tsteur tests should pass now, ready for a review. I may add some more tests in the meantime.

@tsteur commented on March 9th 2020 Member

👍 left another comment @diosmosis few tests are still failing but I suppose you still work in it. eg https://travis-ci.org/matomo-org/matomo/jobs/659112026?utm_medium=notification&utm_source=github_status

@diosmosis commented on March 9th 2020 Member

@tsteur fixed some tests, there are some unrelated failures though

This Pull Request was closed on March 10th 2020
Powered by GitHub Issue Mirror