
Some archiving optimizations #14091

Closed · wants to merge 13 commits

Conversation

diosmosis (Member)

WIP

Fixes #13387

diosmosis added the Pull Request WIP (indicates the current pull request is still work in progress and not ready yet for a review) and not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org) labels on Feb 12, 2019
diosmosis added this to the 3.9.0 milestone on Feb 12, 2019
diosmosis added the Needs Review (PRs that need a code review) label and removed the Pull Request WIP label on Mar 4, 2019

tsteur commented Mar 7, 2019

@diosmosis in getLatestArchiveTimestampForToday we check for the VisitsSummary name condition, and likewise in the other method where we get the ts_archived. Does this return the time of the first blob/numeric value written within an archive? We would need to make sure to check against the first created archive, because finishing an archive can take an hour or longer... In theory, even before the first archive row is written it may take a while to fetch the data for it. I suppose we can't log the time in the option table when the archiving was last launched for a specific archive (before executing any query) and compare against that?

Also, I think the PR will need similar logic in CronArchive for segments, which may not have been launched for quite a while?

diosmosis (Member, Author)

@tsteur made a comment in the original issue about the segment related change (the code already exists in CronArchive): #13387 (comment)

> Does this return the time of the first blob/numeric value written within an archive? We would need to make sure to check against the first created archive, because finishing an archive can take an hour or longer... In theory, even before the first archive row is written it may take a while to fetch the data for it. I suppose we can't log the time in the option table when the archiving was last launched for a specific archive (before executing any query) and compare against that?

Wouldn't the first created archive be the first archive of the day? That would defeat the purpose of making this change. This SQL is supposed to find the latest ts_archived value for an archive that can be used to query visits. The idea is that, from that point until 'now', if no visits are found, then we don't need to initiate archiving and can just use the existing archive. The archiving code is extremely dense, though, so I am not entirely sure this code has that effect... if you have any examples where it breaks, I would like to see them.
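
A minimal sketch of the kind of check described above, assuming Matomo's archive_numeric table layout (idsite, date1, date2, period, name, value, ts_archived) and log_visit's visit_last_action_time column; the function name, the plain-PDO style, and the numeric done-flag values are illustrative assumptions, not the PR's actual code:

```php
<?php
// Illustrative sketch only: find the newest ts_archived of a usable VisitsSummary
// day archive, then ask whether any visit was logged after that point.
function hasVisitsSinceLastArchive(\PDO $db, string $archiveTable, int $idSite, string $date): bool
{
    // 1 = DONE_OK, 3 = DONE_OK_TEMPORARY are assumed values (see ArchiveWriter).
    $sql = "SELECT MAX(ts_archived)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = 1
               AND name IN ('done', 'done.VisitsSummary')
               AND value IN (1, 3)";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date, $date]);
    $lastArchived = $stmt->fetchColumn();

    if (empty($lastArchived)) {
        return true; // never archived today, so archiving is needed anyway
    }

    // Any visit whose last action is newer than the archive means we should re-archive.
    $sql = "SELECT 1 FROM log_visit WHERE idsite = ? AND visit_last_action_time > ? LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $lastArchived]);

    return (bool) $stmt->fetchColumn();
}
```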


tsteur commented Mar 7, 2019

It wouldn't need to use the first archive of the day, I think (not 100% sure about everything though). Just the launch time of the last archiving run for a given idsite, date1, date2, period, segment.

[screenshot: archive table rows belonging to the same archive, with differing ts_archived values]

The thinking is that archiving may take several minutes or hours. If it uses the latest date, which may be 23:26:06, and there were visits between 2019-03-01 23:24:41 and 23:26:06, it may not be launched again because it thinks there was no new visit since 23:26:06?

I don't think we have that check in the code already, as https://github.com/matomo-org/matomo/blob/3.x-dev/core/CronArchive.php#L1261-L1292 is about the around-midnight case.

diosmosis (Member, Author)

> If it uses the latest date, which may be 23:26:06, and there were visits between 2019-03-01 23:24:41 and 23:26:06, it may not be launched again because it thinks there was no new visit since 23:26:06?

If the ts_archived time is 23:26:06, then wouldn't it include visits before that time (so, any visits between 23:24:41 => 23:26:06)? I.e., ts_archived is set when the archiving query finishes (AFAIK), so when we aggregate nb_visits and create that archive for today, nb_visits would include visits from the start of the day => ts_archived, right?

Of course, this doesn't apply to archiving dates in the past, just when running archiving for today.


tsteur commented Mar 8, 2019

> If the ts_archived time is 23:26:06, then wouldn't it include visits before that time

Only for some archives. See the screenshot above: the rows are all for the same archive, but some may include those visits and some may not... as it takes several minutes to hours to finish it.

diosmosis (Member, Author)

> Only for some archives. See the screenshot above: the rows are all for the same archive, but some may include those visits and some may not... as it takes several minutes to hours to finish it.

To finish all archiving or just the one archiving query? In this case we're only looking at archives for VisitsSummary / no segment, though I guess if it archives all plugins that might take more time... Actually, I'm not sure how ts_archived is even set... I will take a quick look.

diosmosis (Member, Author)

Looks like the ts_archived value is set when inserting individual archive rows, so the done flags will be set to when archiving finishes, but the individual metrics will have the ts_archived of when they were actually inserted. So maybe we can just look at the ts_archived value of the nb_visits metric? We'd get the latest done idarchives, then check the latest ts_archived of nb_visits for those idarchives. Would that work?


tsteur commented Mar 8, 2019

It should work in most cases. There may be edge cases where it also takes a few ms, seconds, or minutes to query and insert the nb_visits metric. Random thought: before querying any data, should we maybe also insert a start flag, similar to the done flag? That may be consistent and precise.

diosmosis (Member, Author)

So we add a start flag, then we look for the latest done/done-temporary archive for the site/period/etc. and get its idarchive. Then we get the ts_archived of the start flag in that archive, and check whether there were any visits between start.ts_archived -> now. Does that sound right to you?
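
A rough sketch of that two-step lookup (find the latest done archive's idarchive, then read the start flag's ts_archived). The function name, the plain-PDO style, the 'start' record name, and the done-flag values are illustrative assumptions, not the PR's actual code:

```php
<?php
// Illustrative sketch of the start-flag lookup discussed above.
function getLatestArchiveStartTime(\PDO $db, string $archiveTable, int $idSite, string $date): ?string
{
    // Step 1: latest done/done-temporary VisitsSummary archive for the day
    // (1 = DONE_OK, 3 = DONE_OK_TEMPORARY are assumed values).
    $sql = "SELECT MAX(idarchive)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = 1
               AND name IN ('done', 'done.VisitsSummary')
               AND value IN (1, 3)";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date, $date]);
    $idArchive = $stmt->fetchColumn();
    if (empty($idArchive)) {
        return null; // no finished archive yet
    }

    // Step 2: the start record is written right after initNewArchive(), before any
    // log data is queried, so its ts_archived marks when archiving was launched.
    // 'start' stands in for ArchiveWriter::ARCHIVE_START_RECORD_NAME.
    $sql = "SELECT ts_archived FROM $archiveTable WHERE idarchive = ? AND name = 'start'";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idArchive]);

    return $stmt->fetchColumn() ?: null;
}
```

The caller would then check for visits with visit_last_action_time greater than the returned timestamp, as in the earlier sketch.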


tsteur commented Mar 8, 2019

That sounds right 👍

diosmosis (Member, Author)

@tsteur Updated

core/DataAccess/Model.php (outdated review thread, resolved)

tsteur commented Mar 11, 2019

Left some comments @diosmosis

Have you looked into triggering the archiving of segments only when needed in the CronArchive?

Also wondering: have you tested what happens when a tracking request is received for an older date and that older date needs to be re-archived, maybe because the archive was invalidated?

So far I have only looked at the code and haven't tested it yet. Might be quite tricky.

diosmosis (Member, Author)

> Have you looked into triggering the archiving of segments only when needed in the CronArchive?

That's what this commit should achieve: 23b5a18

Don't think I've tested a lot, will have to do more...


tsteur commented Mar 11, 2019

> That's what this commit should achieve: 23b5a18

I mean we may have archived some segments at a later time than the regular archive, in which case we wouldn't need to issue a request for that specific segment (but would for other segments); or maybe a segment was last archived a longer time ago and we need to issue the segment request but not the regular archiving request.

It all sounds pretty tricky and I'm not sure how easily these things are doable.

diosmosis (Member, Author)

> I mean we may have archived some segments at a later time than the regular archive, in which case we wouldn't need to issue a request for that specific segment (but would for other segments); or maybe a segment was last archived a longer time ago and we need to issue the segment request but not the regular archiving request.

This might require a larger refactoring of the cron archive process... maybe we can turn it into something a bit easier to manipulate in this way? I don't expect a change like this could be done by 3.9...


tsteur commented Mar 12, 2019

It won't be doable in 3.9

Maybe we could split it into two parts though:

  1. The core archiver / selector
  2. Cron archiver not sending requests in the first place

This might also make it easier to review, and I think most or all of the logic for 1) is maybe already there?

diosmosis (Member, Author)

@tsteur they should both be there? The second one should be achieved by 23b5a18 (though there's some extra code that doesn't need to be there). Is there something off w/ it?


tsteur commented Mar 12, 2019

Sorry, I might be confused. I see you added some code for hadWebsiteTrafficSinceMidnightInTimezone, but this only takes the overall traffic into consideration.

During one archiving run we may trigger heaps of requests for different periods and segments, like:

  • Day
    • Regular request
    • Segment 1
    • Segment 2
  • Week
    • Regular request
    • Segment 1
    • Segment 2
  • Month
    • Regular request
    • Segment 1
    • Segment 2

And if I see this correctly, then we need to check, for each request, when that information was last archived before requesting the archive trigger URL, because the regular monthly archive might have been archived much longer ago than the daily archive, and there could have been visits since its last archive time. The same for segments etc.

I would basically have expected to see some new code before each $this->request($url); where we check when this particular request was last archived (based on the start flag), and whether a visit was tracked since then.

E.g. we might consider a time_before_month_archive_considered_outdated of 24 hours to archive it only once a day, meaning the last archive may have been 23 hours ago, not just 1 hour as for the regular daily archive.
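
A hypothetical shape of that per-request guard (not the PR's code; the interface, method names, and parameters below are all illustrative):

```php
<?php
// Illustrative interface standing in for the lookups discussed above; not part of Matomo.
interface ArchiveFreshnessChecker
{
    /** ts_archived of the start flag of the latest done archive for this combination, or null. */
    public function getLastArchiveStartTime(int $idSite, string $period, string $date, string $segment): ?string;

    /** True if any visit was tracked for the site since the given datetime. */
    public function hasVisitsSince(int $idSite, string $sinceDatetime): bool;
}

// Hypothetical guard evaluated before each archiving request in CronArchive.
function shouldSkipArchiveRequest(ArchiveFreshnessChecker $checker, int $idSite,
                                  string $period, string $date, string $segment): bool
{
    $lastStart = $checker->getLastArchiveStartTime($idSite, $period, $date, $segment);
    if ($lastStart === null) {
        return false; // never archived, so the request must be issued
    }
    // Skip only if nothing was tracked since the last archiving run started.
    return !$checker->hasVisitsSince($idSite, $lastStart);
}

// Sketched usage inside the request loop:
//   if (shouldSkipArchiveRequest($checker, $idSite, $period, $date, $segment)) {
//       continue; // skip $this->request($url) for this site/period/date/segment
//   }
```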

diosmosis (Member, Author)

I see, so do the check I added before each request, for the specific archive in question, correct?


tsteur commented Mar 12, 2019

Yep

mattab modified the milestones: 3.9.0, 3.10.0 on Mar 18, 2019

tsteur commented Apr 11, 2019

@diosmosis left a few comments. I reckon the main thing would be to change CronArchive, as discussed, to check for new visits before each archiving request instead of just at the beginning. E.g. because a segment may have been processed more recently, or a longer time ago, which means that while we can maybe skip a daily archive, we maybe cannot skip a segment archive.

diosmosis (Member, Author)

@tsteur made the requested changes & merged w/ 3.x-dev. Not 100% sure about whether the cronarchive changes will have side effects or not

tsteur (Member) left a review

Did some basic testing and it worked, although I'm not sure about some edge cases; left some comments to discuss.

@@ -62,6 +62,7 @@ public function __construct(Parameters $params, $isTemporaryArchive, ArchiveWrit
     $this->isTemporaryArchive = $isTemporaryArchive;
     $this->archiveWriter = $archiveWriter ?: new ArchiveWriter($this->params, $this->isTemporaryArchive);
     $this->archiveWriter->initNewArchive();
+    $this->archiveWriter->insertRecord(ArchiveWriter::ARCHIVE_START_RECORD_NAME, 1);
tsteur (Member)

Not 100% sure... for segments, should we append the segment hash to this, similar to what we do for the done flag?

diosmosis (Member, Author)

It will have the idarchive for the specific done flag, so as long as we find the idarchive first, it doesn't need the suffix. That makes the code a bit simpler since we don't have to compute the hash when querying.


$nameCondition = self::getNameCondition(['VisitsSummary'], $segment);

$sql = "SELECT MAX(idarchive)
tsteur (Member)

Not too important, but I reckon if we were able to add the segment hash to the start flag, then we could maybe remove the query below, and in this query do something like select max(ts_archived) ... where ... and name = ArchiveWriter::ARCHIVE_START_RECORD_NAME . $segmentHash?
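
A sketch of what that single-query variant could look like if the segment hash were appended to the start record name (the thread below decides against the suffix; every name here is illustrative, not code from the PR):

```php
<?php
// Illustrative only: one query instead of "find idarchive, then read its start flag".
function getStartTimeViaSegmentedStartFlag(\PDO $db, string $archiveTable, int $idSite,
                                           string $date1, string $date2, int $period,
                                           string $segmentHash): ?string
{
    // 'start' stands in for ArchiveWriter::ARCHIVE_START_RECORD_NAME . $segmentHash.
    $startFlagName = 'start' . $segmentHash;

    $sql = "SELECT MAX(ts_archived)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ? AND name = ?";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date1, $date2, $period, $startFlagName]);

    return $stmt->fetchColumn() ?: null;
}
```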

diosmosis (Member, Author)

(see response in below comment)

$sql = "SELECT ts_archived
FROM $table
WHERE idarchive = ? AND name = ?";
$bind = [$idArchive, ArchiveWriter::ARCHIVE_START_RECORD_NAME];
tsteur (Member)

Initially I was wondering if we should only look at archives where the archiving finished, in case the previous archiving had an error and was interrupted (the same idarchive should then also have a done flag with status OK or temporary).

Then I realised we likely cannot do this, because the current archiving might be in progress and that would be the reason the done flag is missing. => This should not be a problem though, because the archiving would then simply be skipped, as the logic should notice that an archiver is already running for that period/date.

So what if the last archive stopped in the middle because of an error? Then we would maybe not start that archive again under some circumstances. Say the start flag is 00:15am and the archiving broke in the middle... it would maybe not start the archiving again? Maybe, to play it safe, we should only look at the ts_archived of archives that have both the start and the done flag.

Not sure I'm explaining it well.

diosmosis (Member, Author)

My understanding was that's what we do; at least that's what $nameCondition is for. I.e., it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY)), so we're only looking at done archives.

Incidentally I think this is also why we shouldn't append the suffix to the start record, since we'd have to check the done flag value anyway.

diosmosis (Member, Author)

I know we've all said it before, but the archiving process confuses the heck out of me.

tsteur (Member)

> I know we've all said it before, but the archiving process confuses the heck out of me.

me too :)

tsteur (Member)

> My understanding was that's what we do; at least that's what $nameCondition is for. I.e., it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY)), so we're only looking at done archives.

Makes sense. Just need to double-check that we also look at that in the CronArchive.

diosmosis (Member, Author)

CronArchive uses ArchiveSelector::getLatestArchiveStartTimestampForToday() which uses the name condition, so I think we're good here.

@@ -863,6 +863,13 @@ protected function processArchiveDays($idSite, $lastTimestampWebsiteProcessedDay
return false;
}

if (!$this->hasThereBeenVisitsSinceLastArchiveTimestamp($idSite, 'day', $date)) {
tsteur (Member)

Do I see this correctly: if a user has no new visit recorded since the last archive, but recorded some visits in the past, we wouldn't trigger the archiving until there is actually a new visitor with a more recent date? Do you know what I mean? We probably can't work around this, I suppose.

diosmosis (Member, Author) commented May 9, 2019

I see what you mean. I think we could take this into account by using idvisit somehow... Maybe find the maximum idvisit that is <= ts_archived, then check if there is an idvisit for that site greater than that maximum idvisit? Does that sound right?
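
A sketch of that idvisit-based check, assuming the log_visit columns idvisit and visit_last_action_time; the function and its exact use of those columns are illustrative, not code from the PR:

```php
<?php
// Illustrative only: consider the archive stale if any visit row exists with a higher
// idvisit than the highest one the previous archiving run could have seen.
function hasNewVisitSinceArchive(\PDO $db, int $idSite, string $lastArchiveStart): bool
{
    // Highest idvisit the previous archiving run could have covered.
    $sql = "SELECT MAX(idvisit) FROM log_visit
             WHERE idsite = ? AND visit_last_action_time <= ?";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $lastArchiveStart]);
    $maxSeen = (int) $stmt->fetchColumn();

    // A higher idvisit means new, unarchived data, even if that visit was tracked
    // with an older date (e.g. a tracking request for the past).
    $sql = "SELECT 1 FROM log_visit WHERE idsite = ? AND idvisit > ? LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $maxSeen]);

    return (bool) $stmt->fetchColumn();
}
```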

tsteur (Member)

Something like this could work. It might be complicated though. I wonder if #14091 (comment) would maybe make a lot of things easier. In theory the archiving logic should all be simple; after all it's not that complicated: we generate an archive and assume it's completed. Once a new tracking request comes in, we make sure to rearchive it. If no tracking request comes in again, we don't need to archive it again.

diosmosis (Member, Author)

I'll see if I can add it w/ a test.

Your other comment makes sense to me; I'm actually not sure what temporary archives are currently used for. Removing them and making the process simpler might be difficult given how complicated the system currently is; it might be good to wait until Matomo 4?

tsteur (Member)

I reckon it might not be all that complicated? And maybe a lot easier than this logic? Not sure though. @mattab any thoughts on this idea maybe?

diosmosis (Member, Author)

I'm just not too sure if it will break anything, and we also have to deal w/ the fact that existing archives may have DONE_TEMPORARY. Seems like there could be room for bugs that would be hard to track down... Maybe we can schedule it for 3.11 at least, so when it's deployed on cloud you'll be around.

tsteur requested a review from mattab on May 9, 2019 01:56

tsteur commented May 9, 2019

Random thought @mattab @diosmosis

Why don't we get rid of temporary archives and always mark archives as completed once the archiving has finished? When a new tracking request then comes in, it will automatically mark the archive as invalid for the correct periods.
So we basically know we don't need to start archiving again as long as there is still a completed archive. If there are only invalidated archives, then we need to archive. That could make the whole logic very simple, and it would even work when a request in the past is suddenly tracked: it would invalidate the correct archives and automatically archive all needed reports again.
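
A very rough sketch of that invalidation-driven rule (illustrative only; the value 4 for DONE_INVALIDATED is an assumption, and segment archives, which use done<hash> flags, are ignored here):

```php
<?php
// Illustrative only: re-archive only when there is no valid completed archive,
// relying on tracking requests to invalidate archives for the affected periods.
function needsRearchiving(\PDO $db, string $archiveTable, int $idSite,
                          string $date1, string $date2, int $period): bool
{
    // Look at the newest plain 'done' flag for this site/period/date range.
    $sql = "SELECT value
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
               AND name = 'done'
          ORDER BY ts_archived DESC
             LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date1, $date2, $period]);
    $latestDoneValue = $stmt->fetchColumn();

    // No archive at all, or the newest done flag was invalidated (assumed value 4):
    // archiving has to run again. Otherwise the completed archive can be reused.
    return $latestDoneValue === false || (int) $latestDoneValue === 4;
}
```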

diosmosis (Member, Author)

I can create an issue to get rid of the done_temporary archives, otherwise I can move this to 3.11.

mattab modified the milestones: 3.10.0, 3.11.0 on May 20, 2019
diosmosis (Member, Author)

I'll try to get rid of temporary archives in a new PR and get back to this one.


tsteur commented Sep 30, 2019

@diosmosis what's the status here?

diosmosis (Member, Author)

@tsteur closing this, we can fix this in a better way since the invalidated archive change was merged

diosmosis closed this on Sep 30, 2019
diosmosis deleted the 13387-archiver-temp-improvements branch on September 30, 2019 20:35
Labels: Needs Review (PRs that need a code review), not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org)