
Some archiving optimizations #14091

Closed · wants to merge 13 commits

Conversation

diosmosis (Member)

WIP

Fixes #13387

diosmosis added the Pull Request WIP (indicates the current pull request is still work in progress and not ready yet for a review) and not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org) labels on Feb 12, 2019
diosmosis added this to the 3.9.0 milestone on Feb 12, 2019
diosmosis added the Needs Review (PRs that need a code review) label and removed the Pull Request WIP label on Mar 4, 2019

tsteur commented Mar 7, 2019

@diosmosis in getLatestArchiveTimestampForToday we check for the VisitsSummary name condition, and likewise in the other method where we get the ts_archived. Does this return the time of the first blob/numeric value written within an archive? We would need to make sure to check against the first created archive, because finishing an archive can take an hour or longer... In theory, even before the first archive row is written it may take a while to fetch the data for it. I suppose we can't log the time in the option table when the archiving was last launched for a specific archive (before executing any query) and compare against that?

Also, I think the PR will need similar logic in CronArchive for segments, which may not have been launched for quite a while?

diosmosis (Member, Author)

@tsteur made a comment in the original issue about the segment related change (the code already exists in CronArchive): #13387 (comment)

> Does this return the time of the first blob/numeric value written within an archive? We would need to make sure to check against the first created archive, because finishing an archive can take an hour or longer... In theory, even before the first archive row is written it may take a while to fetch the data for it. I suppose we can't log the time in the option table when the archiving was last launched for a specific archive (before executing any query) and compare against that?

Wouldn't the first created archive be the first archive of the day? That would defeat the purpose of making this change. This SQL is supposed to find the latest ts_archived value for an archive that can be used to query visits. The idea is that, from that point until 'now', if no visits are found, then we don't need to initiate archiving and can just use the existing archive. The archiving code is extremely dense, though, so I am not entirely sure this code has that effect... if you have any examples where it breaks, I would like to see them.
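
A minimal sketch of the kind of check described above, assuming Matomo's archive_numeric table layout (idsite, date1, date2, period, name, value, ts_archived) and log_visit's visit_last_action_time column; the function name, the plain-PDO style, and the numeric done-flag values are illustrative assumptions, not the PR's actual code:

```php
<?php
// Illustrative sketch only: find the newest ts_archived of a usable VisitsSummary
// day archive, then ask whether any visit was logged after that point.
function hasVisitsSinceLastArchive(\PDO $db, string $archiveTable, int $idSite, string $date): bool
{
    // 1 = DONE_OK, 3 = DONE_OK_TEMPORARY are assumed values (see ArchiveWriter).
    $sql = "SELECT MAX(ts_archived)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = 1
               AND name IN ('done', 'done.VisitsSummary')
               AND value IN (1, 3)";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date, $date]);
    $lastArchived = $stmt->fetchColumn();

    if (empty($lastArchived)) {
        return true; // never archived today, so archiving is needed anyway
    }

    // Any visit whose last action is newer than the archive means we should re-archive.
    $sql = "SELECT 1 FROM log_visit WHERE idsite = ? AND visit_last_action_time > ? LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $lastArchived]);

    return (bool) $stmt->fetchColumn();
}
```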


tsteur commented Mar 7, 2019

It wouldn't need to use the first archive of the day, I think (not 100% sure about everything though). Just the launch time of the last archiving run for a given idsite, date1, date2, period, segment.

[screenshot: archive table rows belonging to the same archive, with differing ts_archived values]

The thinking is that archiving may take several minutes or hours. If it uses the latest date, which may be 23:26:06, and there were visits between 2019-03-01 23:24:41 and 23:26:06, it may not be launched again because it thinks there was no new visit since 23:26:06?

I don't think we have that check in the code already, as https://github.com/matomo-org/matomo/blob/3.x-dev/core/CronArchive.php#L1261-L1292 is about the around-midnight case.

diosmosis (Member, Author)

> If it uses the latest date, which may be 23:26:06, and there were visits between 2019-03-01 23:24:41 and 23:26:06, it may not be launched again because it thinks there was no new visit since 23:26:06?

If the ts_archived time is 23:26:06, then wouldn't it include visits before that time (so, any visits between 23:24:41 => 23:26:06)? I.e., ts_archived is set when the archiving query finishes (AFAIK), so when we aggregate nb_visits and create that archive for today, nb_visits would include visits from the start of the day => ts_archived, right?

Of course, this doesn't apply to archiving dates in the past, just when running archiving for today.


tsteur commented Mar 8, 2019

> If the ts_archived time is 23:26:06, then wouldn't it include visits before that time

Only for some archives. See the screenshot above: the rows are all for the same archive, but some may include those visits and some may not... as it takes several minutes to hours to finish it.

diosmosis (Member, Author)

> Only for some archives. See the screenshot above: the rows are all for the same archive, but some may include those visits and some may not... as it takes several minutes to hours to finish it.

To finish all archiving or just the one archiving query? In this case we're only looking at archives for VisitsSummary / no segment, though I guess if it archives all plugins that might take more time... Actually, I'm not sure how ts_archived is even set... I will take a quick look.

diosmosis (Member, Author)

Looks like the ts_archived value is set when inserting individual archive rows, so the done flags will be set to when archiving finishes, but the individual metrics will have the ts_archived of when they were actually inserted. So maybe we can just look at the ts_archived value of the nb_visits metric? We'd get the latest done idarchives, then check the latest ts_archived of nb_visits for those idarchives. Would that work?


tsteur commented Mar 8, 2019

It should work in most cases. There may be edge cases where it also takes a few ms, seconds, or minutes to query and insert the nb_visits metric. Random thought: before querying any data, should we maybe also insert a start flag, similar to the done flag? That may be consistent and precise.

diosmosis (Member, Author)

So we add a start flag, then we look for the latest done/done-temporary archive for the site/period/etc. and get its idarchive. Then we get the ts_archived of the start flag in that archive, and check whether there were any visits between start.ts_archived -> now. Does that sound right to you?
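
A rough sketch of that two-step lookup (find the latest done archive's idarchive, then read the start flag's ts_archived). The function name, the plain-PDO style, the 'start' record name, and the done-flag values are illustrative assumptions, not the PR's actual code:

```php
<?php
// Illustrative sketch of the start-flag lookup discussed above.
function getLatestArchiveStartTime(\PDO $db, string $archiveTable, int $idSite, string $date): ?string
{
    // Step 1: latest done/done-temporary VisitsSummary archive for the day
    // (1 = DONE_OK, 3 = DONE_OK_TEMPORARY are assumed values).
    $sql = "SELECT MAX(idarchive)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = 1
               AND name IN ('done', 'done.VisitsSummary')
               AND value IN (1, 3)";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date, $date]);
    $idArchive = $stmt->fetchColumn();
    if (empty($idArchive)) {
        return null; // no finished archive yet
    }

    // Step 2: the start record is written right after initNewArchive(), before any
    // log data is queried, so its ts_archived marks when archiving was launched.
    // 'start' stands in for ArchiveWriter::ARCHIVE_START_RECORD_NAME.
    $sql = "SELECT ts_archived FROM $archiveTable WHERE idarchive = ? AND name = 'start'";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idArchive]);

    return $stmt->fetchColumn() ?: null;
}
```

The caller would then check for visits with visit_last_action_time greater than the returned timestamp, as in the earlier sketch.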


tsteur commented Mar 8, 2019

That sounds right 👍

diosmosis (Member, Author)

@tsteur Updated

core/DataAccess/Model.php (outdated review thread, resolved)

tsteur commented Mar 11, 2019

Left some comments @diosmosis

Have you looked into triggering the archiving of segments only when needed in the CronArchive?

Also wondering: have you tested what happens when a tracking request is received for an older date and that older date needs to be re-archived, maybe because the archive was invalidated?

So far I have only looked at the code and haven't tested it yet. Might be quite tricky.

diosmosis (Member, Author)

> Have you looked into triggering the archiving of segments only when needed in the CronArchive?

That's what this commit should achieve: 23b5a18

Don't think I've tested a lot, will have to do more...


tsteur commented Mar 11, 2019

> That's what this commit should achieve: 23b5a18

I mean we may have archived some segments at a later time than the regular archive, in which case we wouldn't need to issue a request for that specific segment (but would for other segments); or maybe a segment was last archived a longer time ago and we need to issue the segment request but not the regular archiving request.

It all sounds pretty tricky and I'm not sure how easily these things are doable.

diosmosis (Member, Author)

> I mean we may have archived some segments at a later time than the regular archive, in which case we wouldn't need to issue a request for that specific segment (but would for other segments); or maybe a segment was last archived a longer time ago and we need to issue the segment request but not the regular archiving request.

This might require a larger refactoring of the cron archive process... maybe we can turn it into something a bit easier to manipulate in this way? I don't expect a change like this could be done by 3.9...


tsteur commented Mar 12, 2019

It won't be doable in 3.9

Maybe we could split it into two parts though:

  1. The core archiver / selector
  2. Cron archiver not sending requests in the first place

This might also make it easier to review, and I think most or all of the logic for 1) is maybe already there?

diosmosis (Member, Author)

@tsteur they should both be there? The second one should be achieved by 23b5a18 (though there's some extra code that doesn't need to be there). Is there something off w/ it?


tsteur commented Mar 12, 2019

Sorry, I might be confused. I see you added some code for hadWebsiteTrafficSinceMidnightInTimezone, but this only takes the overall traffic into consideration.

During one archiving run we may trigger heaps of requests for different periods and segments, like:

  • Day
    • Regular request
    • Segment 1
    • Segment 2
  • Week
    • Regular request
    • Segment 1
    • Segment 2
  • Month
    • Regular request
    • Segment 1
    • Segment 2

And if I see this correctly, then we need to check, for each request, when that information was last archived before requesting the archive trigger URL, because the regular monthly archive might have been archived much longer ago than the daily archive, and there could have been visits since its last archive time. The same for segments etc.

I would basically have expected to see some new code before each $this->request($url); where we check when this particular request was last archived (based on the start flag), and whether a visit was tracked since then.

E.g. we might consider a time_before_month_archive_considered_outdated of 24 hours to archive it only once a day, meaning the last archive may have been 23 hours ago, not just 1 hour as for the regular daily archive.
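
A hypothetical shape of that per-request guard (not the PR's code; the interface, method names, and parameters below are all illustrative):

```php
<?php
// Illustrative interface standing in for the lookups discussed above; not part of Matomo.
interface ArchiveFreshnessChecker
{
    /** ts_archived of the start flag of the latest done archive for this combination, or null. */
    public function getLastArchiveStartTime(int $idSite, string $period, string $date, string $segment): ?string;

    /** True if any visit was tracked for the site since the given datetime. */
    public function hasVisitsSince(int $idSite, string $sinceDatetime): bool;
}

// Hypothetical guard evaluated before each archiving request in CronArchive.
function shouldSkipArchiveRequest(ArchiveFreshnessChecker $checker, int $idSite,
                                  string $period, string $date, string $segment): bool
{
    $lastStart = $checker->getLastArchiveStartTime($idSite, $period, $date, $segment);
    if ($lastStart === null) {
        return false; // never archived, so the request must be issued
    }
    // Skip only if nothing was tracked since the last archiving run started.
    return !$checker->hasVisitsSince($idSite, $lastStart);
}

// Sketched usage inside the request loop:
//   if (shouldSkipArchiveRequest($checker, $idSite, $period, $date, $segment)) {
//       continue; // skip $this->request($url) for this site/period/date/segment
//   }
```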

diosmosis (Member, Author)

I see, so do the check I added before each request, for the specific archive in question, correct?


tsteur commented Mar 12, 2019

Yep

mattab modified the milestones: 3.9.0, 3.10.0 on Mar 18, 2019

tsteur commented Apr 11, 2019

@diosmosis left a few comments. I reckon the main thing would be to change CronArchive, as discussed, to check for new visits before each archiving request instead of just at the beginning. E.g. because a segment may have been processed more recently, or a longer time ago, which means that while we can maybe skip a daily archive, we maybe cannot skip a segment archive.

diosmosis (Member, Author)

@tsteur made the requested changes & merged w/ 3.x-dev. Not 100% sure about whether the cronarchive changes will have side effects or not

tsteur (Member) left a review

Did some basic testing and it worked, although I'm not sure about some edge cases; left some comments to discuss.

@@ -62,6 +62,7 @@ public function __construct(Parameters $params, $isTemporaryArchive, ArchiveWrit
     $this->isTemporaryArchive = $isTemporaryArchive;
     $this->archiveWriter = $archiveWriter ?: new ArchiveWriter($this->params, $this->isTemporaryArchive);
     $this->archiveWriter->initNewArchive();
+    $this->archiveWriter->insertRecord(ArchiveWriter::ARCHIVE_START_RECORD_NAME, 1);
tsteur (Member)

Not 100% sure... for segments, should we append the segment hash to this, similar to what we do for the done flag?

diosmosis (Member, Author)

It will have the idarchive for the specific done flag, so as long as we find the idarchive first, it doesn't need the suffix. That makes the code a bit simpler since we don't have to compute the hash when querying.


$nameCondition = self::getNameCondition(['VisitsSummary'], $segment);

$sql = "SELECT MAX(idarchive)
tsteur (Member)

Not too important, but I reckon if we were able to add the segment hash to the start flag, then we could maybe remove the query below, and in this query do something like select max(ts_archived) ... where ... and name = ArchiveWriter::ARCHIVE_START_RECORD_NAME . $segmentHash?
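
A sketch of what that single-query variant could look like if the segment hash were appended to the start record name (the thread below decides against the suffix; every name here is illustrative, not code from the PR):

```php
<?php
// Illustrative only: one query instead of "find idarchive, then read its start flag".
function getStartTimeViaSegmentedStartFlag(\PDO $db, string $archiveTable, int $idSite,
                                           string $date1, string $date2, int $period,
                                           string $segmentHash): ?string
{
    // 'start' stands in for ArchiveWriter::ARCHIVE_START_RECORD_NAME . $segmentHash.
    $startFlagName = 'start' . $segmentHash;

    $sql = "SELECT MAX(ts_archived)
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ? AND name = ?";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date1, $date2, $period, $startFlagName]);

    return $stmt->fetchColumn() ?: null;
}
```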

diosmosis (Member, Author)

(see response in below comment)

$sql = "SELECT ts_archived
FROM $table
WHERE idarchive = ? AND name = ?";
$bind = [$idArchive, ArchiveWriter::ARCHIVE_START_RECORD_NAME];
tsteur (Member)

Initially I was wondering if we should only look at archives where the archiving finished, in case the previous archiving had an error and was interrupted (the same idarchive should then also have a done flag with status OK or temporary).

Then I realised we likely cannot do this, because the current archiving might be in progress and that would be the reason the done flag is missing. => This should not be a problem though, because the archiving would then simply be skipped, as the logic should notice that an archiver is already running for that period/date.

So what if the last archive stopped in the middle because of an error? Then we would maybe not start that archive again under some circumstances. Say the start flag is 00:15am and the archiving broke in the middle... it would maybe not start the archiving again? Maybe, to play it safe, we should only look at the ts_archived of archives that have both the start and the done flag.

Not sure I'm explaining it well.

diosmosis (Member, Author)

My understanding was that's what we do; at least that's what $nameCondition is for. I.e., it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY)), so we're only looking at done archives.

Incidentally I think this is also why we shouldn't append the suffix to the start record, since we'd have to check the done flag value anyway.

diosmosis (Member, Author)

I know we've all said it before, but the archiving process confuses the heck out of me.

tsteur (Member)

> I know we've all said it before, but the archiving process confuses the heck out of me.

me too :)

tsteur (Member)

> My understanding was that's what we do; at least that's what $nameCondition is for. I.e., it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY)), so we're only looking at done archives.

Makes sense. Just need to double-check that we also look at that in the CronArchive.

diosmosis (Member, Author)

CronArchive uses ArchiveSelector::getLatestArchiveStartTimestampForToday() which uses the name condition, so I think we're good here.

@@ -863,6 +863,13 @@ protected function processArchiveDays($idSite, $lastTimestampWebsiteProcessedDay
return false;
}

if (!$this->hasThereBeenVisitsSinceLastArchiveTimestamp($idSite, 'day', $date)) {
tsteur (Member)

Do I see this correctly: if a user has no new visit recorded since the last archive, but recorded some visits in the past, we wouldn't trigger the archiving until there is actually a new visitor with a more recent date? Do you know what I mean? We probably can't work around this, I suppose.

diosmosis (Member, Author) commented May 9, 2019

I see what you mean. I think we could take this into account by using idvisit somehow... Maybe find the maximum idvisit that is <= ts_archived, then check if there is an idvisit for that site greater than that maximum idvisit? Does that sound right?
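
A sketch of that idvisit-based check, assuming the log_visit columns idvisit and visit_last_action_time; the function and its exact use of those columns are illustrative, not code from the PR:

```php
<?php
// Illustrative only: consider the archive stale if any visit row exists with a higher
// idvisit than the highest one the previous archiving run could have seen.
function hasNewVisitSinceArchive(\PDO $db, int $idSite, string $lastArchiveStart): bool
{
    // Highest idvisit the previous archiving run could have covered.
    $sql = "SELECT MAX(idvisit) FROM log_visit
             WHERE idsite = ? AND visit_last_action_time <= ?";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $lastArchiveStart]);
    $maxSeen = (int) $stmt->fetchColumn();

    // A higher idvisit means new, unarchived data, even if that visit was tracked
    // with an older date (e.g. a tracking request for the past).
    $sql = "SELECT 1 FROM log_visit WHERE idsite = ? AND idvisit > ? LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $maxSeen]);

    return (bool) $stmt->fetchColumn();
}
```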

tsteur (Member)

Something like this could work. It might be complicated though. I wonder if #14091 (comment) would maybe make a lot of things easier. In theory the archiving logic should all be simple; after all it's not that complicated: we generate an archive and assume it's completed. Once a new tracking request comes in, we make sure to rearchive it. If no tracking request comes in again, we don't need to archive it again.

diosmosis (Member, Author)

I'll see if I can add it w/ a test.

Your other comment makes sense to me; I'm actually not sure what temporary archives are currently used for. Removing them and making the process simpler might be difficult given how complicated the system currently is; it might be good to wait until Matomo 4?

tsteur (Member)

I reckon it might not be all that complicated? And maybe a lot easier than this logic? Not sure though. @mattab any thoughts on this idea maybe?

diosmosis (Member, Author)

I'm just not too sure if it will break anything, and we also have to deal w/ the fact that existing archives may have DONE_TEMPORARY. Seems like there could be room for bugs that would be hard to track down... Maybe we can schedule it for 3.11 at least, so when it's deployed on cloud you'll be around.

tsteur requested a review from mattab on May 9, 2019 01:56

tsteur commented May 9, 2019

Random thought @mattab @diosmosis

Why don't we get rid of temporary archives and always mark archives as completed once the archiving has finished? When a new tracking request then comes in, it will automatically mark the archive as invalid for the correct periods.
So we basically know we don't need to start archiving again as long as there is still a completed archive. If there are only invalidated archives, then we need to archive. That could make the whole logic very simple, and it would even work when a request in the past is suddenly tracked: it would invalidate the correct archives and automatically archive all needed reports again.
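
A very rough sketch of that invalidation-driven rule (illustrative only; the value 4 for DONE_INVALIDATED is an assumption, and segment archives, which use done<hash> flags, are ignored here):

```php
<?php
// Illustrative only: re-archive only when there is no valid completed archive,
// relying on tracking requests to invalidate archives for the affected periods.
function needsRearchiving(\PDO $db, string $archiveTable, int $idSite,
                          string $date1, string $date2, int $period): bool
{
    // Look at the newest plain 'done' flag for this site/period/date range.
    $sql = "SELECT value
              FROM $archiveTable
             WHERE idsite = ? AND date1 = ? AND date2 = ? AND period = ?
               AND name = 'done'
          ORDER BY ts_archived DESC
             LIMIT 1";
    $stmt = $db->prepare($sql);
    $stmt->execute([$idSite, $date1, $date2, $period]);
    $latestDoneValue = $stmt->fetchColumn();

    // No archive at all, or the newest done flag was invalidated (assumed value 4):
    // archiving has to run again. Otherwise the completed archive can be reused.
    return $latestDoneValue === false || (int) $latestDoneValue === 4;
}
```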

diosmosis (Member, Author)

I can create an issue to get rid of the done_temporary archives, otherwise I can move this to 3.11.

mattab modified the milestones: 3.10.0, 3.11.0 on May 20, 2019
diosmosis (Member, Author)

I'll try to get rid of temporary archives in a new PR and get back to this one.


tsteur commented Sep 30, 2019

@diosmosis what's the status here?

diosmosis (Member, Author)

@tsteur closing this, we can fix this in a better way since the invalidated archive change was merged

diosmosis closed this on Sep 30, 2019
diosmosis deleted the 13387-archiver-temp-improvements branch on September 30, 2019 20:35
Labels: Needs Review (PRs that need a code review), not-in-changelog (for issues or pull requests that should not be included in our release changelog on matomo.org)