@diosmosis opened this Pull Request on February 12th 2019 Member

WIP

Fixes #13387

@tsteur commented on March 7th 2019 Member

@diosmosis in getLatestArchiveTimestampForToday we check for the VisitsSummary name condition and in the other method where we get the ts_archived. Does this return the time of the first blob/numeric value written within an archive? We we would need to make sure to check for the first created archive because it can take an hour to finish an archive or longer... In theory even before the first archive is written it may take a while to fetch the data for the first archive and write this. I suppose we can't log the time in the option table when the archiving was last launched (before executing any query) for a specific archive and compare against this?

Also I think the PR will need to a similar logic in CronArchive for segments? Which may not have been launched for quite a while?

@diosmosis commented on March 7th 2019 Member

@tsteur made a comment in the original issue about the segment related change (the code already exists in CronArchive): https://github.com/matomo-org/matomo/issues/13387#issuecomment-468560620

Does this return the time of the first blob/numeric value written within an archive? We we would need to make sure to check for the first created archive because it can take an hour to finish an archive or longer... In theory even before the first archive is written it may take a while to fetch the data for the first archive and write this. I suppose we can't log the time in the option table when the archiving was last launched (before executing any query) for a specific archive and compare against this?

Wouldn't the first created archive be the first archive of the day? That would defeat the purpose of making this change? This SQL was supposed to find the latest ts_archived value for an archive that can be used to query visits. Then the idea is, from that point on until 'now', if there are no visits found, then we don't need to initiate archiving and can just use the existing archive. The archiving code is extremely dense, though, so I am not entirely sure if this code has that effect... if you have any examples where it breaks, would like to see them.

@tsteur commented on March 7th 2019 Member

It wouldn't need to use the first archive of the day I think (not 100% about everything though). Just the launch date of the last launch of the archiving for a given idsite,date1,date2,period,segment.

image

The thinking is that archiving may take several minutes or hours. If it uses the latest date, which may be 23:26:06 and there were visits between 2019-03-01 23:24:41 and 23:26:06 it may not be launched again as it thinks there was no new visit since 23:26:06?

I don't think we have that check in the code already as https://github.com/matomo-org/matomo/blob/3.x-dev/core/CronArchive.php#L1261-L1292 is around midnight.

@diosmosis commented on March 8th 2019 Member

If it uses the latest date, which may be 23:26:06 and there were visits between 2019-03-01 23:24:41 and 23:26:06 it may not be launched again as it thinks there was no new visit since 23:26:06?

If the ts_archived time is 23:26:06, then wouldn't include visits before that time (so, any visits between 23:24:41 => 23:26:06)? Ie, ts_archived is for when the archiving query finishes (AFAIK), so when we aggregate for nb_visits and create that archive for today, the nb_visits would include visits from start of day => ts_archived, right?

Of course, this doesn't apply to archiving dates in the past, just when running archiving for today.

@tsteur commented on March 8th 2019 Member

If the ts_archived time is 23:26:06, then wouldn't include visits before that time

only for some archives. See above in the screenshot they are all for the same archive. But some may include those visits, some not... as it takes several minutes to hours to finish it.

@diosmosis commented on March 8th 2019 Member

only for some archives. See above in the screenshot they are all for the same archive. But some may include those visits, some not... as it takes several minutes to hours to finish it.

to finish all archiving or just the one archiving query? in this case we're only looking at archives for VisitsSummary / no segment. though I guess if it gets the all plugins that might take more time... actually, i'm not sure how ts_archived is even set... will take a look quickly.

@diosmosis commented on March 8th 2019 Member

looks like the ts_archived value is set when inserting individual idarchives, so the done archives will be set to when archiving finishes, but the individual metrics will have the insert record for when they are actually inserted. So maybe we can just look for the ts_archived value of the nb_visits metric? We'd get the latest done idarchives, and check for the latest ts_archived nb_visits for those idarchives. Would that work?

@tsteur commented on March 8th 2019 Member

it should work in most cases. There may be edge cases where it takes also a few ms or seconds or minutes to query and insert the nb_visits metric. Random thought: before querying any data, should we insert maybe also a start flag similar to the done flag? that may be consistent and precise

@diosmosis commented on March 8th 2019 Member

So we add a start flag, then we look for the latest done/done-temp archive for the site/period/etc. and get the idarchive. Then we get the ts_archived for the start flag in that archive, and check if there were any visits between start.ts_archived -> now. Does that sound right to you?

@tsteur commented on March 8th 2019 Member

That sounds right 👍

@diosmosis commented on March 11th 2019 Member

@tsteur Updated

@tsteur commented on March 11th 2019 Member

Left some comments @diosmosis

have you looked into triggering the archiving of segments only when needed in the CronArchive?

Also wondering have you tested what happens when receiving a tracking request for an older date and this older date needs to be rearchived maybe because the archive was invalidated or so?

So far looked only at the code so haven't tested it yet. Might be quite tricky.

@diosmosis commented on March 11th 2019 Member

have you looked into triggering the archiving of segments only when needed in the CronArchive?

That's what this commit should achieve: https://github.com/matomo-org/matomo/pull/14091/commits/23b5a1871f3c332b4ebc46235814a914abc3e7af

Don't think I've tested a lot, will have to do more...

@tsteur commented on March 11th 2019 Member

That's what this commit should achieve: 23b5a18

I mean we may have archived some segments at a later time than the regular archive and we wouldn't need to issue a request for a specific segment (but other segments) or maybe the segment was last executed a longer time ago and it needs to issue the segment request but not the regular archiving.

It sounds all pretty tricky and not sure how easily things are doable.

@diosmosis commented on March 12th 2019 Member

I mean we may have archived some segments at a later time than the regular archive and we wouldn't need to issue a request for a specific segment (but other segments) or maybe the segment was last executed a longer time ago and it needs to issue the segment request but not the regular archiving.

This might require a larger refactoring of the cron archive process... maybe we can turn it into something a bit easier to manipulate in this way? I don't expect a change like this could be done by 3.9...

@tsteur commented on March 12th 2019 Member

It won't be doable in 3.9

Maybe we could split it in two parts though:
1) The core archiver / selector
2) Cron archiver not sending requests in the first place

This might make it also easier to review and I think most of the logic or all logic for 1) is maybe already there?

@diosmosis commented on March 12th 2019 Member

@tsteur they should both be there? The second one should be achieved by https://github.com/matomo-org/matomo/commit/23b5a1871f3c332b4ebc46235814a914abc3e7af (though there's some extra code that doesn't need to be there). Is there something off w/ it?

@tsteur commented on March 12th 2019 Member

Sorry I might be confused. I see you added some code for hadWebsiteTrafficSinceMidnightInTimezone but this does only take into consideration the overall traffic.

During one archive we may trigger heaps of requests for different periods and segments like

  • Day
    • Regular request
    • Segment 1
    • Segment 2
  • Week
    • Regular request
    • Segment 1
    • Segment 2
  • Month
    • Regular request
    • Segment 1
    • Segment 2

And if I see this correctly then we need to check for each request when that information was last archived before requesting the archive trigger URL. Because the regular monthly archive might have been archived way longer back than the daily archive and there could have been visits since the last archive time. The same for segments etc.

I would have basically kind of expected to see some new code before each $this->request($url); where we check when was this particular request last archived (based on start flag), and whether a visit was tracked since then.

Eg we might consider a time_before_month_archive_considered_outdated of 24 hours to archive it only once a day. Meaning the last archive was maybe 23 hours ago, not just 1 hour for the regular daily archive.

@diosmosis commented on March 12th 2019 Member

I see, so do the check I added before each request, for the specific archive in question, correct?

@tsteur commented on March 12th 2019 Member

Yep

@tsteur commented on April 11th 2019 Member

@diosmosis left a few comments. I reckon the main thing be to change as discussed CronArchive to check for new visits before each archiving request instead of just in the beginning. Eg because a segment has been processed more recently or a longer time ago which means while we can maybe skip a daily archive, we maybe cannot skip a segment archive.

@diosmosis commented on April 29th 2019 Member

@tsteur made the requested changes & merged w/ 3.x-dev. Not 100% sure about whether the cronarchive changes will have side effects or not

@tsteur commented on May 9th 2019 Member

Random thought @mattab @diosmosis

Why don't we get rid of temporary archives and always mark them as completed once the archiving finished. When then a new tracking request comes in, it will mark this archive automatically as invalid for the correct periods.
So we basically know we don't need to start archiving again, if there is still a completed archive. If there are only invalidated archives, then we need to archive. Could make the whole logic very simple and would work even when suddenly tracking a request in the past. It would invalidate the correct archives and automatically archive all needed reports again.

@diosmosis commented on May 17th 2019 Member

I can create an issue to get rid of the done_temporary archives, otherwise I can move this to 3.11.

@diosmosis commented on July 5th 2019 Member

I'll try to get rid of temporary tables in a new PR and get back to this one.

Powered by GitHub Issue Mirror