@diosmosis opened this Pull Request on April 30th 2020 Member

Fixes #11974

This PR is based on #15499.

Changes made:

  • We allow invalidating plugin-specific archives; this results in entries being added to archive_invalidations for the plugin archive. Such entries will not be added when we invalidate a normal archive, even if that invalidation also invalidates some dependent plugin-specific archives (like the ones created by VisitorInterest/Goals).
  • In core:archive, if we see a plugin-specific archive in the invalidations table, we archive it separately w/ the pluginOnly=1 query parameter.
  • In Rules.php, if we're in core:archive and pluginOnly=1 is used, we force creating a plugin-specific archive.
  • In Archive.php/Loader.php/etc. we don't just look for one type of archive (i.e., plugin vs. all); we look through every one and use the latest archive values. So if there is an "all" archive and a more recent ExamplePlugin archive, the archive data in the ExamplePlugin archive will be used for ExamplePlugin data, while other plugin data will come from "all".
  • Plugins can then, on activation, invalidate archives for their own plugin to schedule archiving for data in the past, without initiating archiving for every other plugin at the same time.
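
For illustration, the activation flow in the last bullet might look roughly like this in a plugin's activate() hook. This is a hedged sketch based on the markArchivesAsInvalidated() call quoted later in this thread, not the PR's actual code; MyPlugin, $idSites and $dates are placeholders.

```php
<?php
// Hedged sketch, not the PR's actual code: on activation, a plugin
// invalidates only its own archives so core:archive re-processes just
// this plugin's data for past dates, leaving other plugins untouched.
use Piwik\Archive\ArchiveInvalidator;
use Piwik\Container\StaticContainer;

class MyPlugin extends \Piwik\Plugin
{
    public function activate()
    {
        // $idSites and $dates are placeholders for the sites/dates to re-archive
        $invalidator = StaticContainer::get(ArchiveInvalidator::class);
        $invalidator->markArchivesAsInvalidated($idSites, $dates, 'month',
            $segment = null, $cascadeDown = false, $invalidateRanges = false,
            $plugin = 'MyPlugin'); // entry lands in archive_invalidations for this plugin only
    }
}
```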

Notes:

  • Browser archiving is not supported. If a user does not use core:archive, the invalidations don't do anything unless the plugin-specific archive existed before. I think this is OK since premium plugins need this feature and we don't expect them to be used w/o core:archive, correct?

An example PR can be found in Funnels.

@diosmosis commented on April 30th 2020 Member

@tsteur this PR is based off the other one so it might be hard to review, but it's ready for an initial pass. Once that's done and it looks good to be in core, I'll probably refactor CronArchive to test more of its internals. I don't want to do this now since both PRs are already pretty significant.

@tsteur commented on May 6th 2020 Member

@diosmosis just FYI, I started reviewing, but it's a bit difficult as this is based on https://github.com/matomo-org/matomo/pull/15499/files, so I reckon we'll try to get #15499 merged first, which will make it easier.

@diosmosis commented on May 7th 2020 Member

@tsteur you should be able to review this one now

@tsteur commented on May 8th 2020 Member

@diosmosis I had a first look for a while and am starting to understand a few things. Must have been so hard to develop!

I've had a look at the Funnels pull request to see how plugins basically trigger rearchiving of data for that plugin. Do I see this right: when e.g. someone adds or changes a custom report, Custom Reports will mark archives as invalid like below?

     $invalidator->markArchivesAsInvalidated($idSites, $dates, 'month', $segment = null,
         $cascadeDown = false, $invalidateRanges = false, $plugin = 'CustomReports');

I wonder if we need something similar to process_new_segments_from, where we configure how far back we rearchive things by default? Like process_plugin_specific_archives_from=beginning_of_time|lastN. I guess last_edit_time and creation_time wouldn't quite work because the creation time is always different for each custom report, each funnel, each ... We could then maybe have a method in core that automatically goes back the number of months as needed, and all the plugin would need to do is call StaticContainer::get(ArchiveInvalidator::class)->rearchivePlugin($pluginName, $idSite). Basically, the Funnels activate method would be mostly in core.
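
That core helper could look roughly like this. Entirely hypothetical: rearchivePlugin(), getDatesGoingBack() and the process_plugin_specific_archives_from option don't exist; this only sketches the idea that the "how far back" logic would live in core.

```php
<?php
// Hypothetical sketch of the proposed core helper; none of these
// members exist in Matomo as written here.
class ArchiveInvalidator
{
    public function rearchivePlugin(string $pluginName, int $idSite): void
    {
        // e.g. process_plugin_specific_archives_from = beginning_of_time|lastN
        $rearchiveFrom = $this->config['process_plugin_specific_archives_from'] ?? 'last6';

        // getDatesGoingBack() is a hypothetical helper that turns the
        // setting into a list of past dates for this site.
        $dates = $this->getDatesGoingBack($rearchiveFrom, $idSite);

        // Only this plugin's archives are invalidated, so core:archive
        // won't re-process every other plugin at the same time.
        $this->markArchivesAsInvalidated([$idSite], $dates, 'month',
            $segment = null, $cascadeDown = false, $invalidateRanges = false,
            $plugin = $pluginName);
    }
}
```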

In the Funnels plugin we wouldn't run this on activate, btw., but instead when a new funnel is added or changed. Same for Custom Reports when a custom report is added or changed. A/B Testing might also later trigger a rearchive when an A/B test is changed, etc.

I suppose the plugin would always need to rearchive everything? Say there are 20 custom reports and a new one is added; I suppose we'd then need to rearchive all custom reports for the given date range? I reckon, if we know in the Custom Reports Archiver that a plugin-specific archive is requested, we could maybe check which custom reports were changed or created recently and only archive those, but I'm not 100% sure yet if that works.

@tsteur commented on May 8th 2020 Member

I suppose we would also need a flag for whether to also reprocess all segments for the given plugin or not.

@diosmosis commented on May 11th 2020 Member

@tsteur

looking at it again right now my biggest concern is performance and IO. Do you think with this logic it is possible to only trigger an archive for a specific part within a plugin?

Yes, this is possible via standardizing the setArchiveOnlyReport() approach. Plugins will have to handle this themselves though.

Will the archiver know which archives need to be regenerated eg 900 seconds (see time_before_today_archive_considered_outdated) later at 8:22pm? (We'd expect the archiver not to trigger a new archive because the last full archive was done at 8:20pm maybe (120 seconds later only))

I'm not sure I 100% understand, but here are some possible cases:

  • an invalidate for all plugins for same idsite/period is set
  • an invalidate for funnels (ID 4) is set
  • archiver picks up all plugins archive first
  • then plugin specific

What happens: I'm not sure exactly what happens here, but ideally we would skip the plugin-specific archive. In fact, we can probably add this logic if it's not there already (or move it if it is) to canSkipThisArchive(), which will apply it in every place that matters.

  • a plugin specific archive for funnels (funnel ID 4) is archived first at 8pm
  • then we start an all plugins archive at say 8:20pm

What happens: this logic isn't implemented, but here's what could happen. First, setArchiveOnlyReport() takes an array of reports. Second, in PluginsArchiver, before archiving, we select every blob/numeric name in the queried idarchives whose ts_archived is older than time_before_today_archive_considered_outdated. We set those via setArchiveOnlyReport() and archive each plugin. This might be overkill though, given the plugin-specific re-archive shouldn't happen often. Anyway, the all-plugins archive then won't contain the still-valid data.

And will the logic always use the latest archive for a funnel? It would need to use the one from archive ID 5 in this case and ignore the plugin-specific report?

That is the intent. When querying archive data we're sorting by idarchive desc, so it should pick the data w/ the latest idarchive. We could make it more specific by sorting by ts_archived desc.
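
The "use the latest data" rule can be illustrated with a self-contained sketch (the row shape here is illustrative, not Matomo's actual archive-table schema):

```php
<?php
// Self-contained sketch of the selection rule described above: when a
// record exists in several idarchives (e.g. an "all plugins" archive and
// a newer plugin-specific one), the row from the most recently written
// archive wins.

function selectLatestRecords(array $rows): array
{
    // Sort newest first, so the first value seen per record name wins.
    usort($rows, function ($a, $b) {
        return strcmp($b['ts_archived'], $a['ts_archived']);
    });

    $result = [];
    foreach ($rows as $row) {
        if (!array_key_exists($row['name'], $result)) {
            $result[$row['name']] = $row['value'];
        }
    }
    return $result;
}

$rows = [
    ['name' => 'nb_visits',      'value' => 100,   'ts_archived' => '2020-05-11 08:00:00'], // all plugins
    ['name' => 'Funnels_record', 'value' => 'old', 'ts_archived' => '2020-05-11 08:00:00'], // all plugins
    ['name' => 'Funnels_record', 'value' => 'new', 'ts_archived' => '2020-05-11 08:20:00'], // plugin-specific
];

$latest = selectLatestRecords($rows);
// Funnels_record comes from the newer plugin-specific archive,
// nb_visits from the "all plugins" archive.
```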

@diosmosis commented on May 11th 2020 Member

When querying archive data we're sorting by idarchive desc, so it should pick the data w/ the latest idarchive.

Actually we're not doing that but we're supposed to be.

@tsteur commented on May 12th 2020 Member

Yes, this is possible via standardizing the setArchiveOnlyReport() approach. Plugins will have to handle this themselves though.

Does this mean in the archive invalidation logic we'd also store an ID, and then somehow instead of $plugin.get we'd trigger some CoreAdmin.archivePlugin(pluginName, idToArchive) method or so? Then this API, for example, directly creates an instance of the archiver and triggers the archive for that plugin (and calls setArchiveOnlyReport() should an ID be specified)? I reckon if that exists, things could/should work.

BTW, it would be good to also add the things mentioned in https://github.com/matomo-org/matomo/pull/15889#issuecomment-626714794

@diosmosis commented on May 12th 2020 Member

Does this mean in the archive invalidation logic we'd also store an ID, and then somehow instead of $plugin.get we'd trigger some CoreAdmin.archivePlugin(pluginName, idToArchive) method or so?

Something like this, but it's possible we invalidate an archive that doesn't exist, in which case there won't be an ID.

However, if you're assuming the code will just replace the existing report in an archive, then this isn't that simple due to what we discussed in Slack. If not, can you think about my comments on creating the new plugin-specific archive?

@tsteur commented on May 13th 2020 Member

Something like this, but it's possible we invalidate an archive that doesn't exist, in which case there won't be an ID.

@diosmosis not sure what you mean here? I was thinking e.g. we could add an ID to the archive invalidations table, and a plugin may provide an additional ID when invalidating a report, which CronArchive would then forward when archiving that invalidated archive entry. Not sure it's clear what I mean? We can still add a new archive ID and not alter an existing archive.

@diosmosis commented on June 4th 2020 Member

@tsteur finished an implementation for setArchiveOnlyReport(). I still want to create a POC PR for CustomReports and need to fix the tests, but you can take a look at the approach.

I introduced a new done value, DONE_PARTIAL, for archives w/ only some (or just one) report. It is treated as an invalidated archive when querying: if browser-triggered archiving is enabled, we ignore these archives and re-archive if a full archive is not found. If core:archive is used and there is a partial archive, we look in it along w/ the other latest archives when querying. When archiving in core:archive, we ignore it and only look for "all" or plugin archives (this way, if there is a newer partial archive, we still re-archive everything).
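
The querying rule described here can be sketched like so (the flag values are illustrative placeholders, not Matomo's actual done-flag values):

```php
<?php
// Illustrative sketch of when an archive's done flag makes it usable
// for querying, per the rules described above. Constants are
// placeholders, not Matomo's real numeric done values.
const DONE_OK          = 'ok';
const DONE_INVALIDATED = 'invalidated';
const DONE_PARTIAL     = 'partial';

function isArchiveUsableForQuerying(string $doneFlag, bool $browserArchivingEnabled): bool
{
    if ($doneFlag === DONE_PARTIAL) {
        // Partial archives only supplement full ones, and only when
        // core:archive is responsible for (re)creating full archives.
        // (When archiving, core:archive also ignores them so everything
        // still gets re-archived if needed.)
        return !$browserArchivingEnabled;
    }
    return in_array($doneFlag, [DONE_OK, DONE_INVALIDATED], true);
}
```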

@tsteur commented on June 4th 2020 Member

@diosmosis was the done_partial needed? For some specific reason maybe?

@diosmosis commented on June 4th 2020 Member

was the done_partial needed? For some specific reason maybe?

Yes, to only re-archive a specific report, e.g. for Custom Reports. We can't update the existing report because we don't have the logic to update blob reports like that. If we add a new archive, we have to make sure it's not treated like a full archive, since there will just be one report in it.

@diosmosis commented on June 5th 2020 Member

Might be able to simplify some w/ creation of the new DONE_PARTIAL flag, I'll check.

@diosmosis commented on June 10th 2020 Member

@tsteur simplified the code a bit and finished the poc

@diosmosis commented on June 11th 2020 Member

btw a comment explaining why we might get multiple idArchives and which one will be used for what might be good?

Will add a comment. The reason is that we can now have full archives (DONE_OK/DONE_INVALIDATED) and partial archives w/ single reports (DONE_PARTIAL), so we have to look in both and use the latest data.

@diosmosis commented on June 14th 2020 Member

Updated.

@diosmosis commented on June 15th 2020 Member

@tsteur

I think I know what the ordering issue is, will look for a fix.

Another issue I'm not 100% sure about: I looked into the archive tables, and under the most recent archives I would have expected to see only CustomReports_customreport_16_0_nb_visits and done archive names. But it also archived several other metrics, see screenshot. I get that we maybe need nb_visits for our standard logic (ideally we wouldn't need to archive any of that), but maybe we could ignore the other metrics?

I suspect that is due to the Loader.php code needing the visits in order to check if we can skip the archive. I recently added an optimization, Loader::canSkipThisArchive(), which checks if there is a visit in log_visit; I wonder if we can remove the other one that invokes VisitsSummary (in Loader::prepareCoreMetricsArchive())? Will there ever be a need to find exactly how many visits there are before archiving, if we know we can't (or can) find at least one visit in the log table?

EDIT: actually I guess it's needed for segments, since canSkipThisArchive() doesn't apply a segment (don't want to scan everything in the range).

@tsteur commented on June 15th 2020 Member

I wonder if we can remove the other one that invokes VisitsSummary (in Loader::prepareCoreMetricsArchive())? Will there ever be a need to find exactly how many visits there are before archiving, if we know we can't (or can) find at least one visit in the log table?

I'm not 100% sure why it's there, but I presume it might still be needed for browser archiving maybe, as it can basically skip running the entire archive if there are no visits?

Wondering... if we archive only a specific report, in aggregateDayVisitsMetrics could we maybe at least skip $this->archiveProcessor->insertNumericRecords($metrics);? That way we'd at least save some IO etc., even if we still have to query that data. Or only insert nb_visits if that is needed for some reason?

The code there also looks quite tricky, and I'm not fully understanding why we potentially run callAggregateCoreMetrics twice (https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor/Loader.php#L162-L181), or why it runs in the first method in some cases and in the second method in others...

I'm actually not sure why the loader writes these nb_visits, as the value returned from Loader::prepareArchive() doesn't seem to really be used (except now in the new archiveReports method), so maybe all of that wouldn't even be needed? I guess it's needed to determine whether we accept the idArchive or not, but that might actually be wrong behaviour. E.g. if there are no visits in https://github.com/matomo-org/matomo/blob/4.x-dev/core/ArchiveProcessor/Loader.php#L138-L140, then maybe it shouldn't ignore the idarchive? If I understand it correctly, it also means it would try to archive every time when there is no visit, instead of respecting the configured TTL and e.g. only archiving again 900s later?

Sorry I'm not of better help.

@diosmosis commented on June 15th 2020 Member

I'm not 100% sure why it's there, but I presume it might still be needed for browser archiving maybe, as it can basically skip running the entire archive if there are no visits?

It's needed for when segments are archived. If we find we have 0 visits, then we shouldn't initiate archiving. However, maybe we can use the segment table optimization for this?

If we launch archiving for a segment, we have to run this query to create the temp table anyway, and even if there are no visits, we'd have to run it to get nb_visits as well, correct? So we can do it once, and then skip archiving for the segment if there are no rows in the temp table. This would mean not having to archive those VisitsSummary metrics first.
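
The proposed flow could be sketched like this (the callables stand in for Matomo internals and are hypothetical; nothing here is implemented in core):

```php
<?php
// Sketch of the proposed segment-archiving flow. $createTempTable
// stands in for the one query that builds the segment temp table and
// returns the number of matching rows; $archive stands in for launching
// archiving. The point: one query decides both "how many visits" and
// "can we skip", so VisitsSummary doesn't need to be archived first
// just to learn nb_visits.

function archiveSegmentIfNeeded(callable $createTempTable, callable $archive): bool
{
    $matchingRows = $createTempTable();
    if ($matchingRows === 0) {
        return false; // no visits match the segment; skip archiving entirely
    }
    $archive();
    return true;
}
```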

The only time it wouldn't be as fast is if for some reason there already is an nb_visits value present. I guess we can check for that pretty easily though. @tsteur what do you think about this? I guess I can just try it and see if it works.

@diosmosis commented on June 16th 2020 Member

Wondering... if we archive only a specific report, in aggregateDayVisitsMetrics could we maybe at least skip $this->archiveProcessor->insertNumericRecords($metrics);? That way we'd at least save some IO etc., even if we still have to query that data. Or only insert nb_visits if that is needed for some reason?

I think maybe this isn't needed. VisitsSummary metrics should already be present when rearchiving data in the past. If they're not there, then it would have to be a development setup. So I think this is a bit of a non-issue?

@tsteur commented on June 17th 2020 Member

I think maybe this isn't needed. VisitsSummary metrics should already be present when rearchiving data in the past. If they're not there, then it would have to be a development setup. So I think this is a bit of a non-issue?

Not 100% sure what you mean with non-issue? I reckon we could try not writing these archive entries and assume they might not be needed, but nb_visits will probably be needed when fetching that archive next time in the loader.

@diosmosis commented on June 17th 2020 Member

Not 100% sure what you mean with non-issue? I reckon we could try not writing these archive entries and assume they might not be needed, but nb_visits will probably be needed when fetching that archive next time in the loader.

I mean that in a normal setup, when archiving the last month for a plugin, nb_visits should already exist. Or was there already a valid archive when you tested, and it initiated another done.VisitsSummary archive too?

@tsteur commented on June 17th 2020 Member

Or was there already a valid archive when you tested, and it initiated another done.VisitsSummary archive too?

There was already a valid generic archive. This might not always be the case though. That's why I reckon we always have to write nb_visits, but not really the others. I wonder if ideally it could check whether there is a valid generic archive already, and if so, use that nb_visits value instead of querying this metric again from the raw data.

However, this might not be worth the effort for now, as it would require quite some extra code, trying to find the idArchive of the generic archive, etc.

I reckon if we do a plugin-specific (re)archive, then we could still run that query for the core metrics, but at least only insert the nb_visits metric, not the others.
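
That suggestion amounts to something like the following (illustrative only; not code from this PR):

```php
<?php
// Illustrative sketch: when running a plugin-specific (re)archive, still
// compute the core metrics but only write nb_visits, skipping the other
// numeric records to save IO.

function filterMetricsForPluginArchive(array $metrics, bool $isPluginSpecificArchive): array
{
    if (!$isPluginSpecificArchive) {
        return $metrics; // normal archive: write all core metrics
    }
    // Plugin-specific archive: keep only nb_visits.
    return array_intersect_key($metrics, ['nb_visits' => true]);
}

$metrics = [
    'nb_visits'        => 6,
    'nb_actions'       => 20,
    'max_actions'      => 5,
    'sum_visit_length' => 360,
];
// For a plugin-specific archive, only nb_visits would be inserted.
```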

@diosmosis commented on July 7th 2020 Member

@tsteur this can be reviewed again. I ensured the latest archive is always used (tested by new LoaderTest tests). It will still insert the other metrics, because otherwise Matomo will think their value is 0.

@diosmosis commented on July 20th 2020 Member

The reason I'm thinking about this is e.g. when we set up a new cloud account: we then install several plugins, e.g. Funnels. That would potentially create heaps of archive invalidations, even though it's a fresh Matomo and nothing has been tracked yet.

Added.

Actually I think I found it in https://github.com/matomo-org/matomo/pull/15889/files#diff-f0dc4d4d133f84005fcebc1e10f155c0R202 and it's there, so it might be fine to ignore.

Is this referring to what I added? That is on the core:archive side, specifically for segment archiving. There is also Loader::canSkipThisArchive(), which runs during core:archive and in Archive.php, so empty archives won't be created regardless. But it would still be a small performance boost to not have to consider these sites at all.

@diosmosis commented on July 20th 2020 Member

@tsteur I did some debugging for the TTL issue and found the cause.

Configure:

[General]
time_before_today_archive_considered_outdated = 1

  • Track a request
  • Run core:archive
  • It will archive day, week, month, and year
  • Track a request
  • Run core:archive again
  • Now it will archive day, week, month, and year again, even though it should only archive day because of the 900s TTL for week/month/year. Otherwise we'd archive these periods too often.

The TTL for a period is determined by Rules::getPeriodArchiveTimeToLiveDefault(). If the period is day, the today TTL is used. Otherwise the time_before_%s_archive_considered_outdated TTL is used; however, if that config isn't set, the today TTL is used.

I'm guessing this logic is incorrect... can you clarify how you think it should work? The today TTL if date == today, the period TTL otherwise, and maybe 900 if the period TTL is not defined in config? The today TTL can also be the UI setting option.
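
For reference, the TTL selection as described in this comment boils down to the following self-contained sketch (it mirrors the described behaviour, not the actual Rules implementation; the 900s fallback for the today TTL is an assumption):

```php
<?php
// Sketch of the TTL selection described above: day uses the today TTL;
// other periods use their own time_before_%s_archive_considered_outdated
// setting, falling back to the today TTL when unset.

function getPeriodArchiveTtl(array $config, string $period): int
{
    $todayTtl = $config['time_before_today_archive_considered_outdated'] ?? 900;
    if ($period === 'day') {
        return $todayTtl;
    }
    // e.g. time_before_week_archive_considered_outdated
    $key = sprintf('time_before_%s_archive_considered_outdated', $period);
    return $config[$key] ?? $todayTtl;
}

$config = [
    'time_before_today_archive_considered_outdated' => 1,
    'time_before_week_archive_considered_outdated'  => 900,
];
// week has its own TTL; month falls back to the today TTL.
```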

@tsteur commented on July 21st 2020 Member

@diosmosis how the config behaves is fine (i.e. that today is used when no week, month, or year setting is specified).

I just tested it this way:

[General]
time_before_today_archive_considered_outdated = 1
time_before_week_archive_considered_outdated = 900
time_before_year_archive_considered_outdated = 900

using the same flow, and would have expected the larger TTL to be recognised for week and year.

I basically did this:

  • run core:archive
  • track a request
  • run core:archive again
    • I was expecting only today and month to be archived, but not week and year, since it had only been a few seconds and not 900 seconds

Here's the verbose output if that helps:

START
Starting Matomo reports archiving...
Checking for queued invalidations...
  Will invalidate archived reports for today in site ID = 1's timezone (2020-07-21 00:00:00).
Found existing valid archive for yesterday, skipping invalidation...
DEBUG [2020-07-21 01:13:06] 84702  Earliest created time of segment 'visitIp>=15.15.15.15;visitIp<=20.20.20.20,visitIp==21.21.21.21,visitIp==22.22.22.22,actionUrl=@foo' w/ idSite = 1 is found to be 2019-02-04. Latest edit time is found to be 2019-02-04.
DEBUG [2020-07-21 01:13:06] 84702  Earliest created time of segment 'actions>=2' w/ idSite = 7 is found to be 2020-03-30. Latest edit time is found to be 2020-03-30.
Done invalidating
Start processing archives for site 1.
DEBUG [2020-07-21 01:13:06] 84702  Processing invalidation: [idinvalidation = 2365, idsite = 1, period = day(2020-07-21 - 2020-07-21), name = done].
DEBUG [2020-07-21 01:13:06] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2366, idsite = 1, period = week(2020-07-20 - 2020-07-26), name = done]
DEBUG [2020-07-21 01:13:06] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2367, idsite = 1, period = month(2020-07-01 - 2020-07-31), name = done]
DEBUG [2020-07-21 01:13:06] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2368, idsite = 1, period = year(2020-01-01 - 2020-12-31), name = done]
DEBUG [2020-07-21 01:13:06] 84702  No next invalidated archive.
DEBUG [2020-07-21 01:13:06] 84702  Starting archiving for ?module=API&method=CoreAdminHome.archiveReports&idSite=1&period=day&date=2020-07-21&format=json&trigger=archivephp
DEBUG [2020-07-21 01:13:07] 84702  Running command: /usr/local/Cellar/php/7.4.6_1/bin/php -q  /Users/thomassteur/Development/piwik/console climulti:request -q --matomo-domain='' --superuser 'module=API&method=CoreAdminHome.archiveReports&idSite=1&period=day&date=2020-07-21&format=json&trigger=archivephp&pid=9ccd666e3db84c5560cae4743d5bc094b7704451dae04f55fe446b14558f401dcc46d5c8ab73ca0baf2608f4b3c2a805852e0&runid=84702' > /Users/thomassteur/Development/piwik/tmp/api5.matomo.cloud/climulti/9ccd666e3db84c5560cae4743d5bc094b7704451dae04f55fe446b14558f401dcc46d5c8ab73ca0baf2608f4b3c2a805852e0.output 2>&1 &
INFO [2020-07-21 01:13:07] 84702  Archived website id 1, period = day, date = 2020-07-21, segment = '', 0 visits found. Time elapsed: 0.595s
DEBUG [2020-07-21 01:13:07] 84702  Processing invalidation: [idinvalidation = 2366, idsite = 1, period = week(2020-07-20 - 2020-07-26), name = done].
DEBUG [2020-07-21 01:13:07] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2367, idsite = 1, period = month(2020-07-01 - 2020-07-31), name = done]
DEBUG [2020-07-21 01:13:07] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2368, idsite = 1, period = year(2020-01-01 - 2020-12-31), name = done]
DEBUG [2020-07-21 01:13:07] 84702  No next invalidated archive.
DEBUG [2020-07-21 01:13:07] 84702  Starting archiving for ?module=API&method=CoreAdminHome.archiveReports&idSite=1&period=week&date=2020-07-20&format=json&trigger=archivephp
DEBUG [2020-07-21 01:13:07] 84702  Running command: /usr/local/Cellar/php/7.4.6_1/bin/php -q  /Users/thomassteur/Development/piwik/console climulti:request -q --matomo-domain='' --superuser 'module=API&method=CoreAdminHome.archiveReports&idSite=1&period=week&date=2020-07-20&format=json&trigger=archivephp&pid=c6831c061564d18e27768d2214fd0040eebe919e967c4d93012e4ae074da0a38b735e84d4e156d3786caea624933965cef860&runid=84702' > /Users/thomassteur/Development/piwik/tmp/api5.matomo.cloud/climulti/c6831c061564d18e27768d2214fd0040eebe919e967c4d93012e4ae074da0a38b735e84d4e156d3786caea624933965cef860.output 2>&1 &
INFO [2020-07-21 01:13:08] 84702  Archived website id 1, period = week, date = 2020-07-20, segment = '', 1 visits found. Time elapsed: 1.174s
DEBUG [2020-07-21 01:13:09] 84702  Processing invalidation: [idinvalidation = 2367, idsite = 1, period = month(2020-07-01 - 2020-07-31), name = done].
DEBUG [2020-07-21 01:13:09] 84702  Found archive with intersecting period with others in concurrent batch, skipping until next batch: [idinvalidation = 2368, idsite = 1, period = year(2020-01-01 - 2020-12-31), name = done]
DEBUG [2020-07-21 01:13:09] 84702  No next invalidated archive.
DEBUG [2020-07-21 01:13:09] 84702  Starting archiving for ?module=API&method=CoreAdminHome.archiveReports&idSite=1&period=month&date=2020-07-01&format=json&trigger=archivephp
DEBUG [2020-07-21 01:13:09] 84702  Running command: /usr/local/Cellar/php/7.4.6_1/bin/php -q  /Users/thomassteur/Development/piwik/console climulti:request -q --matomo-domain='' --superuser 'module=API&method=CoreAdminHome.archiveReports&idSite=1&period=month&date=2020-07-01&format=json&trigger=archivephp&pid=08097f6b2594b0d8d8870c0e28928bd1d3201024cf88a37d98bd97500465290faa609dd0685c2033157118fd01760471d36d0&runid=84702' > /Users/thomassteur/Development/piwik/tmp/api5.matomo.cloud/climulti/08097f6b2594b0d8d8870c0e28928bd1d3201024cf88a37d98bd97500465290faa609dd0685c2033157118fd01760471d36d0.output 2>&1 &
INFO [2020-07-21 01:13:20] 84702  Archived website id 1, period = month, date = 2020-07-01, segment = '', 6 visits found. Time elapsed: 11.088s
DEBUG [2020-07-21 01:13:20] 84702  Processing invalidation: [idinvalidation = 2368, idsite = 1, period = year(2020-01-01 - 2020-12-31), name = done].
DEBUG [2020-07-21 01:13:20] 84702  No next invalidated archive.
DEBUG [2020-07-21 01:13:20] 84702  Starting archiving for ?module=API&method=CoreAdminHome.archiveReports&idSite=1&period=year&date=2020-01-01&format=json&trigger=archivephp
DEBUG [2020-07-21 01:13:20] 84702  Running command: /usr/local/Cellar/php/7.4.6_1/bin/php -q  /Users/thomassteur/Development/piwik/console climulti:request -q --matomo-domain='' --superuser 'module=API&method=CoreAdminHome.archiveReports&idSite=1&period=year&date=2020-01-01&format=json&trigger=archivephp&pid=589576f25540db10b81893639dd3e920d3cf17bb0a07d67bc1484c01a9cd11462f629b38220baf5c29ae302101fc1fd2a6000&runid=84702' > /Users/thomassteur/Development/piwik/tmp/api5.matomo.cloud/climulti/589576f25540db10b81893639dd3e920d3cf17bb0a07d67bc1484c01a9cd11462f629b38220baf5c29ae302101fc1fd2a6000.output 2>&1 &
INFO [2020-07-21 01:14:07] 84702  Archived website id 1, period = year, date = 2020-01-01, segment = '', 114 visits found. Time elapsed: 46.698s
DEBUG [2020-07-21 01:14:07] 84702  No next invalidated archive.

@diosmosis commented on July 29th 2020 Member

@tsteur tests are fixed and the issue you mentioned is fixed, can you review again? There is one apparently random failure that keeps showing up, but I'm not prioritizing it.

@tsteur commented on July 30th 2020 Member

@diosmosis some tests seem to be failing now again. Can you check them out and merge once they pass? 🎉

@diosmosis commented on August 2nd 2020 Member

@tsteur build is passing now

This Pull Request was closed on August 4th 2020