@diosmosis opened this Pull Request on April 30th 2020 Member

Fixes #11974

This PR is based off of #15499 .

Changes made:

  • we allow invalidating plugin specific archives, this results in entries being added to archive_invalidations for the plugin archive. they will not be added if we invalidate a normal archive, even if as part of that invalidation some dependent plugin specific archives (like ones created by VisitorInterest/Goals) are invalidated.
  • in core:archive, if we see a plugin specific archive in the invalidations table, we archive it separately w/ the pluginOnly=1 query parameter.
  • in Rules.php if in core:archive and pluginOnly=1 is used, we force creating a plugin specific archive.
  • in Archive.php/Loader.php/etc. we don't just look for one type of archive (ie, plugin vs all), we look through every one and use the latest archive values. so if there is an all archive and a ExamplePlugin archive that is more recent, the archive data in ExamplePlugin will be used for ExamplePlugin data, but other plugin data will come from all.
  • plugins on activation can then invalidate archives for their own plugin to schedule archiving for data in the past, without initiating archiving for every other plugin at the same time.

Notes:

  • browser archiving is not supported. If a user does not use core:archive, the invalidations don't do anything unless the plugin specific archive existed before. I think this is ok since premium plugins need this feature and we don't expect them to be used w/o core:archive, correct?

Example PR can be found in funnels

@diosmosis commented on April 30th 2020 Member

@tsteur this PR is based off the other one so might be hard to review, but it's ready for an initial one. once that's done and it looks good to be in core, then I'll probably refactor CronArchive to test more of its internals. Don't want to do this now since both PRs are already pretty significant.

@tsteur commented on May 6th 2020 Member

@diosmosis just fyi started reviewing but it's bit difficult as it is based on https://github.com/matomo-org/matomo/pull/15499/files so I reckon we will be trying to get #15499 merged first which will make it easier

@diosmosis commented on May 7th 2020 Member

@tsteur you should be able to review this one now

@tsteur commented on May 8th 2020 Member

@diosmosis had a first look for a while and starting to understand few things. Must have been so hard to develop!

I've had a look at the funnels pull request to see how plugins basically trigger rearchiving of data for that plugin. Do I basically see this right that when eg someone adds or changes a custom report, then custom reports will mark archives as invalid like below?

     $invalidator->markArchivesAsInvalidated($idSites, $dates, 'month', $segment = null, $cascadeDown = false, $invalidateRanges = false,
            $plugin = 'CustomReports');

I wonder if we need something similar than process_new_segments_from where we configure how far back by default we rearchive things? like process_plugin_specific_archives_from=beginning_of_time|lastN. I guess last_edit_time and creation_time wouldn't quite work because the creation time is always different for each custom report, each funnel, each ... We could then maybe have a method in core that automatically goes back the number of months as needed and all the plugin would need to do be like StaticContainer::get(ArchiveInvalidator::class)->rearchivePlugin($pluginName, $idSite). Basically, the funnels activate method be mostly in core.

In Funnels plugin we wouldn't run this on activate btw. but instead when a new funnel is added or changed. Some for custom reports when a custom report was added or changed. A/B testing might also trigger later a rearchive when the A/B test was changed etc.

I suppose the plugin would always need to rearchive everything? Say there are 20 custom reports and a new one was added. I suppose we'd then need to rearchive all custom reports for the given date range? I reckon, if we know in the Custom Reports Archiver that a plugin specific archive is requested, we could maybe check which custom reports were changed or created recently and only archive these but not 100% sure yet if that works.

@tsteur commented on May 8th 2020 Member

I suppose we would also need to have a flag whether to also reprocess all segments for the given plugin or not.

@diosmosis commented on May 11th 2020 Member

@tsteur

looking at it again right now my biggest concern is performance and IO. Do you think with this logic it is possible to only trigger an archive for a specific part within a plugin?

Yes, this is possible via standardizing the setArchiveOnlyReport() approach. Plugins will have to handle this themselves though.

Will the archiver know which archives need to be regenerated eg 900 seconds (see time_before_today_archive_considered_outdated) later at 8:22pm? (We'd expect the archiver not to trigger a new archive because the last full archive was done at 8:20pm maybe (120 seconds later only))

I'm not sure I 100% understand, but here some possible cases:

  • an invalidate for all plugins for same idsite/period is set
  • an invalidate for funnels (ID 4) is set
  • archiver picks up all plugins archive first
  • then plugin specific

what happens: not sure exactly what happens here, but ideally we would skip the plugin specific archive. in fact we can probably add this logic if it's not there already (or move it if it is) to canSkipThisArchive(), which will apply it in every place that matters.

  • a plugin specific archive for funnels (funnel ID 4) is archived first at 8pm
  • then we start an all plugins archive at say 8:20pm

what happens: this logic isn't implemented, but here's what could happen. first setArchiveOnlyReport() takes an array of reports. second, in PluginsArchiver before archiving we select every blob/numeric name in the queried idarchives w/ ts_archived that is older than time_before_today_archive_considered_outdated. we set that as setArchiveOnlyReport() and archive each plugin. this might be overkill though given the plugin specific re-archive shouldn't happen often. anyway, then the all plugins archive won't contain the still valid data.

And will the logic always use the latest archive for a funnels? It would need to use the one from Archive ID 5 in this case and ignore the plugin specific report?

That is the intent. When querying archive data we're sorting by idarchive desc, so it should pick the data w/ the latest idarchive. We could make it more specific by sorting by ts_archived desc.

@diosmosis commented on May 11th 2020 Member

When querying archive data we're sorting by idarchive desc, so it should pick the data w/ the latest idarchive.

Actually we're not doing that but we're supposed to be.

@tsteur commented on May 12th 2020 Member

Yes, this is possible via standardizing the setArchiveOnlyReport() approach. Plugins will have to handle this themselves though.

Does this mean in the archive invalidate logic we'd also store an ID and then somehow instead of $plugin.get we trigger some CoreAdmin.archivePlugin(pluginName, idToArchive) method or so? Then this API for example directly creates an instance of the archiver and triggers the archive for that plugin (and calls setArchiveOnlyReport) should an API be specified? I reckon if that exists things could/should work.

BTW be good to also add things mentioned in https://github.com/matomo-org/matomo/pull/15889#issuecomment-626714794

@diosmosis commented on May 12th 2020 Member

Does this mean in the archive invalidate logic we'd also store an ID and then somehow instead of $plugin.get we trigger some CoreAdmin.archivePlugin(pluginName, idToArchive) method or so?

Something like this, but it's possible we invalidate an archive that doesn't exist, in which case there won't be an ID.

However if you're assuming the code will just replace the existing report in an archive, then this isn't that simple due to what we discussed in slack. If not, can you think about my comments on creating the new plugin specific archive?

@tsteur commented on May 13th 2020 Member

Something like this, but it's possible we invalidate an archive that doesn't exist, in which case there won't be an ID.

@diosmosis not sure what you mean here? I was thinking eg we could add an ID to the archive invalidate table and a plugin may provide an additional ID when invalidating a report which the CronArchive would then forward when archiving that invalidated archive entry. Not sure it's clear what I mean? We can still add a new archiveId and not alter an existing archive.

Powered by GitHub Issue Mirror