@mattab opened this Issue on August 25th 2017 Member

In Piwik, reports are processed based on the raw data. Once an archive has been processed for a website and a specific date period, and stored in the database (during the core:archive process), it can only be invalidated (via the HTTP API, the CLI, or the InvalidateReports plugin), which forces all plugins to re-process all their respective reports in the next archiving run. This is problematic because re-processing reports for the previous 3 or 12 months can take a lot of time.
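
For context, invalidation today typically goes through the core:invalidate-report-data console command or the CoreAdminHome.invalidateArchivedReports HTTP API. A minimal sketch (exact option names may vary by version; the site ID, date range, and token are placeholders):

./console core:invalidate-report-data --dates=2017-01-01,2017-06-30 --sites=1
curl "https://matomo.example.org/index.php?module=API&method=CoreAdminHome.invalidateArchivedReports&idSites=1&dates=2017-01-01&period=month&token_auth=YOUR_TOKEN"

Either call marks the affected archives as invalid for every plugin at once, which is exactly the all-or-nothing behaviour this issue wants to relax.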

-> Plugins should be able to easily create their own reports for some of the historical data. This is useful for all plugins which don't create new raw data but re-use the existing raw data. For example, the Funnels plugin may want to process Funnels data for the past 6 months.

To make it possible for any plugin to append their reports to the existing archives, we need to make some changes to the archiving process.

Ideas:

  • Piwik core to allow an existing archive to be modified later on, so a new plugin's reports can be appended to the existing old archive
  • Piwik core to allow each plugin to force re-processing of past data, and the ability to specify how far back the data should be re-processed. (This should work for our two use cases: when using the core:archive console command in a cron, and when using browser-triggered archiving and looking at historical data.)

(not related but a bit similar to https://github.com/piwik/piwik/issues/7573 )

@mattab commented on November 29th 2018 Member

When I create Custom Reports and want historic data, I need to invalidate all historical data and re-process it. This can take a very long time and is not practical. Recently we had a problem with too much data being invalidated (reported in https://github.com/innocraft/plugin-InvalidateReports/issues/8).

So it would be great to have the ability to "update" existing archives and process the data for specific plugins (eg. custom reports) without having to re-process everything else.

@mattab commented on April 2nd 2019 Member

Note: this will also be especially important and valuable for Matomo Cloud customers, where invalidating old data is currently disabled, so users who create custom reports cannot get historical data for them. Self-hosted users can at least manually invalidate reports.

@ibril15 commented on September 3rd 2019

Hi, I'm wondering if this is still planned as part of 3.12 or 3.13? Thanks.

@tsteur commented on September 3rd 2019 Member

I reckon this will be part of Matomo 4, which we might work on after 3.12, but that remains to be seen. This is definitely an issue we want to work on! In the meantime we created a command that lets you get this data quite easily. If you have access to the command line, you can simply execute ./console customreports:archive in the Matomo directory.

Here's an example

./console customreports:archive --idsites=1 --date=last100

There are various options (eg to skip segments, or archive only specific periods), and using the latest Custom Reports you can even do this for just a specific report using --idreport=X
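
For instance, combining the command above with the report-specific option (the idreport value 5 is just a placeholder for a real report ID):

./console customreports:archive --idsites=1 --date=last100 --idreport=5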

Archiving these reports for past dates is quite fast.

@siva538 commented on September 9th 2019 Contributor

@tsteur, thanks for the update.

Is the below available in 3.11?

There are various options (eg to skip segments, or archive only specific periods), and using the latest Custom Reports you can even do this for just a specific report using --idreport=X

Thanks a lot.

@tsteur commented on September 10th 2019 Member

Yes, this works with 3.11. It's in the Custom Reports plugin.

@siva538 commented on September 10th 2019 Contributor

Thank you, that helps. Can you also confirm whether this has an equivalent web API? I believe all the console commands have equivalent web-based API calls, so this could be done from the browser instead of the console.

@tsteur commented on September 10th 2019 Member

There is no API unfortunately, as this works a bit differently from the other features.

@siva538 commented on October 4th 2019 Contributor

Hello @tsteur, sorry for getting back to you on this so late.

I validated this in version 3.11, and I don't see this parameter in the help documentation.

Here is the screenshot for your reference.

[screenshot of the console output omitted]

For command #1: 3.11 is the Matomo version

For command #2: the --idreport parameter is missing from the help documentation

For command #3: Custom Reports archiving fails when --idreport is used

We have also updated to the latest version of the Custom Reports plugin (3.1.18), just to make sure we are not on an old one.

Please let me know if I am missing something here.

Thanks a lot.

@tsteur commented on October 6th 2019 Member

@siva538 it looks like you're not using the latest version of custom reports for some reason. It should definitely have the parameter.

@siva538 commented on October 7th 2019 Contributor

@tsteur, I was finally able to get the parameter working. Thank you! It turned out to be a caching issue.

Can you please confirm whether 3.1.18 of the Custom Reports plugin is required for this, or whether it is available in 3.1.15 as well?

Thanks a lot.

@tsteur commented on October 7th 2019 Member

I think it was added in 3.1.15 from what I see. We always recommend using the latest version though.

@siva538 commented on October 9th 2019 Contributor

Thank you @tsteur

@diosmosis commented on April 22nd 2020 Member

@tsteur I'm going to base this issue's solution off of #15117, and go about it like so:

  • Core archiving
    • Allow archiving single plugins (this would be outside of the "all plugins" archive, unless one exists, in which case we modify the existing archive and remove the separate plugin-specific one)
    • Allow invalidating single plugins (must include the case where an "all plugins" archive exists, but we want to invalidate just one plugin)
    • Add tests
    • Workflow: when a plugin is activated (eg Funnels), invalidate past data (use core INI config? Bounded by a hard upper limit if cron archiving is not enabled?) for that specific plugin for N days/months in the past; archiving then picks it up (see the sketch after this list).
  • Cron archiving
    • Support archiving single-plugin archives by themselves (can't go through API.get, must go through {plugin}.get)
    • Refactor the code in #15117, remove some more code from core:archive so it can be tested, and test the new code as well
  • Browser archiving
    • If an archive request for a plugin is seen and there is an invalidated plugin-specific archive, archive just the requested plugin(s)
    • Add tests

Do you see any potential issues w/ this plan?
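
To make the workflow bullet above concrete, a hypothetical command-line flow; note that the --plugin option on core:invalidate-report-data is part of this proposal, not a flag that existed at the time:

# proposed: invalidate past data for a single plugin only
./console core:invalidate-report-data --dates=2019-10-01,2020-04-01 --sites=1 --plugin=Funnels
# the next archiving run then re-processes just that plugin's archives
./console core:archive --url=https://matomo.example.org/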

@tsteur commented on April 23rd 2020 Member

@diosmosis hard to say. I suppose we'd need specific done flags for all plugins, create an archive for each plugin, and basically no longer use a generic done flag? Or plugins could "define" whether they can retrospectively aggregate data, and we'd use a specific done%s.$pluginName flag for these archives? We might not even need this, actually.

I suppose ideally this would work for any plugin where data can be generated retrospectively. Especially interesting for Funnels and Custom Reports I guess.

The goal has to be: as soon as a new custom report is created, the system notices, either through browser archiving or cron archiving, that an archive for that specific custom report is missing, and starts archiving those reports. Technically we'd even want this when a custom report is updated; then we would likely invalidate these plugin-specific archives (but not invalidate other data). These plugins generally don't have a $plugin.get API (yet).

Regarding https://github.com/matomo-org/matomo/issues/15117: maybe it would make sense to have a separate table for archive invalidations and no longer handle invalidations in the archive table directly, though it's hard to say (I suppose we'd still need a flag indicating whether a specific archive is invalid or not). I wonder if a table like

archive_invalid(idarchive, archive_table_name, archive_name)

makes sense instead of having a done flag in the archive table, but I'm not sure that would work... we'd basically assume all archives are OK unless they appear in this table, in which case the archiver knows they need to be reprocessed and the previous archive removed. That would avoid the many reads on all the archive tables. I haven't really thought it through though.

All I can say really is how it should work generally from a user point of view. Hope this helps.

@diosmosis commented on April 23rd 2020 Member

Or plugins could "define" whether they can retrospectively aggregate data, and we'd use a specific done%s.$pluginName flag for these archives? We might not even need this, actually.

This is sort of my approach, allow invalidating individual plugin archives, then plugins would just invalidate archives and they would get picked up by core:archive.

Regarding #15117: maybe it would make sense to have a separate table for archive invalidations and no longer handle invalidations in the archive table directly, though it's hard to say (I suppose we'd still need a flag indicating whether a specific archive is invalid or not). I wonder if a table like archive_invalid(idarchive, archive_table_name, archive_name) makes sense instead of having a done flag in the archive table, but I'm not sure that would work... we'd basically assume all archives are OK unless they appear in this table, in which case the archiver knows they need to be reprocessed and the previous archive removed. That would avoid the many reads on all the archive tables. I haven't really thought it through though.

This could be a good idea... though we'd also have to query this table when querying for archive data while browser archiving is enabled, since in that case we wouldn't want to use archive data that has been invalidated.

@diosmosis commented on April 23rd 2020 Member

@tsteur what do you think about using an invalidations table like:

CREATE TABLE archive_invalidations (
    idarchive INTEGER UNSIGNED NOT NULL,
    name VARCHAR(255) NOT NULL,
    idsite INTEGER UNSIGNED NULL,
    date1 DATE NULL,
    date2 DATE NULL,
    period TINYINT UNSIGNED NULL,
    ts_invalidated DATETIME NULL,
    value DOUBLE NULL,
    PRIMARY KEY(idarchive, name),
    INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_invalidated)
)

The other columns are needed in order to be able to sort the table properly w/o having to look at an archive table simultaneously. Otherwise we'd have to join on an archive table.

When browser archiving is enabled we'd have to join on this table by idarchive/name to check if an archive is invalid.
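
To illustrate that join, a sketch under the proposed schema (archive_numeric_2020_04 and the WHERE values are example placeholders):

SELECT a.idarchive, a.name, a.value
FROM archive_numeric_2020_04 a
LEFT JOIN archive_invalidations i
    ON i.idarchive = a.idarchive AND i.name = a.name
WHERE a.idsite = 1
    AND a.period = 3
    AND i.idarchive IS NULL -- keep only archives with no invalidation entry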

We could also limit the number of rows we add to this table. Eg, if there are more than 50,000 rows or something, just fail the invalidation w/ a warning asking users to run core:archive. Though this could be an issue for browser archiving... since the user might invalidate an archive and then never view it?

I guess we could do both and only add to the table if browser archiving is disabled. This would make the implementation more complicated, but might be worth it if the cost goes down?

@tsteur commented on April 23rd 2020 Member

I reckon a limit shouldn't be needed, but it would be awesome if we could show this in the UI: the number of archives that are invalidated and will need to be reprocessed, and we could even show which reports will be archived soonish.

BTW, on the name column: I suppose based on this we would then know whether we have to archive the whole site (eg done or done$segmenthash) or a specific plugin, such as done.CustomReports?

@tsteur commented on April 23rd 2020 Member

Haven't thought too much about it, but looks good. BTW, the primary key would probably be fine on idarchive alone?

@diosmosis commented on April 23rd 2020 Member

@tsteur If we allow invalidating individual reports/metrics (or just plugins), then we'd have to allow multiple idarchive/name pairs.

And there's no issue w/ still doing DONE_INVALIDATED for browser archiving? If we don't, the rows in the invalidations table could just keep building up and never be deleted.

@tsteur commented on April 23rd 2020 Member

I suppose that would be fine, considering it's currently the same behaviour (just spread across multiple tables), and we will be trying to get most users to set up cron archiving in the next few months (by improving onboarding).

This Issue was closed on August 4th 2020