@MISTAnalytics opened this Issue on December 10th 2021

Challenge and feature idea

The Matomo archiver logs show whether an archiving process finished successfully, along with the site ID, the segment, and the time it took to complete that particular part of the archiving process.

However, some of our archiving processes run for hours, and it would be great to have a way to see the current progress in the UI or in the log file: how much time has passed for each active archiving process AND how much time Matomo expects to need before a given site ID and/or segment is processed.

In addition, a checkmark would be great as confirmation that the data of a given day or period has been fully archived and processed.

This would really add a lot of confidence that the data is available in Matomo.

Possible solution

Add extra output to the log file, as in the examples below. The new parts are the "Time expected" and "Progress" fields and the "completely archived and ready for analysis" confirmation:

INFO [2021-11-26 13:11:53] 26496 Start processing archives for site 3.
INFO [2021-11-26 13:11:53] 26496 Will invalidate archived reports for today in site ID = 3's timezone (2021-11-26 00:00:00).
INFO [2021-11-26 13:11:53] 26496 Will invalidate archived reports for yesterday in site ID = 3's timezone (2021-11-25 00:00:00).
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Time expected: 1.000s
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Progress: 45%
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Time elapsed: 1.077s
INFO [2021-11-26 13:11:57] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = 'countryCode==nl', 8 visits found. Time elapsed: 1.692s
INFO [2021-11-26 13:11:57] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = 'pageUrl=^https%3A%2F%2Fwww.test.nl', 5 visits found. Time elapsed: 1.692s
INFO [2021-11-26 13:12:11] 26496 Archived website id 3, period = year, date = 2021-01-01, segment = 'countryCode==nl', 4012 visits found. Time elapsed: 3.338s
INFO [2021-11-26 13:12:11] 26496 Finished archiving for site 3, 16 API requests, Time elapsed: 17.929s [3 / 9 done] Period 2021-01-01-2021-11-26 completely archived and ready for analysis.
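In spirit, the proposed fields could be computed by timing each job in the archiving loop. A minimal Python sketch, not Matomo code; all names (`archive_with_progress`, `expected_seconds`) are invented for illustration:

```python
import time

def archive_with_progress(jobs, archive_one, expected_seconds):
    """Run each archiving job, returning log lines with elapsed time,
    overall progress, and the expected duration for the job (if known)."""
    lines = []
    total = len(jobs)
    for done, job in enumerate(jobs, start=1):
        start = time.monotonic()
        archive_one(job)  # the actual archiving work for this site/segment
        elapsed = time.monotonic() - start
        lines.append(
            f"Archived {job}. Time elapsed: {elapsed:.3f}s "
            f"Progress: {100 * done // total}% [{done} / {total} done] "
            f"Time expected: {expected_seconds.get(job, 0.0):.3f}s"
        )
    return lines
```

The caller supplies `expected_seconds`, e.g. averages from previous runs, so the logger itself stays trivial.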

@MISTAnalytics commented on December 22nd 2021

Sometimes the archiving process takes many hours/days to archive all periods and segments (with lots of data).

In the logs, this message is shown: Finished archiving for site 3, 16 API requests, Time elapsed: 17.929s [3 / 9 done]

In the example above, this output is what is shown when a given site is completed. It shows the total time it took to process the given site and how many sites have been completed out of all the sites this core:archive run will process.

But if you know the average time each site takes to process (the average of the last five runs, for example), you can likely give a rough estimate of how much time remains before the archiver is complete (and yes, we need to keep in mind that this can change depending on quite a large number of factors). We could also make a distinction between a fully archived site ID and a specific segment. This would provide a lot of additional insight into the archive processing.

We might consider writing this data to a new table in the Matomo database. If the process runs multiple times and the average times are known (over the last five runs, for example), an additional step would be to show the expected time and progress based on the last X runs.

This would mean users are able to know when to expect the archiving to be complete, based upon a calculated estimation.
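The "learn from the last X runs" idea above can be sketched as follows. This is a hypothetical illustration, not Matomo code; the class and its names are invented, and in practice the durations would be persisted to a database table rather than kept in memory:

```python
from collections import deque

class ArchiveEta:
    """Track durations of past archiving runs per (site_id, segment)
    and estimate remaining time from their average."""

    def __init__(self, history=5):
        # Keep only the last `history` durations per key.
        self.durations = {}
        self.history = history

    def record(self, site_id, segment, seconds):
        key = (site_id, segment)
        d = self.durations.setdefault(key, deque(maxlen=self.history))
        d.append(seconds)

    def estimate(self, site_id, segment):
        """Average duration over the recorded runs, or None if never seen."""
        d = self.durations.get((site_id, segment))
        return sum(d) / len(d) if d else None

    def remaining(self, pending):
        """Estimated seconds left for pending (site_id, segment) pairs;
        pairs with no history simply contribute nothing to the estimate."""
        estimates = (self.estimate(s, seg) for s, seg in pending)
        return sum(e for e in estimates if e is not None)
```

A never-seen site yields no estimate, which matches the caveat above: the prediction is only as good as the history it is based on.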

@tsteur commented on December 22nd 2021 Member

Thanks for the feature suggestion @MISTAnalytics

I'm not quite sure about the value yet. Is the flow that someone would want to know when the reports are available, and then asks a sysadmin to check the logs to see when that will be? And if the site/period they are after isn't currently available, they would be able to tell from the logs when the report will be finished?

Assuming the archiving always takes about the same time and always starts around the same time, wouldn't it eventually be roughly possible to know when reports become available, even without this information in the logs?

Is this mostly interesting for certain periods like day? And maybe mostly for dates that include "Today"?

Generally, writing this data to the UI might be tricky as there are a lot of variables involved and predicting it can be extremely hard. For example, if segments were changed or edited, it can take longer for certain data to become available.

To better understand things, can you let us know how often archiving is launched, how many sites there are, and whether there were any customisations for the [General]time_before_*_archive_considered_outdated setting?

Be great to learn more about this.

@MISTAnalytics commented on January 7th 2022

@tsteur Please find my response below.

> I'm not quite sure about the value yet. Is the flow that someone would want to know when the reports are available, and then asks a sysadmin to check the logs to see when that will be? And if the site/period they are after isn't currently available, they would be able to tell from the logs when the report will be finished?
>
> Assuming the archiving always takes about the same time and always starts around the same time, wouldn't it eventually be roughly possible to know when reports become available, even without this information in the logs?

Yes, this is the idea: to have a rough, calculated estimate of when the archiving process will finish and the reports will be fully ready for a given period. This would be based on the last five or ten runs of the archiver, so that it can 'learn' how long certain site IDs and segments take on average. Writing this to the logs is sufficient; it does not necessarily have to be in the UI.

> Is this mostly interesting for certain periods like day? And maybe mostly for dates that include "Today"?

Yes, definitely day periods.

> Generally, writing this data to the UI might be tricky as there are a lot of variables involved and predicting it can be extremely hard. For example, if segments were changed or edited, it can take longer for certain data to become available.

These factors could be mentioned with a note such as: 'Be careful, there are a lot of variables which affect the archiving process. Please be sure you did not change or edit any segments, dimensions, API requests, etc.'

In addition, as mentioned, it does not have to be shown in the UI. Having it at least in the logs is sufficient for now and already very helpful.

> To better understand things, can you let us know how often archiving is launched, how many sites there are, and whether there were any customisations for the [General]time_before_*_archive_considered_outdated setting?

The archiving process is launched daily or every couple of days. There are around 50 sites, two of which consume most of the archiving time. Yes, there are some modifications to the 'archive_considered_outdated' setting.

However, this is a feature to add in general, not a fix for a specific issue, so it is not tied to this particular client.

@jorgeuos commented on February 7th 2022

This is something we've been wanting for a really long time. But I'm unsure whether MySQL/MariaDB supports progress reporting for INSERTs, so you would probably have to calculate it somehow. Maybe it would be wise to add a duration_time column per report, so you can get an average generation time for that specific report.
It would be useful for identifying slow queries too, e.g. segments that use "contains" on huge data sets, which translates to SELECT FROM WHERE something LIKE "%needle%".

Edit:
Keep metrics for your metrics 😉
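The duration_time idea can be sketched with SQLite standing in for MySQL/MariaDB. The `report_timing` schema below is invented for illustration and is not Matomo's actual archive schema:

```python
import sqlite3

# In-memory database standing in for the Matomo DB.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE report_timing (
        site_id       INTEGER,
        report        TEXT,
        duration_time REAL   -- seconds this report took to generate
    )
""")

# Record one duration per generated report, per run.
con.executemany(
    "INSERT INTO report_timing VALUES (?, ?, ?)",
    [(3, "VisitsSummary", 1.1), (3, "VisitsSummary", 0.9), (3, "Actions", 4.0)],
)

# Average generation time per report for site 3.
rows = con.execute("""
    SELECT report, AVG(duration_time)
    FROM report_timing
    WHERE site_id = 3
    GROUP BY report
    ORDER BY report
""").fetchall()
```

The same AVG-per-report query also surfaces the slow reports (e.g. those built from "contains" segments) without any extra tooling.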

@gerronmulder commented on February 17th 2022

I can see huge benefits for having insights into the archiving process:

  • Knowing the processing time of specific segments creates awareness with Matomo users. This would enable balancing performance impact vs. usability of segments.
  • Knowing the steps taken (create tmp table, insert data, ..) by the archiving process will enable more specific optimizations, particularly once it becomes easier to see how a segment is converted to a SQL statement.

For larger Matomo instances it becomes key to be able to manage the archiving process. Especially for segments.

It would already help a lot if the archiving process were extended with many more events/hooks, so that we can more easily develop a plugin to gain insight.

@tsteur commented on February 17th 2022 Member

@gerronmulder feel free to create a PR with the events that you need and we will be happy to review and merge. That'd be great! It'd be hard for us to add events without knowing whether they will actually be helpful in the end, as we wouldn't know exactly which ones are needed. If you were to work on this as part of a plugin, you could also consider creating a PR in core for the entire solution and we would review that too.

FYI here are some existing cron events: https://developer.matomo.org/api-reference/events#cronarchive
