
Matomo progression bar for archiving process #18486

Open
MISTAnalytics opened this issue Dec 10, 2021 · 7 comments
Labels
c: Platform For Matomo platform changes that aren't impacting any of our APIs but improve the core itself. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc. Stability For issues that make Matomo more stable and reliable to run for sys admins.

Comments

@MISTAnalytics

MISTAnalytics commented Dec 10, 2021

Challenge and feature idea

The Matomo archiver logs show whether an archiving process ended successfully; they include the site ID and segment, as well as the time it took to complete that particular part of the archiving process.

However, we have some archiving processes that last for hours. It would be great to have a way to see the current progress in the UI or in the log file: how much time has passed for each active archiving process, AND how much time Matomo expects to need before a given site ID and/or segment is processed.

In addition, a checkmark would be great that provides a sort of confirmation of whether the data of a given day or period has been fully archived and processed.

It would add a lot of confidence in the data being available in Matomo.

Possible solution

Add the proposed output (the "Time expected", "Progress", and "completely archived" parts) to the log file somehow. Below are some examples of how this might look:

INFO [2021-11-26 13:11:53] 26496 Start processing archives for site 3.
INFO [2021-11-26 13:11:53] 26496 Will invalidate archived reports for today in site ID = 3's timezone (2021-11-26 00:00:00).
INFO [2021-11-26 13:11:53] 26496 Will invalidate archived reports for yesterday in site ID = 3's timezone (2021-11-25 00:00:00).
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Time expected: 1.000s
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Progress: 45%
INFO [2021-11-26 13:11:54] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = '', 14 visits found. Time elapsed: 1.077s
INFO [2021-11-26 13:11:57] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = 'countryCode==nl', 8 visits found. Time elapsed: 1.692s
INFO [2021-11-26 13:11:57] 26496 Archived website id 3, period = day, date = 2021-11-26, segment = 'pageUrl=^https%3A%2F%2Fwww.test.nl', 5 visits found. Time elapsed: 1.692s
INFO [2021-11-26 13:12:11] 26496 Archived website id 3, period = year, date = 2021-01-01, segment = 'countryCode==nl', 4012 visits found. Time elapsed: 3.338s
INFO [2021-11-26 13:12:11] 26496 Finished archiving for site 3, 16 API requests, Time elapsed: 17.929s [3 / 9 done] Period 2021-01-01-2021-11-26 completely archived and ready for analysis.
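A log consumer could pick up the proposed fields with a simple parser. A minimal sketch in Python; note that the `Progress: 45%` and `[3 / 9 done]` formats are part of this proposal, not existing Matomo output, so the patterns below are assumptions:

```python
import re

# Patterns for the proposed log fields; this format is the one suggested
# in this issue, not guaranteed Matomo output.
PROGRESS_RE = re.compile(r"Progress: (\d+)%")
DONE_RE = re.compile(r"\[(\d+) / (\d+) done\]")

def parse_progress(line):
    """Return a progress fraction (0.0-1.0) if the line carries one, else None."""
    m = PROGRESS_RE.search(line)
    if m:
        return int(m.group(1)) / 100.0
    m = DONE_RE.search(line)
    if m:
        done, total = int(m.group(1)), int(m.group(2))
        return done / total if total else None
    return None

line1 = "INFO [2021-11-26 13:11:54] 26496 Archived website id 3, ... Progress: 45%"
line2 = "INFO [2021-11-26 13:12:11] 26496 Finished archiving for site 3, ... [3 / 9 done]"
print(parse_progress(line1))  # → 0.45
print(parse_progress(line2))  # → 0.3333333333333333
```

A monitoring tool tailing the archiver log could feed these fractions straight into a progress bar or alert.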

@MISTAnalytics MISTAnalytics added the Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc. label Dec 10, 2021
@MISTAnalytics
Author

Sometimes the archiving process takes many hours/days to archive all periods and segments (with lots of data).

In the logs, this message is shown: Finished archiving for site 3, 16 API requests, Time elapsed: 17.929s [3 / 9 done]

From the example above, this is the output shown when a given site is completed. It shows the total time it took to process that site and how many sites have been completed out of all the sites this core:archive run will process.

But if you know the average amount of time each site takes to process (for example, the average of the last five runs), you can likely make a rough estimate of how much time remains before the archiver is complete (and yes, we need to keep in mind that this can change depending on a large number of factors). We could also distinguish between a fully archived site ID and a specific segment. This would provide a lot of additional insight into the archive processing.

We might consider writing this data to a new table in the Matomo database. If this process runs multiple times and the average times are known (e.g. over the last five runs), an additional step would be to show the expected time and progress based upon the last X runs.

This would mean users are able to know when to expect the archiving to be complete, based upon a calculated estimation.
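As a rough illustration of that estimation, here is a minimal Python sketch; the function name and inputs are hypothetical (Matomo does not record this data today). It combines the observed pace of the current run with the average of previous runs:

```python
from statistics import mean

def estimate_remaining(history, sites_done, sites_total, elapsed):
    """
    Rough ETA (seconds) for a core:archive run.
    history: per-site durations (seconds) from previous runs, e.g. the last 5;
    sites_done/sites_total: the "[3 / 9 done]" counters from the log;
    elapsed: seconds spent in the current run so far.
    All names here are illustrative, not part of Matomo.
    """
    if sites_done > 0:
        per_site = elapsed / sites_done   # observed pace of this run
    elif history:
        per_site = mean(history)          # fall back to past runs
    else:
        return None                       # nothing to base an estimate on
    return per_site * (sites_total - sites_done)

# Example: 3 of 9 sites finished after 18 s -> roughly 36 s remaining.
print(estimate_remaining([5.0, 6.0, 7.0], 3, 9, 18.0))  # → 36.0
```

Per-segment estimates would work the same way, just keyed on (site ID, segment) instead of site ID alone.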

@tsteur tsteur added this to the Priority Backlog (Help wanted) milestone Dec 22, 2021
@tsteur
Member

tsteur commented Dec 22, 2021

Thanks for the feature suggestion @MISTAnalytics

I'm not quite sure about the value yet. Is the flow that someone would want to know when the reports are available, and then asks a sysadmin to check the logs when it will be available? And if that site/period they are after is currently available, then you would be able to tell based on the logs when the report will be finished?

Assuming the archiving always takes about the same amount of time and always starts around the same time, wouldn't it eventually be roughly possible to know when reports become available, even without this information in the logs?

Is this mostly interesting for certain periods like day? And maybe mostly for dates that include "Today"?

Generally, writing this data to the UI might be tricky, as there are a lot of variables involved and predicting it can be extremely hard. For example, if segments were changed or edited, it can take longer for certain data to become available.

To better understand things, can you let us know how often archiving is launched, how many sites there are, and whether there were any customisations for the [General]time_before_*_archive_considered_outdated setting?

It would be great to learn more about this.

@MISTAnalytics
Author

@tsteur Please find my response below.

I'm not quite sure about the value yet. Is the flow that someone would want to know when the reports are available, and then asks a sysadmin to check the logs when it will be available? And if that site/period they are after is currently available, then you would be able to tell based on the logs when the report will be finished?

Assuming the archiving always takes about the same amount of time and always starts around the same time, wouldn't it eventually be roughly possible to know when reports become available, even without this information in the logs?

Yes, this is the idea: to have a rough, calculated estimation of when the archiving process will be finished and the reports will be fully ready for a given period. This would be based upon five or ten previous runs of the archiver, so that it is able to 'learn' how long certain site IDs and segments take on average. Writing this to the logs is sufficient; it does not necessarily have to be in the UI.

Is this mostly interesting for certain periods like day? And maybe mostly for dates that include "Today"?

Yes, definitely day periods.

Generally, writing this data to the UI might be tricky, as there are a lot of variables involved and predicting it can be extremely hard. For example, if segments were changed or edited, it can take longer for certain data to become available.

These factors could be mentioned with a note such as: 'Be careful: there are a lot of variables which have an effect on the archiving process. Please be sure you did not change or edit any segments, dimensions, API requests, etc.'

In addition, as mentioned, it does not have to be shown in the UI. Having it in the logs is sufficient for now and already very helpful.

To better understand things, can you let us know how often archiving is launched, how many sites there are, and whether there were any customisations for the [General]time_before_*_archive_considered_outdated setting?

The archiving process is launched daily or per couple of days. There are around 50 sites of which two are consuming most of the archiving time. Yes, there are some modifications to the 'archive_considered_outdated' setting.

However, this is a feature to add in general, not to solve an issue. So it is not for this particular client.

@jorgeuos

jorgeuos commented Feb 7, 2022

This is something we've been wanting for a really long time. But I'm unsure whether MySQL/MariaDB supports progress reporting for INSERTs, so you would probably have to calculate it somehow. Maybe it would be wise to add a column with a duration_time per report, so you can get an average generation time for that specific report.

It would be useful for identifying slow queries too, e.g. when segments use "contains" on huge data sets, which translates to SELECT … FROM … WHERE something LIKE "%needle%".

Edit:
Keep metrics for your metrics 😉
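To illustrate the suggested duration_time column, here is a minimal sketch using SQLite in place of MySQL/MariaDB; the table name and schema are hypothetical and not part of Matomo:

```python
import sqlite3

# Hypothetical schema: Matomo has no such table today. The idea is to
# record a duration per archived report and average it across runs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE archive_timings (
        idsite INTEGER,
        segment TEXT,
        period TEXT,
        duration_time REAL   -- seconds the archiving of this report took
    )
""")
rows = [
    (3, "", "day", 1.1),
    (3, "", "day", 0.9),
    (3, "countryCode==nl", "day", 1.7),
    (3, "countryCode==nl", "day", 1.5),
]
conn.executemany("INSERT INTO archive_timings VALUES (?, ?, ?, ?)", rows)

# Average generation time per (site, segment, period), slowest first,
# which doubles as a list of candidates for segment optimization.
for idsite, segment, period, avg in conn.execute("""
    SELECT idsite, segment, period, AVG(duration_time)
    FROM archive_timings
    GROUP BY idsite, segment, period
    ORDER BY 4 DESC
"""):
    print(idsite, segment or "(all visits)", period, round(avg, 2))
```

Sorting by average duration would surface exactly the slow "contains" segments mentioned above.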

@gerronmulder

I can see huge benefits for having insights into the archiving process:

  • Knowing the processing time of specific segments creates awareness with Matomo users. This would enable balancing performance impact vs. usability of segments.
  • Knowing the steps taken (create tmp table, insert data, ..) by the archiving process will enable more specific optimizations, especially once it becomes easier to see how a segment is converted into a SQL statement.

For larger Matomo instances it becomes key to be able to manage the archiving process. Especially for segments.

It would already help a lot if the archiving process were extended with many more events/hooks, so that we can more easily develop a plugin to gain insight.

@tsteur
Member

tsteur commented Feb 17, 2022

@gerronmulder feel free to create a PR with the events that you need and we will be happy to review and merge. That would be great! It would be hard for us to add events without knowing whether they will actually be helpful in the end, as we wouldn't know which ones are needed exactly. If you were to work on this as part of a plugin, you could also consider creating a PR in core for the entire solution and we would review that too.

FYI here are some existing cron events: https://developer.matomo.org/api-reference/events#cronarchive

@atom-box

A customer wrote:

I want to be able to see if the calculations are finished. Then I can start evaluating the custom reports again.

The calculation can take an hour or a day, depending on the number of invalidated reports. It is not possible to predict exactly when the reports will be ready, especially if all reports of a website are invalidated for the last 24 months or longer.

Information about whether reports are still being calculated would be nice, e.g. a banner in place of the warning in the screenshot with a text like "recalculation still in progress".

@mattab mattab added c: Platform For Matomo platform changes that aren't impacting any of our APIs but improve the core itself. Stability For issues that make Matomo more stable and reliable to run for sys admins. labels Dec 11, 2023