
Massive growth of archive data after update to 3.12.0 #15145

Closed
znerol opened this issue Nov 11, 2019 · 8 comments
Labels
duplicate For issues that already existed in our issue tracker and were reported previously.

Comments

@znerol (Contributor) commented Nov 11, 2019

After the update from 3.11.0 to 3.12.0, the space requirements of archived data grew massively. In particular, the number of rows in the monthly numeric and blob archive tables multiplied.

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_10 where idsite=11 and period=3 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|      159 |      3 | 2019-10-01 |
+----------+--------+------------+

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_11 where idsite=11 and period=3 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|    71777 |      3 | 2019-11-01 |
+----------+--------+------------+
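To put the two counts in perspective, the month-over-month growth factor for the period=3 rows can be computed directly from the numbers above:

```shell
# Growth factor of monthly (period=3) archive rows, using the counts above:
# 159 rows in piwik_archive_numeric_2019_10 vs 71777 rows in piwik_archive_numeric_2019_11.
echo $(( 71777 / 159 ))   # prints 451, i.e. roughly a 450-fold increase
```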

The number of rows generated for daily archives is also skyrocketing:

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_11 where idsite=11 and period=1 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|      157 |      1 | 2019-11-01 |
|      153 |      1 | 2019-11-02 |
|      145 |      1 | 2019-11-03 |
|      147 |      1 | 2019-11-04 |
|      150 |      1 | 2019-11-05 |
|      150 |      1 | 2019-11-06 |
|      681 |      1 | 2019-11-07 |
|      968 |      1 | 2019-11-08 |
|      804 |      1 | 2019-11-09 |
|      692 |      1 | 2019-11-10 |
|     3823 |      1 | 2019-11-11 |
|       24 |      1 | 2019-11-12 |
|       24 |      1 | 2019-11-13 |
+----------+--------+------------+
13 rows in set (0.02 sec)

The update was performed on November 7th in the evening.

From the changelog I gather that archiving was improved a lot since 3.11.0. Which aspect of the archiving process could possibly trigger such a massive increase in space requirements?

@znerol (Contributor, Author) commented Nov 11, 2019

There is another data point. Just before the update, somebody might have set up a new custom date period (it appears as last52 in the archiving logs):

# zgrep -h last52 /var/log/piwik.archiving.log* | sort | head -n10
INFO [2019-11-07 17:00:01] 67756  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 17:42:33] 67756  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 17:45:02] 79313  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:03:37] 79313  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 18:15:02] 87757  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:51:01] 87757  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 19:00:02] 98212  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 19:54:51] 98212  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 20:00:02] 109382  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 20:55:42] 109382  Will pre-process for website id = 16, period = day, date = last52

Is it possible that the increase in archive size is related to the new custom date period?

@znerol (Contributor, Author) commented Nov 11, 2019

Is it possible that the increase in archive size is related to the new custom date period?

Nope. But there seems to be a jump from last2 to last52 for period = day at this point in time:

INFO [2019-11-07 16:31:18] 62157  Will pre-process for website id = 16, period = day, date = last2
INFO [2019-11-07 16:45:02] 65551  Will pre-process for website id = 11, period = day, date = last2
INFO [2019-11-07 16:46:24] 65551  Will pre-process for website id = 16, period = day, date = last2
INFO [2019-11-07 17:00:01] 67756  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 17:42:33] 67756  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 17:45:02] 79313  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:03:37] 79313  Will pre-process for website id = 16, period = day, date = last52

@znerol (Contributor, Author) commented Nov 11, 2019

Looking at the logs again: after the update to 3.12.0, it looks like old reports are invalidated and recreated on every cron run, using the default 52-day period.

INFO [2019-11-07 17:00:01] 67756  Starting Matomo reports archiving...
INFO [2019-11-07 17:00:01] 67756  Old report was invalidated for website id 11
INFO [2019-11-07 17:00:01] 67756  Will pre-process for website id = 11, period = day, date = last52

@znerol (Contributor, Author) commented Nov 11, 2019

It looks like the archived data is invalidated on every cron run starting with 3.12.0:

zgrep -h "Will invalidate" /var/log/piwik.archiving.log* | sort | head
INFO [2019-11-07 17:00:01] 67756  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 17:45:01] 79313  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 18:15:01] 87757  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 19:00:01] 98212  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 20:00:01] 109382  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11
INFO [2019-11-07 21:00:01] 121358  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 21:45:02] 843  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 22:30:02] 10519  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-08 00:15:02] 23662  - Will invalidate archived reports for 2019-11-08 for following websites ids: 11
INFO [2019-11-08 01:30:01] 39905  - Will invalidate archived reports for 2019-11-08 for following websites ids: 11

This did not happen before. Also, I do not quite understand why the invalidation triggers archiving of the last 52 days on every cron run.
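Tallying those log lines per calendar day makes the pattern obvious: every run invalidates again. A small self-contained sketch (sample lines copied from the excerpt above; in practice you would pipe the zgrep output in):

```shell
# Count "Will invalidate" log lines per calendar day. With healthy behavior you
# would expect at most one invalidation per day, not one per cron run.
invalidations_per_day=$(grep 'Will invalidate' <<'EOF' | cut -d'[' -f2 | cut -d' ' -f1 | sort | uniq -c
INFO [2019-11-07 17:00:01] 67756  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 17:45:01] 79313  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-08 00:15:02] 23662  - Will invalidate archived reports for 2019-11-08 for following websites ids: 11
EOF
)
echo "$invalidations_per_day"
# prints:
#       2 2019-11-07
#       1 2019-11-08
```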

@znerol (Contributor, Author) commented Nov 11, 2019

This might be related to #14639

@diosmosis ping, do you have an idea on how to proceed with troubleshooting?

@znerol (Contributor, Author) commented Nov 11, 2019

Stepping through with the debugger, I find the following in CronArchive.php#L870:

        // when some data was purged from this website
        // we make sure we query all previous days/weeks/months
        $processDaysSince = $lastTimestampWebsiteProcessedDay;
        if ($websiteInvalidatedShouldReprocess
            // when --force-all-websites option,
            // also forces to archive last52 days to be safe
            || $this->shouldArchiveAllSites) {
            $processDaysSince = false;
        }

        $date = $this->getApiDateParameter($idSite, "day", $processDaysSince);

Since $websiteInvalidatedShouldReprocess is true, $processDaysSince is set to false. This in turn results in self::DEFAULT_DATE_LAST being returned from getDateLastN() via getApiDateParameter() (unless the Matomo site was added more recently).

If I understand things correctly, site invalidation was always inefficient, and #14639 merely exposed that by making invalidation the default behavior.
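The decision quoted above boils down to a small rule. A hypothetical model, not Matomo code (the function and parameter names here merely mirror the PHP variables, and last52 stands in for self::DEFAULT_DATE_LAST):

```shell
# Hypothetical model of the date-parameter decision in the snippet above.
# If the website was invalidated, or --force-all-websites is set, the
# last-processed timestamp is discarded and the default last52 range is
# re-archived from scratch on every run.
process_days_since() {
  invalidated=$1; force_all=$2; last_ts=$3
  if [ "$invalidated" = "1" ] || [ "$force_all" = "1" ]; then
    echo "last52"   # stand-in for self::DEFAULT_DATE_LAST
  else
    # otherwise only the days since the last successful run are reprocessed
    days=$(( ($(date +%s) - last_ts) / 86400 + 1 ))
    echo "last$days"
  fi
}

process_days_since 1 0 "$(date +%s)"   # invalidated site: prints last52
process_days_since 0 0 "$(date +%s)"   # recent clean run: prints last1
```

This matches the observed jump from last2 to last52: once invalidation happens on every cron run, the cheap "days since last run" path is never taken.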

@toredash (Contributor)

@znerol the last2/last52 bug is already discussed here: #15086

@Findus23 Findus23 added the c: Performance For when we could improve the performance / speed of Matomo. label Nov 11, 2019
@tsteur (Member) commented Nov 11, 2019

Closing it as duplicate of #15145 and #10439

@tsteur tsteur closed this as completed Nov 11, 2019
@tsteur tsteur added the answered and duplicate labels and removed the c: Performance and answered labels Nov 11, 2019