@znerol opened this Issue on November 11th 2019 Contributor

After the update from 3.11.0 to 3.12.0 space requirements of archived data grow massively. Especially the monthly numeric as well as blob archive rows multiplied.

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_10 where idsite=11 and period=3 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|      159 |      3 | 2019-10-01 |
+----------+--------+------------+

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_11 where idsite=11 and period=3 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|    71777 |      3 | 2019-11-01 |
+----------+--------+------------+

Also the generated rows for daily archives are skyrocketing:

MariaDB [piwik]> select count(*), period, date1 from piwik_archive_numeric_2019_11 where idsite=11 and period=1 group by date1, period order by date1;
+----------+--------+------------+
| count(*) | period | date1      |
+----------+--------+------------+
|      157 |      1 | 2019-11-01 |
|      153 |      1 | 2019-11-02 |
|      145 |      1 | 2019-11-03 |
|      147 |      1 | 2019-11-04 |
|      150 |      1 | 2019-11-05 |
|      150 |      1 | 2019-11-06 |
|      681 |      1 | 2019-11-07 |
|      968 |      1 | 2019-11-08 |
|      804 |      1 | 2019-11-09 |
|      692 |      1 | 2019-11-10 |
|     3823 |      1 | 2019-11-11 |
|       24 |      1 | 2019-11-12 |
|       24 |      1 | 2019-11-13 |
+----------+--------+------------+
13 rows in set (0.02 sec)

The update was performed on November 7th in the evening.

From the changelog I gather that archiving was improved a lot since 3.11.0. Which aspect of the archiving process could possibly trigger such a massive increase in space requirements?

@znerol commented on November 11th 2019 Contributor

There is another data point. Just before the update somebody might have setup a new custom date period (appearing as last52 in the archiving logs):

# zgrep -h last52 /var/log/piwik.archiving.log* | sort | head -n10
INFO [2019-11-07 17:00:01] 67756  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 17:42:33] 67756  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 17:45:02] 79313  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:03:37] 79313  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 18:15:02] 87757  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:51:01] 87757  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 19:00:02] 98212  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 19:54:51] 98212  Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 20:00:02] 109382  Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 20:55:42] 109382  Will pre-process for website id = 16, period = day, date = last52

Is it possible that the increase in archive size is related to the new custom date period?

@znerol commented on November 11th 2019 Contributor

Is it possible that the increase in archive size is related to the new custom date period?

Nope. But there seems to be a jump from last2 to last52 for period = day at this point in time:

```INFO [2019-11-07 16:30:02] 62157 Will pre-process for website id = 11, period = day, date = last2
INFO [2019-11-07 16:31:18] 62157 Will pre-process for website id = 16, period = day, date = last2
INFO [2019-11-07 16:45:02] 65551 Will pre-process for website id = 11, period = day, date = last2
INFO [2019-11-07 16:46:24] 65551 Will pre-process for website id = 16, period = day, date = last2
INFO [2019-11-07 17:00:01] 67756 Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 17:42:33] 67756 Will pre-process for website id = 16, period = day, date = last52
INFO [2019-11-07 17:45:02] 79313 Will pre-process for website id = 11, period = day, date = last52
INFO [2019-11-07 18:03:37] 79313 Will pre-process for website id = 16, period = day, date = last52

@znerol commented on November 11th 2019 Contributor

From the logs again. After the update to 3.12.0 it looks like old reports are invalidated / recreated on every cron run with the default 52 days period.

INFO [2019-11-07 17:00:01] 67756  Starting Matomo reports archiving...
INFO [2019-11-07 17:00:01] 67756  Old report was invalidated for website id 11
INFO [2019-11-07 17:00:01] 67756  Will pre-process for website id = 11, period = day, date = last52
@znerol commented on November 11th 2019 Contributor

It looks like the archived data is invalidated on every cron run starting with 3.12.0:

zgrep -h "Will invalidate" /var/log/piwik.archiving.log* | sort |head
INFO [2019-11-07 17:00:01] 67756  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 17:45:01] 79313  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 18:15:01] 87757  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 19:00:01] 98212  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 20:00:01] 109382  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11
INFO [2019-11-07 21:00:01] 121358  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 21:45:02] 843  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-07 22:30:02] 10519  - Will invalidate archived reports for 2019-11-07 for following websites ids: 11,16
INFO [2019-11-08 00:15:02] 23662  - Will invalidate archived reports for 2019-11-08 for following websites ids: 11
INFO [2019-11-08 01:30:01] 39905  - Will invalidate archived reports for 2019-11-08 for following websites ids: 11

This did not happen before. Also I do not quite understand why the invalidation triggers archival of last52 days on every cron run.

@znerol commented on November 11th 2019 Contributor

This might be related to #14639

@diosmosis ping, do you have an idea on how to proceed with troubleshooting?

@znerol commented on November 11th 2019 Contributor

Stepping through the debugger I find the following in CronArchive.php#L870

        // when some data was purged from this website
        // we make sure we query all previous days/weeks/months
        $processDaysSince = $lastTimestampWebsiteProcessedDay;
        if ($websiteInvalidatedShouldReprocess
            // when --force-all-websites option,
            // also forces to archive last52 days to be safe
            || $this->shouldArchiveAllSites) {
            $processDaysSince = false;
        }

        $date = $this->getApiDateParameter($idSite, "day", $processDaysSince);

Since $websiteInvalidatedShouldReprocess is true, $processDaysSince is set to false. This results in turn results in self::DEFAULT_DATE_LAST being returned from getDateLastN() through getApiDateParameter() (unless the matomo site was added more recently).

If I understand things correctly, then site invalidation always was inefficient, but #14639 just revealed that by making it the default behavior.

@toredash commented on November 11th 2019 Contributor

@znerol the last2/last52 bug is already discussed here: #15086

@tsteur commented on November 11th 2019 Member
This Issue was closed on November 11th 2019
Powered by GitHub Issue Mirror