@mkindred opened this Issue on May 20th 2021

This does involve the Custom Reports plugin, but the issue occurs during archiving, so I'm not sure whether the culprit is Custom Reports, Goals, or core.

Expected Behavior

Archiving processes run from hourly cron should complete.

Current Behavior

Several archive processes started by the hourly cron run never complete. They ran for hundreds of hours or until I killed them manually. Since this first started happening, my archive tables have grown quite large and are full of duplicates.

The two archive processes in question were yearly archives for custom reports, 2020-01. Assuming that the requestedReport parameter in the archive command corresponds to the custom report ID, it seems that both of the custom reports in question are fairly new. My guess is that I created the new custom reports and then invalidated six months of archives, which would overlap into 2020. At that point, I'm guessing that the 2020 yearly archives would run again. But I don't understand why they wouldn't complete.

Possible Solution

To stop the archive tables from growing and prevent the server from running out of resources, I deleted the two custom reports in question. That seems to have solved the initial problem.

Steps to Reproduce (for Bugs)

I'm not sure that this is easily reproducible. I want to try to wrangle all the archive tables back down to manageable levels before I attempt it. But here is what I remember of my process.

  1. Set up a goal
  2. Create a custom report including the goal
  3. Invalidate archives for a period of time that overlaps with a year in which the goal and custom report didn't exist.

Your Environment

  • Matomo Version: 4.2.1
  • PHP Version: 7.3.28, cli memory_limit => -1
  • Server Operating System: Debian 9
  • Additionally installed plugins: CustomReports (v4.0.7), DisableTracking (v1.2.0), InvalidateReports (v4.0.1), MarketingCampaignsReporting (v4.0.4)

Matomo settings

  • Never deleting old raw data
  • Never deleting old aggregated report data
  • Cron archive once per hour

FWIW site 1 gets 500k pageviews / year, and site 2 gets 2.5 million pageviews / year.

Questions

Do my steps to reproduce above point to an obvious error in my process? Ideally I'd like to create the custom reports and invalidate archives again, but I don't want to risk having the same thing happen.

Now that I've deleted the custom reports and things are running OK, I have very large archive tables:

'matomo_archive_blob_2020_01', '3,428 MB', '13,947,541'
'matomo_archive_numeric_2020_01', '1,499 MB', '7,999,953'
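For reference, sizes like these can be read from information_schema (note that TABLE_ROWS is only an estimate for InnoDB tables, so an exact figure needs COUNT(*)):

```sql
-- Table size and approximate row count for the 2020_01 archive tables.
-- TABLE_ROWS is an estimate for InnoDB; use COUNT(*) for exact counts.
SELECT TABLE_NAME,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS size_mb,
       TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME LIKE 'matomo_archive%2020_01';
```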

I seem to have millions of duplicate rows in these two tables. I ran the following to find duplicates:

SELECT
  idsite, date1, date2, period, name,
  COUNT(*) as count
FROM matomo_archive_blob_2020_01
GROUP BY idsite, date1, date2, period, name
HAVING count > 1 ORDER BY count DESC;

... resulting in the following (truncated for brevity):

'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsCategory', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsCategory_Cart', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsName', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsName_Cart', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsSku', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goals_ItemsSku_Cart', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goal_days_until_conv', '49523'
'1', '2020-01-06', '2020-01-06', '1', 'Goal_visits_until_conv', '49523'
'1', '2020-01-07', '2020-01-07', '1', 'Goals_ItemsCategory', '49519'
'1', '2020-01-07', '2020-01-07', '1', 'Goals_ItemsCategory_Cart', '49519'
'1', '2020-01-07', '2020-01-07', '1', 'Goals_ItemsName', '49519'
...etc
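As a follow-up, an aggregate along these lines gives the total number of surplus rows those duplicates represent:

```sql
-- Total surplus rows: everything beyond the first copy of each
-- logical (idsite, date1, date2, period, name) archive row.
SELECT SUM(cnt - 1) AS surplus_rows
FROM (
  SELECT COUNT(*) AS cnt
  FROM matomo_archive_blob_2020_01
  GROUP BY idsite, date1, date2, period, name
) AS t;
```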

I've brought the code and database down into a local Docker environment to test whether I can purge. But I'm running into lots of GROUP_CONCAT errors (GROUP_CONCAT cut off the query result, you may have to purge archives again). Based on the number of duplicate rows above, I'm assuming this is trying to build an SQL statement involving roughly 49,500 rows?

Do I need to change my MySQL settings in order to allow the purging and optimizing to continue?

@diosmosis commented on May 20th 2021 Member

@mkindred this sounds like it would be fixed by https://github.com/matomo-org/matomo/pull/17244, which is included in 4.3.0. Can you see if applying this change then running core:purge-old-archive-data helps? (Ideally if you could, updating to 4.3.0 would be better, as it has many other fixes around archiving.)

@mkindred commented on May 21st 2021

Yep, I'll upgrade. I wanted to hold off on upgrading until I had stopped the table growth.

Just to help with my understanding, can you confirm:

The automatic purging of invalidated archives doesn't happen until the archive in question completes, correct? I assume this also applies to manual purging: those tables aren't going to shrink if there's no completed archive for the same site and period.

Also, will the upgrade to 4.3 help with the GROUP_CONCAT errors, or will I need to change my.conf to help these purges?

@diosmosis commented on May 21st 2021 Member

@mkindred

The automatic purging of invalidated archives doesn't happen until the archive in question completes, correct?

The purging process deletes archives that are older than the newest ones. The newest ones do not get deleted, and the newest one has to be marked as complete to be considered the newest.
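To illustrate the rule above (the column values here are Matomo internals and assumptions on my part: period 4 for "year", done value 1 for a completed archive), only archives with a ts_archived older than this would be eligible for deletion:

```sql
-- Find the newest completed yearly archive for site 1; archives
-- older than this timestamp are candidates for purging.
SELECT MAX(ts_archived) AS newest_complete
FROM matomo_archive_numeric_2020_01
WHERE idsite = 1
  AND period = 4              -- 4 = year (assumed period mapping)
  AND name LIKE 'done%'
  AND value = 1;              -- 1 = assumed "complete" done flag
```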

Also, will the upgrade to 4.3 help with the GROUP_CONCAT errors, or will I need to change my.conf to help these purges?

It should help: the code will try to set the session group_concat_max_len before the query. If that works for the MySQL user you're using, it will make the problem much less likely to occur. However, since your tables are already rather large, you may need to purge manually a couple of times to get things under control first.
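For reference, the session-level setting the code attempts looks like this (the 16 MB value is illustrative; it just needs to be large enough for the concatenated list of archive IDs):

```sql
-- Raise the per-session limit before purging.
SET SESSION group_concat_max_len = 16777216;
-- Or set it server-wide (requires the appropriate privilege,
-- or the equivalent line in my.cnf under [mysqld]):
SET GLOBAL group_concat_max_len = 16777216;
```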

@mkindred commented on May 27th 2021

Update:

  • I upgraded Matomo to 4.3.0 in my local docker environment.
  • I deleted the two custom reports. This seems to have allowed my archiving to complete.
  • Once I saw that, I deleted the same two reports from production, and archiving is completing there, as well. That's good to see.

But I still have millions of duplicate rows in the archive tables associated with the 2020-01 yearly archives:

MariaDB [php]> SELECT
    ->   idsite, date1, date2, period, name,
    ->   COUNT(*) as count
    -> FROM matomo_archive_blob_2020_01
    -> GROUP BY idsite, date1, date2, period, name
    -> HAVING count > 1 ORDER BY count DESC
    -> LIMIT 25;
+--------+------------+------------+--------+--------------------------+-------+
| idsite | date1      | date2      | period | name                     | count |
+--------+------------+------------+--------+--------------------------+-------+
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsCategory      | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsCategory_Cart | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsName          | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsName_Cart     | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsSku           | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsSku_Cart      | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goal_days_until_conv     | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goal_visits_until_conv   | 49523 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsCategory      | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsCategory_Cart | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsName          | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsName_Cart     | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsSku           | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsSku_Cart      | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goal_days_until_conv     | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goal_visits_until_conv   | 49519 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsCategory      | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsCategory_Cart | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsName          | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsName_Cart     | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsSku           | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsSku_Cart      | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goal_days_until_conv     | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goal_visits_until_conv   | 49511 |
|      1 | 2020-01-09 | 2020-01-09 |      1 | Goals_ItemsCategory      | 49505 |
+--------+------------+------------+--------+--------------------------+-------+
25 rows in set (2 min 47.919 sec)

Attempting to purge still dies:

wodby@php.container:/var/www/html $ time /usr/local/bin/php /var/www/html/public/console core:purge-old-archive-data all;
Purging outdated archives for 2008_07...Done. [Time elapsed: 0.003s]
Purging outdated archives for 2008_08...Done. [Time elapsed: 0.000s]
Purging outdated archives for 2008_09...Done. [Time elapsed: 0.000s]
...snip....
Purging invalidated archives for 2019_11...Done. [Time elapsed: 0.133s]
Purging invalidated archives for 2019_12...Done. [Time elapsed: 0.135s]
Purging invalidated archives for 2020_01...INFO [2021-05-27 18:09:56] 2184  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-05-27 18:09:56] 2184  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-05-27 18:09:56] 2184  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-05-27 18:09:57] 2184  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-05-27 18:09:57] 2184  GROUP_CONCAT cut off the query result, you may have to purge archives again.
...etc.

MariaDB group_concat_max_len:

MariaDB [(none)]> SHOW VARIABLES LIKE '%group_concat_max_len%';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| group_concat_max_len | 1048576 |
+----------------------+---------+
1 row in set (0.001 sec)

@diosmosis commented on May 27th 2021 Member

@mkindred Does the count of duplicates stay the same after running the purge command? Or better, does the count of total rows in the archive table decrease at all?

@mkindred commented on May 27th 2021

The duplicate rows remain the same after each purge. The total row counts for matomo_archive_blob_2020_01 and matomo_archive_numeric_2020_01 seem to fluctuate up a bit, then back down to 12.2M rows.

@diosmosis commented on May 27th 2021 Member

@mkindred I guess there are just far too many duplicate archives for the current code to handle... Would you be able to apply this patch: https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives?expand=1 then try and purge again?

@mkindred commented on August 5th 2021

Apologies for dropping out for a bit. Work and life got busy.

I have updated to 4.4.1, and I've applied the patch from https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives?expand=1

There is no change. I still have millions of duplicate rows, and when I run core:purge-old-archive-data all, I get 118 GROUP_CONCAT warnings while purging invalidated archives for 2020_01.

Is there any sort of diagnostic output I can send to help troubleshoot?

@diosmosis commented on August 5th 2021 Member

@mkindred did the amount of duplicate rows lower after running the command?

EDIT: also what exact warnings do you see?

@mkindred commented on August 5th 2021

No, the duplicate rows seem to have remained the same:

MariaDB [php]> SELECT
    ->   idsite, date1, date2, period, name,
    ->   COUNT(*) as count
    -> FROM matomo_archive_blob_2020_01
    -> GROUP BY idsite, date1, date2, period, name
    -> HAVING count > 1 ORDER BY count DESC
    -> LIMIT 25;
+--------+------------+------------+--------+--------------------------+-------+
| idsite | date1      | date2      | period | name                     | count |
+--------+------------+------------+--------+--------------------------+-------+
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsCategory      | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsCategory_Cart | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsName          | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsName_Cart     | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsSku           | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goals_ItemsSku_Cart      | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goal_days_until_conv     | 49523 |
|      1 | 2020-01-06 | 2020-01-06 |      1 | Goal_visits_until_conv   | 49523 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsCategory      | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsCategory_Cart | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsName          | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsName_Cart     | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsSku           | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goals_ItemsSku_Cart      | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goal_days_until_conv     | 49519 |
|      1 | 2020-01-07 | 2020-01-07 |      1 | Goal_visits_until_conv   | 49519 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsCategory      | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsCategory_Cart | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsName          | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsName_Cart     | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsSku           | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goals_ItemsSku_Cart      | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goal_days_until_conv     | 49511 |
|      1 | 2020-01-08 | 2020-01-08 |      1 | Goal_visits_until_conv   | 49511 |
|      1 | 2020-01-09 | 2020-01-09 |      1 | Goals_ItemsCategory      | 49505 |
+--------+------------+------------+--------+--------------------------+-------+
25 rows in set (2 min 57.524 sec)

The errors during /usr/local/bin/php /var/www/html/public/console core:purge-old-archive-data all; are:

Purging invalidated archives for 2019_09...Done. [Time elapsed: 0.146s]
Purging invalidated archives for 2019_10...Done. [Time elapsed: 0.148s]
Purging invalidated archives for 2019_11...Done. [Time elapsed: 0.136s]
Purging invalidated archives for 2019_12...Done. [Time elapsed: 0.144s]
Purging invalidated archives for 2020_01...INFO [2021-08-05 19:15:42] 129  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-08-05 19:15:42] 129  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-08-05 19:15:42] 129  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-08-05 19:15:43] 129  GROUP_CONCAT cut off the query result, you may have to purge archives again.
INFO [2021-08-05 19:15:43] 129  GROUP_CONCAT cut off the query result, you may have to purge archives again.
...snip: the same GROUP_CONCAT warning repeats many more times....
Done. [Time elapsed: 131.573s]
Purging invalidated archives for 2020_02...Done. [Time elapsed: 17.761s]
Purging invalidated archives for 2020_03...Done. [Time elapsed: 6.507s]
Purging invalidated archives for 2020_04...Done. [Time elapsed: 4.443s]
Purging invalidated archives for 2020_05...Done. [Time elapsed: 5.042s]

@diosmosis commented on August 5th 2021 Member

@mkindred on this line: https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives?expand=1#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R119 can you add:

$this->logger->info("Found {archiveIdCount} archives to purge", ['archiveIdCount' => count($archiveIds)]);

Then run the purge and see what shows up?

@mkindred commented on August 5th 2021

Attached is the output of the purge command after adding that line.

piwik_purge_2108052011.log

@diosmosis commented on August 5th 2021 Member

@mkindred ok, that's very confusing. Would you be able to add another log here: https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives?expand=1#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61L83

$this->logger->info("row: {row}", ['row' => print_r($row, true)]);

and run the command again?

@mkindred commented on August 5th 2021

It takes 30 minutes to run the purge all. Should I add a parameter to narrow that down to a timeframe?

ETA: I am going to run /usr/local/bin/php /var/www/html/public/console core:purge-old-archive-data 2020-01-01

@mkindred commented on August 5th 2021

Output of /usr/local/bin/php /var/www/html/public/console core:purge-old-archive-data 2020-01-01:

piwik_purge_2108052043.log

@diosmosis commented on August 5th 2021 Member

@mkindred Ok, that provides some context. Can you run the following queries?

SELECT COUNT(*) FROM archive_numeric_2020_01 WHERE name LIKE 'done%' and value = 5;
SELECT COUNT(*) FROM archive_numeric_2020_02 WHERE name LIKE 'done%' and value = 5;
SELECT DISTINCT name FROM archive_blob_2020_01 WHERE idarchive IN (3139154, 3042212, 2847113);

I've also updated the branch with another change: https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives#diff-732c8b0c81b3c11bec111dfb48dbb289e3b2232428d012dd94cad00634c35d61R105-R112 . Can you make this change and run the purge again on a single month?

@mkindred commented on August 6th 2021

@diosmosis I reverted back to 4.4.1 and applied https://github.com/matomo-org/matomo/compare/large-amount-of-purgable-archives.patch

sql queries: sql_query_output.txt

Purge log /usr/local/bin/php /var/www/html/public/console core:purge-old-archive-data 2020-01-01: piwik_purge_2108061553.log

@diosmosis commented on August 6th 2021 Member

@mkindred From the data, I think you're being affected by the after-effects of a couple of bugs fixed in 4.3.

Can you answer these questions:

  1. In the past were there many edits to a Funnel entity?
  2. Do you still have raw log data from 2020-01-01 onwards?

@mkindred commented on August 6th 2021

I can't recall ever using funnels, except to test them out when I first installed Piwik.

The issues (unending archive processes and growing archive tables) began shortly after I created a couple of custom reports (one of which involved goals, if memory serves) and then invalidated reports for a timeframe of roughly six months (approx. 10-2020 through 04-2021).

The archive processes began running normally (and the tables stopped growing) only after I deleted the two custom reports.

Yes, I still have all raw data from 09-2019 forward. And I'm currently running my purge tests on a local docker instance.

@mkindred commented on August 6th 2021

Side note: I stopped automatic deletion of raw data after I accidentally invalidated reports for a period that included a time without raw data, which obviously left a big hole in my reports. Luckily I was able to piece it back together from various backups, but it made me cautious about deleting raw data.

@diosmosis commented on August 6th 2021 Member

@mkindred ok, the easiest fix would be to just delete/truncate the archive tables, invalidate the dates, archive again, then check the reports, eg:

  1. delete/truncate archive_*_2020_01
  2. ./console core:invalidate-report-data --sites={idSite} --periods=day --dates=2020-01-01,2020-01-31
  3. ./console core:archive
  4. then load reports in the UI and ensure report data is there
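Step 1, spelled out against the table names used earlier in this thread (assuming the matomo_ prefix), would be something like:

```sql
-- Step 1 for January 2020. Back up first; TRUNCATE is not reversible.
TRUNCATE TABLE matomo_archive_blob_2020_01;
TRUNCATE TABLE matomo_archive_numeric_2020_01;
```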

This could be done on your test instance first, then, after a backup is made of the extremely large database (just in case), done on production.

If for some reason the tables can't be deleted (for example, because you don't have raw data for them), then it's a bit more complicated, but still doable.

@mkindred commented on August 6th 2021

@diosmosis Makes sense, but I have a couple questions before I do that.

Setting aside what actions might have triggered this issue, do you think it's fair to say that the biggest hindrance to reducing the bloat has been the sheer number of duplicate rows?

Since I have raw data for all of 2020 and 2021, I'm considering removing all archive tables from 2020 and 2021 and invalidating --dates=2020-01-01,2021-08-06 just to get everything back to normal. Do you see a problem with doing this?

Should I remove the patch(es) before invalidating and archiving?

@diosmosis commented on August 6th 2021 Member

Setting aside what actions might have triggered this issue, do you think it's fair to say that the biggest hindrance to reducing the bloat has been the sheer number of duplicate rows?

Short answer "yes". Long answer, there are two issues at play:

  1. There are a ton of archives marked as "partial" that shouldn't be. Partial archives are supposed to contain one or a few reports and are created automatically when someone changes an entity. For example, you change a custom report and 6 months of past data are invalidated, but just for that report. When they are re-archived, we call the results "partial archives". The implication is that, since a partial archive only contains some data, we can't delete it if it is newer than the most recent archive with all report data. In your database there are many archives marked as partial but containing every report. They have all the data needed for a normal "all plugins" archive, but since they are marked incorrectly we think they only have some of the data and can't delete them.

  2. Some partial archives can be deleted, but for some reason they are simply not being queried correctly. This is what causes the GROUP_CONCAT warnings, and that warning should mean some of them were found and could be deleted; yet from the logs we added, nothing is being deleted. This is probably caused by the above as well: there are too many archives like the ones described getting in the way of detecting the older ones. That is, the result set (ordered by archive creation date, descending) looks like: [seemingly valid partial archives..., single non-partial archive, cut-off query result where the deletable archives would be].
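A diagnostic along these lines can surface the mislabelled archives (treating done value 5 as "partial" is an assumption based on Matomo 4's archive-status constants):

```sql
-- Find archives flagged as partial (value = 5, assumed DONE_PARTIAL)
-- that nonetheless contain many distinct reports, i.e. likely
-- mislabelled full archives.
SELECT n.idarchive, COUNT(DISTINCT b.name) AS report_count
FROM matomo_archive_numeric_2020_01 n
JOIN matomo_archive_blob_2020_01 b USING (idarchive)
WHERE n.name LIKE 'done%'
  AND n.value = 5
GROUP BY n.idarchive
ORDER BY report_count DESC
LIMIT 10;
```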

Since I have raw data for all of 2020 and 2021, I'm considering removing all archive tables from 2020 and 2021 and invalidating --dates=2020-01-01,2021-08-06 just to get everything back to normal. Do you see a problem with doing this?

It might be a good idea to test the process first on one month, just since re-archiving everything would take a while, but if you have the raw data, it shouldn't be an issue. Archive data is just computed from raw data so in most cases it's meant to be deletable.

Should I remove the patch(es) before invalidating and archiving?

It's not strictly necessary, but probably a good idea. I'll be creating an official change that will get reviewed next week.

@mkindred commented on August 6th 2021

@diosmosis Thanks for all your help. I was able to delete, invalidate, and archive for 2020, and then also for 2020-2021 on my local instance. Now I'll test on a staging site prior to trying the same on production.

Powered by GitHub Issue Mirror