@mattab opened this Issue on October 5th 2020 Member

Reported by a customer in https://forum.matomo.org/t/form-analytics-doesnt-cleanup-its-old-form-field-form-page-log-data/38567:

I’m using Matomo (3.14.0) with Form Analytics (3.1.28). Both “delete old raw data” and “delete old aggregated report data” policies are in place (with core:archive cron job). Still, in practice, I see that Form Analytics purges records from the log_form db table only, leaving log_form_field and log_form_page tables filled with all the historical data from the beginning of time.
I can confirm that old records are correctly purged from log_form but not from log_form_page/log_form_field tables, matching them by idvisit/idlogform. How can I clean that old data and prevent its accumulation in the future?

We would expect log_form_page and log_form_field to be purged as well.

From @tsteur: this is supposed to be done by core, but it looks like it’s not fully implemented there for log tables without an idvisit column.

@tsteur commented on August 22nd 2021 Member

To reproduce this issue, have Form Analytics installed and track some form data.

Then configure log data deletion as defined in https://matomo.org/faq/troubleshooting/faq_42/ . At least in the beginning you will want to configure a high minimum amount of days like 5000 in order to not delete all your tracked log data.

Then run the task that deletes old log data, e.g. using ./console core:run-scheduled-tasks --force "Piwik\Plugins\PrivacyManager\Tasks.deleteLogData"

While debugging LogDeleter::deleteVisits(), you will notice that it tries to delete the data from the log_form table, but not from the tables that reference log_form, e.g. log_form_page and log_form_field. Other custom tracker plugins may have similar problems.

It only deletes data from tables that have an idvisit column, and the other form tables don't have one. Instead, they reference log_form through the idlogform column.
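To illustrate the effect, here is a minimal sketch using an in-memory SQLite database. The schema is a simplified, hypothetical stand-in for the real Form Analytics tables (only the idvisit and idlogform columns are taken from this issue), and the helper names are made up for the example. It is not Matomo code.

```python
import sqlite3

# Simplified, hypothetical schema: the real tables have many more columns.
SCHEMA = """
CREATE TABLE log_form (idlogform INTEGER PRIMARY KEY, idvisit INTEGER);
CREATE TABLE log_form_field (idlogform INTEGER, field_name TEXT);
CREATE TABLE log_form_page (idlogform INTEGER, page_url TEXT);
"""

def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO log_form VALUES (?, ?)", [(1, 10), (2, 20)])
    db.executemany("INSERT INTO log_form_field VALUES (?, ?)",
                   [(1, "email"), (1, "name"), (2, "email")])
    db.executemany("INSERT INTO log_form_page VALUES (?, ?)",
                   [(1, "/a"), (2, "/b")])
    return db

def delete_by_idvisit_only(db, idvisits):
    # Mimics a purge that only touches tables with an idvisit column.
    marks = ",".join("?" * len(idvisits))
    db.execute(f"DELETE FROM log_form WHERE idvisit IN ({marks})", idvisits)

def count_orphans(db, child_table):
    # child_table comes from a fixed list below, never from user input.
    row = db.execute(
        f"SELECT COUNT(*) FROM {child_table} c "
        "LEFT JOIN log_form f ON f.idlogform = c.idlogform "
        "WHERE f.idlogform IS NULL"
    ).fetchone()
    return row[0]

db = make_db()
delete_by_idvisit_only(db, [10])
orphans = {t: count_orphans(db, t) for t in ("log_form_field", "log_form_page")}
print(orphans)  # {'log_form_field': 2, 'log_form_page': 1}
```

The same LEFT JOIN check, run against a real database, could be used to see how many orphaned rows have accumulated, assuming the table names carry your configured prefix.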

By contrast, the logic for deleting individual visit data in https://github.com/matomo-org/matomo/blob/4.4.1/plugins/PrivacyManager/Model/DataSubjects.php#L72-L77 does delete data from these tables.

Not sure if any of that logic can be reused: the DataSubjects API doesn't limit how many visits are deleted at once, so there is a risk that these log tables would be locked for too long, possibly causing downtime. But maybe it can be reused as long as not too many visits are sent to the method at once (which may already be the case).

We need to make sure that we delete the correct data from all the needed tables when log purging runs. At the same time we need to make sure these tables won't be locked for too long as otherwise there will be server issues while trying to track data into these tables.
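One possible shape for this, sketched with SQLite and a simplified, hypothetical schema: delete the child rows first (resolved through idlogform), then the log_form rows, in small chunks of visits with a commit after each chunk so that no single statement holds locks for long. The function name, chunk size and table layout are assumptions for illustration, not the actual core fix.

```python
import sqlite3

# Child tables that reference log_form via idlogform (as described above).
CHILD_TABLES = ("log_form_field", "log_form_page")

def purge_forms_for_visits(db, idvisits, chunk_size=1000):
    """Delete form rows for the given visits in small chunks; returns rows deleted."""
    deleted = 0
    for start in range(0, len(idvisits), chunk_size):
        chunk = idvisits[start:start + chunk_size]
        marks = ",".join("?" * len(chunk))
        subquery = f"SELECT idlogform FROM log_form WHERE idvisit IN ({marks})"
        # Children first, while the parent rows still exist to resolve idlogform.
        for table in CHILD_TABLES:
            cur = db.execute(
                f"DELETE FROM {table} WHERE idlogform IN ({subquery})", chunk)
            deleted += cur.rowcount
        cur = db.execute(f"DELETE FROM log_form WHERE idvisit IN ({marks})", chunk)
        deleted += cur.rowcount
        db.commit()  # keep each transaction short
    return deleted

# Hypothetical, simplified schema and data for the demonstration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE log_form (idlogform INTEGER PRIMARY KEY, idvisit INTEGER);
CREATE TABLE log_form_field (idlogform INTEGER);
CREATE TABLE log_form_page (idlogform INTEGER);
""")
db.executemany("INSERT INTO log_form VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
db.executemany("INSERT INTO log_form_field VALUES (?)", [(1,), (1,), (2,), (3,)])
db.executemany("INSERT INTO log_form_page VALUES (?)", [(1,), (3,)])

deleted = purge_forms_for_visits(db, [10, 20], chunk_size=1)
remaining = {
    t: db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    for t in ("log_form",) + CHILD_TABLES
}
print(deleted, remaining)
```

Deleting children before the parent matters here because the child tables can only be matched to a visit through the log_form row.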

@geekdenz commented on August 31st 2021 Contributor

At least in the beginning you will want to configure a high minimum amount of days like 5000 in order to not delete all your tracked log data.

Does that mean that, to reproduce, I will need to create data going back more than 5000 days? Would the VisitorGenerator be the tool to use for this, or is there another way, automated or manual?

I currently have no form data logs and would create a form, submit it and see what it logs. Then I would try to repeat the request with some different parameters to figure out how it all works.

I debugged the script like so:

php -dxdebug.start_with_request=yes ./console core:run-scheduled-tasks --force "Piwik\Plugins\PrivacyManager\Tasks.deleteLogData"

but it didn't hit LogDeleter::deleteVisits().

Might this be part of the problem, or does it hit LogDeleter::deleteVisits() only when there are logs?

Answering partly my own question:

https://github.com/matomo-org/matomo/blob/0e34030c2c29c9908f0cf2fca13db9ac54e62e8c/core/LogDeleter.php#L98-L101

seems to call this method only if there are logs.

The question that remains though what is the most effective and fastest way to generate the logs to be deleted?

@mattab commented on August 31st 2021 Member

The question that remains though what is the most effective and fastest way to generate the logs to be deleted?

I think the Visitor Generator, whose purpose is to generate test logs? It's pretty slow, but it should work fine.

@geekdenz commented on August 31st 2021 Contributor

Thanks @mattab .

For future reference, I use this command:

./console development:disable; ./console visitorgenerator:generate-visits --start-date 2007-12-20 --idsite 1 --days 6

This overlaps with 5000 days ago and avoids it taking too long.
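As a quick check of the date arithmetic, a small sketch: it assumes the purge cutoff is simply "today minus the configured number of days" and uses the date of this comment as "today". How Matomo treats the boundary day exactly is not covered here.

```python
from datetime import date, timedelta

def purge_cutoff(today, keep_days):
    # Assumed rule: anything older than today - keep_days is eligible for purging.
    return today - timedelta(days=keep_days)

cutoff = purge_cutoff(date(2021, 8, 31), 5000)
start = date(2007, 12, 20)               # --start-date from the command above
last = start + timedelta(days=6 - 1)     # --days 6 covers six consecutive days

print(cutoff)                   # 2007-12-23
print(start < cutoff <= last)   # True: the generated visits straddle the cutoff
```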

@tsteur commented on August 31st 2021 Member

@geekdenz you could also set a different number of days, like 300 or so. You probably just want to avoid deleting any recent data by accident.

This Issue was closed on September 13th 2021