Old data in log tables without idvisit column are not purged (affects plugins) #16529

Closed
mattab opened this issue Oct 5, 2020 · 5 comments · Fixed by #17964
@mattab
Member

mattab commented Oct 5, 2020

Reported by a customer in https://forum.matomo.org/t/form-analytics-doesnt-cleanup-its-old-form-field-form-page-log-data/38567:

I’m using Matomo (3.14.0) with Form Analytics (3.1.28). Both “delete old raw data” and “delete old aggregated report data” policies are in place (with core:archive cron job). Still, in practice, I see that Form Analytics purges records from the log_form db table only, leaving log_form_field and log_form_page tables filled with all the historical data from the beginning of time.
I can confirm that old records are correctly purged from log_form but not from log_form_page/log_form_field tables, matching them by idvisit/idlogform. How can I clean that old data and prevent its accumulation in the future?

We would expect log_form_page and log_form_field to be purged as well.

According to @tsteur, this is supposed to be done by core, but it looks like it is not fully implemented there for log tables without an idvisit column.

@mattab mattab added Bug For errors / faults / flaws / inconsistencies etc. c: Platform For Matomo platform changes that aren't impacting any of our APIs but improve the core itself. labels Oct 5, 2020
@mattab mattab added this to the 4.3.0 milestone Oct 5, 2020
@mattab mattab changed the title from "Old data in log tables without idvisit column may not always be purged (affects plugins)" to "Old data in log tables without idvisit column are not purged (affects plugins)" Oct 5, 2020
@tsteur
Member

tsteur commented Aug 22, 2021

To reproduce this issue, install Form Analytics and track some data.

Then configure log data deletion as described in https://matomo.org/faq/troubleshooting/faq_42/ . At least in the beginning, you will want to configure a high minimum number of days, such as 5000, so that you do not delete all of your tracked log data.

Then run the task that deletes old log data, for example with ./console core:run-scheduled-tasks --force "Piwik\Plugins\PrivacyManager\Tasks.deleteLogData"

While debugging the code, you will notice in LogDeleter::deleteVisits() that it tries to delete the data from the log_form table, but not from the tables that reference log_form, such as log_form_page and log_form_field. Other custom tracker plugins may have similar problems.

It only deletes data from tables that have an idvisit column, and the other form tables don't have one. Instead, they reference log_form through the idlogform column.
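To make the relationship concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not Matomo's code: the schema is a simplified assumption (only log_form, log_form_page, log_form_field, idvisit and idlogform come from this issue; the `id` columns and the function names are hypothetical). It shows how rows in the child tables can be resolved through log_form before the parent rows are removed.

```python
import sqlite3

CHILD_TABLES = ("log_form_page", "log_form_field")


def delete_visits_with_children(conn, idvisits):
    """Delete log_form rows for the given visits, plus rows in child tables
    that only reference log_form through idlogform."""
    if not idvisits:
        return 0
    marks = ",".join("?" * len(idvisits))
    deleted = 0
    # Child tables have no idvisit column, so resolve them through log_form.
    for child in CHILD_TABLES:
        cur = conn.execute(
            f"DELETE FROM {child} WHERE idlogform IN "
            f"(SELECT idlogform FROM log_form WHERE idvisit IN ({marks}))",
            list(idvisits),
        )
        deleted += cur.rowcount
    # Delete the parent rows last; otherwise the subquery above finds nothing.
    cur = conn.execute(
        f"DELETE FROM log_form WHERE idvisit IN ({marks})", list(idvisits)
    )
    deleted += cur.rowcount
    conn.commit()
    return deleted


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schema for illustration only.
    conn.executescript("""
        CREATE TABLE log_form (idlogform INTEGER PRIMARY KEY, idvisit INTEGER);
        CREATE TABLE log_form_page (id INTEGER PRIMARY KEY, idlogform INTEGER);
        CREATE TABLE log_form_field (id INTEGER PRIMARY KEY, idlogform INTEGER);
        INSERT INTO log_form VALUES (1, 10), (2, 20);
        INSERT INTO log_form_page VALUES (1, 1), (2, 2);
        INSERT INTO log_form_field VALUES (1, 1), (2, 1), (3, 2);
    """)
    deleted = delete_visits_with_children(conn, [10])
    remaining = {
        table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("log_form",) + CHILD_TABLES
    }
    return deleted, remaining
```

The order matters: if log_form were purged first, the child rows could no longer be matched to a visit, which is the orphaned state reported in the forum thread.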

In contrast, the logic for deleting individual visit data in https://github.com/matomo-org/matomo/blob/4.4.1/plugins/PrivacyManager/Model/DataSubjects.php#L72-L77 does delete data from these tables.

I'm not sure whether any of that logic can be reused: the DataSubjects API, at least, has no limit on how many visits are deleted at once, so there is a risk that these log tables would be locked for too long, possibly causing downtime. It might be reusable as long as not too many visits are passed to the method at once (which may already be the case).

We need to make sure that we delete the correct data from all the needed tables when log purging runs. At the same time, we need to make sure these tables won't be locked for too long, as otherwise there will be server issues while trying to track data into these tables.
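One way to picture the locking concern is a purge that works in small batches and commits after each one. The following is only a sketch under stated assumptions, written in Python against SQLite rather than Matomo's PHP and MySQL code; the schema, the `purge_in_chunks` name and the chunk size are all hypothetical.

```python
import sqlite3

# Children first, parent last, so nothing is orphaned mid-run.
TABLES = ("log_form_page", "log_form_field", "log_form")


def purge_in_chunks(conn, max_idvisit, chunk_size=500):
    """Delete form log data for visits up to max_idvisit, one chunk at a time."""
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT idlogform FROM log_form WHERE idvisit <= ? "
            "ORDER BY idlogform LIMIT ?",
            (max_idvisit, chunk_size),
        )]
        if not ids:
            return total
        marks = ",".join("?" * len(ids))
        for table in TABLES:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE idlogform IN ({marks})", ids
            )
            total += cur.rowcount
        # Committing per chunk keeps each transaction, and its locks, short.
        conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schema for illustration only.
    conn.executescript("""
        CREATE TABLE log_form (idlogform INTEGER PRIMARY KEY, idvisit INTEGER);
        CREATE TABLE log_form_page (id INTEGER PRIMARY KEY, idlogform INTEGER);
        CREATE TABLE log_form_field (id INTEGER PRIMARY KEY, idlogform INTEGER);
    """)
    forms = range(1, 11)
    conn.executemany("INSERT INTO log_form VALUES (?, ?)", [(i, i) for i in forms])
    conn.executemany(
        "INSERT INTO log_form_page (idlogform) VALUES (?)", [(i,) for i in forms]
    )
    conn.executemany(
        "INSERT INTO log_form_field (idlogform) VALUES (?)",
        [(i,) for i in forms for _ in range(2)],
    )
    deleted = purge_in_chunks(conn, max_idvisit=7, chunk_size=3)
    left = conn.execute("SELECT COUNT(*) FROM log_form").fetchone()[0]
    return deleted, left
```

Smaller chunks mean more round trips but shorter lock windows; the right size depends on the database and tracking load, which this sketch does not attempt to measure.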

@tsteur tsteur added the Help wanted Beginner friendly issues or issues where we'd highly appreciate community's help and involvement. label Aug 22, 2021
@geekdenz
Contributor

At least in the beginning you will want to configure a high minimum amount of days like 5000 in order to not delete all your tracked log data.

Does that mean that, to reproduce, I will need to create data going back more than 5000 days? Would the VisitorGenerator be the right tool for this, or is there another way that I would need to automate or do manually?

I currently have no form data logs and would create a form, submit it and see what it logs. Then I would try to repeat the request with some different parameters to figure out how it all works.

I debugged the script like so:

php -dxdebug.start_with_request=yes ./console core:run-scheduled-tasks --force "Piwik\Plugins\PrivacyManager\Tasks.deleteLogData"

but it didn't hit LogDeleter::deleteVisits().

Might this be part of the problem, or does it hit LogDeleter::deleteVisits() only when there are logs?

Partly answering my own question:

matomo/core/LogDeleter.php

Lines 98 to 101 in 0e34030

$this->rawLogDao->forAllLogs('log_visit', $fields, $conditions, $iterationStep, function ($logs) use ($logPurger, &$logsDeleted, $afterChunkDeleted) {
    $ids = array_map(function ($row) { return (int) (reset($row)); }, $logs);
    sort($ids);
    $logsDeleted += $logPurger->deleteVisits($ids);

seems to call this method only if there are logs.
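The behaviour observed here is typical of a chunked iterator that invokes its callback once per non-empty batch. Below is a minimal Python sketch of that pattern. It is not the implementation of forAllLogs: the names `for_all_rows` and `make_fetcher` are hypothetical, and paging by "id greater than the last one seen" is an assumption made for the example.

```python
def for_all_rows(fetch_chunk, step, callback):
    """Call callback once per non-empty chunk; never for an empty result."""
    last_id = 0
    while True:
        rows = fetch_chunk(last_id, step)
        if not rows:
            return
        callback(rows)
        last_id = rows[-1]


def make_fetcher(ids):
    """Stand-in for a database query: ids greater than after_id, up to limit."""
    ordered = sorted(ids)

    def fetch(after_id, limit):
        return [i for i in ordered if i > after_id][:limit]

    return fetch


calls = []
# No matching rows: the callback (the "deleteVisits" stand-in) is never hit.
for_all_rows(make_fetcher([]), 2, calls.append)
empty_calls = len(calls)
# Three matching rows with a step of 2: two callback invocations.
for_all_rows(make_fetcher([5, 1, 3]), 2, calls.append)
```

With no rows old enough to purge, a breakpoint inside the callback is never reached, which matches what the debugger showed.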

The remaining question, though, is: what is the most effective and fastest way to generate the logs to be deleted?

@mattab
Member Author

mattab commented Aug 31, 2021

The question that remains though what is the most effective and fastest way to generate the logs to be deleted?

I think the Visitor Generator is the right tool, since its purpose is to generate test logs. It's pretty slow, but it should work fine.

@geekdenz
Contributor

Thanks @mattab .

For future reference, I use this command:

./console development:disable; ./console visitorgenerator:generate-visits --start-date 2007-12-20 --idsite 1 --days 6

This overlaps with 5000 days ago and avoids it taking too long.
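As a quick check of the date arithmetic, here is a small Python sketch. The reference date of 2021-08-31 is an assumption taken from the dates of the neighbouring comments, and the helper names are hypothetical; it says nothing about how the generator or the purge task interpret their options.

```python
from datetime import date, timedelta


def days_ago(day, today):
    """Number of days between a past day and the reference date."""
    return (today - day).days


def purge_cutoff(today, delete_older_than_days):
    """Calendar day that lies delete_older_than_days before the reference date."""
    return today - timedelta(days=delete_older_than_days)


reference = date(2021, 8, 31)  # assumed reference date, not stated in the comment
start_age = days_ago(date(2007, 12, 20), reference)
cutoff = purge_cutoff(reference, 5000)
print(start_age)  # 5003
print(cutoff)     # 2007-12-23
```

So a start date of 2007-12-20 lies 5003 days before that reference date, three days on the far side of a 5000-day threshold.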

@tsteur
Member

tsteur commented Aug 31, 2021

@geekdenz you could also set a different number of days, like 300 or so. You probably just want to avoid deleting any recent data by accident.
