@mattab opened this Issue on May 14th 2018 Member

Something users probably don't expect is that when you delete a website, all of its tracking data is still kept in the database.

-> When a website has been deleted, we should also delete the raw data for this website.

Maybe this could be handled as a scheduled task, because deleting data will take a long time.

I reckon we need to address this soon so moving tentatively to 3.6.0 milestone.

Refs #12368 ("Soft delete" a website in the site table)

@sgiehl commented on June 25th 2018 Member

So what would be the final state we would like to achieve?
Should it only be possible to soft delete a website in the UI, and we would have an additional command to completely truncate a soft deleted website (which will also remove the raw data)?

@mattab commented on June 25th 2018 Member

Ideally we would:

  1. "soft delete" the websites (covered in https://github.com/matomo-org/matomo/issues/12368)
  2. have a scheduled task that deletes all RAW data that refers to invalid site IDs (ie. deleted sites)

This issue is only about 2. where we need a new scheduled task that would somehow automatically delete all data for invalid/deleted websites. Hopefully this SQL query (which will have some JOIN on the websites table) can be made fast on very large log_ tables.
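As a rough illustration of the JOIN-based query mentioned above, here is a minimal sketch. It assumes a deleted site no longer has a row in the site table (a soft delete as in #12368 would need an extra condition on whatever flag marks a site as deleted). `buildOrphanDeleteSql` and the `matomo_` table prefix are hypothetical names used only for this example, not existing Matomo code.

```php
<?php
// Hypothetical helper, not part of Matomo: builds a MySQL multi-table DELETE that
// removes rows from $logTable whose idsite has no matching row in $siteTable.
function buildOrphanDeleteSql(string $logTable, string $siteTable): string
{
    // Table names cannot be bound as query parameters, so only plain identifiers are accepted.
    foreach ([$logTable, $siteTable] as $name) {
        if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Unexpected table name: $name");
        }
    }

    // The LEFT JOIN keeps every log row; rows without a matching site get NULL for s.idsite.
    return "DELETE l FROM `$logTable` l"
        . " LEFT JOIN `$siteTable` s ON s.idsite = l.idsite"
        . " WHERE s.idsite IS NULL";
}

// Assumed table names, for illustration only.
echo buildOrphanDeleteSql('matomo_log_visit', 'matomo_site'), PHP_EOL;
```

Whether this is fast enough on very large log_ tables would still need to be measured. MySQL's multi-table DELETE does not accept LIMIT, so running it in batches would need a different query shape.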

Edit: It would be good if we can also do #12368 at the same time as this issue, if possible, but it's not required.

@diosmosis commented on July 24th 2018 Member

Note: when deleting a site there should be a message clarifying that the site's data will be permanently deleted.

@sgiehl commented on July 25th 2018 Member

To simply remove all data from all tables that refer to an invalid site, something like this could be used:

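use Piwik\Db;
use Piwik\DbHelper;
use Piwik\Plugin\ConsoleCommand;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class PurgeRemovedSites extends ConsoleCommand
{
    protected function configure()
    {
        $this->setName('core:purge-removed-sites');
        $this->setDescription('Purges raw and log data for already removed sites.');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $allTables = DbHelper::getTablesInstalled();

        $sitesModel = new \Piwik\Plugins\SitesManager\Model();

        // IDs of all sites that still exist, cast to int before they are put into SQL.
        $existingIdSites = array_map('intval', array_column($sitesModel->getAllSites(), 'idsite'));

        if (empty($existingIdSites)) {
            // "NOT IN ()" is invalid SQL, and with no sites left every row would match anyway.
            $output->writeln('No sites found, nothing to compare against.');
            return;
        }

        foreach ($allTables as $table) {
            // Only tables that have an idsite column can be filtered by site.
            $columns = array_column(Db::fetchAll("SHOW COLUMNS FROM " . $table), 'Field');

            if (in_array('idsite', $columns)) {
                // Only prints the query for now; it is not executed.
                $output->writeln('DELETE FROM ' . $table . ' WHERE idsite NOT IN (' . implode(', ', $existingIdSites) . ')');
            }
        }
    }
}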

But that won't remove any data that is related at a deeper level, like data in log_action where some idaction might not be in use anymore.
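To make the log_action case concrete, here is a minimal sketch of a query that would remove actions no longer referenced anywhere. `buildUnusedActionDeleteSql` is a hypothetical helper, and the list of referencing tables and columns passed to it is illustrative and deliberately incomplete. A real cleanup would have to cover every column that stores an idaction, otherwise actions that are still in use would be deleted.

```php
<?php
// Hypothetical helper, not part of Matomo: rejects anything that is not a plain identifier,
// because table and column names cannot be bound as query parameters.
function assertPlainIdentifier(string $identifier): void
{
    if (!preg_match('/^[A-Za-z0-9_]+$/', $identifier)) {
        throw new InvalidArgumentException("Unexpected identifier: $identifier");
    }
}

// Hypothetical helper: builds a DELETE for rows in $actionTable whose idaction is not
// referenced by any of the given columns. $references maps table name => list of columns.
function buildUnusedActionDeleteSql(string $actionTable, array $references): string
{
    assertPlainIdentifier($actionTable);

    $conditions = [];
    foreach ($references as $table => $columns) {
        assertPlainIdentifier($table);
        foreach ($columns as $column) {
            assertPlainIdentifier($column);
            // One NOT EXISTS per referencing column: the action must be unused in all of them.
            $conditions[] = "NOT EXISTS (SELECT 1 FROM `$table` r WHERE r.`$column` = a.idaction)";
        }
    }

    if ($conditions === []) {
        // Without any reference the query would delete every action.
        throw new InvalidArgumentException('At least one referencing column is required.');
    }

    return "DELETE a FROM `$actionTable` a WHERE " . implode(' AND ', $conditions);
}

// Assumed prefix and an incomplete column list, for illustration only.
echo buildUnusedActionDeleteSql('matomo_log_action', [
    'matomo_log_link_visit_action' => ['idaction_url', 'idaction_name'],
]), PHP_EOL;
```

This would have to run after the per-site rows have been deleted, since only then do the affected actions become unreferenced.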

@tsteur commented on July 26th 2018 Member

FYI: Not sure if it helps, but in https://github.com/matomo-org/matomo/blob/3.5.1/plugins/PrivacyManager/Model/DataSubjects.php#L36-L89 we have some logic for log tables to also delete related data that may not have an idsite column. I presume similar logic could be used here.

This Issue was closed on August 6th 2018