When website has been deleted, we should also delete the raw data for this website #12907

Closed
mattab opened this issue May 14, 2018 · 5 comments

@mattab
Member

mattab commented May 14, 2018

Something users probably don't expect: when you delete a website, all of its tracking data is still kept in the database.

-> When website has been deleted, we should also delete the raw data for this website

Maybe this could be handled as a scheduled task, because deleting data will take a long time.
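A minimal sketch of why a background task suits this: the deletion can run in small chunks so that no single statement holds locks for long. The helper name `deleteSiteRowsInChunks`, the chunk size, and the simplified `log_visit` schema are assumptions for illustration, and an in-memory SQLite database stands in for the real one.

```php
<?php
// Hypothetical helper: delete all rows of one site in small chunks.
// Table and column names are simplified for illustration only.
function deleteSiteRowsInChunks(PDO $db, int $idSite, int $chunkSize): int
{
    // $chunkSize is typed as int, so it is safe to inline into the SQL.
    $select = $db->prepare(
        'SELECT idvisit FROM log_visit WHERE idsite = ? LIMIT ' . $chunkSize
    );
    $deleted = 0;

    do {
        // Fetch the primary keys of the next chunk.
        $select->execute([$idSite]);
        $ids = $select->fetchAll(PDO::FETCH_COLUMN);

        if ($ids) {
            // Delete exactly those rows by primary key.
            $placeholders = implode(', ', array_fill(0, count($ids), '?'));
            $delete = $db->prepare(
                "DELETE FROM log_visit WHERE idvisit IN ($placeholders)"
            );
            $delete->execute($ids);
            $deleted += $delete->rowCount();
        }
        // A short chunk means there is nothing left to delete.
    } while (count($ids) === $chunkSize);

    return $deleted;
}

// In-memory SQLite stands in for the real database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE log_visit (idvisit INTEGER PRIMARY KEY, idsite INTEGER)');

$insert = $db->prepare('INSERT INTO log_visit (idsite) VALUES (?)');
foreach ([2, 2, 2, 2, 2, 1, 1] as $idSite) {
    $insert->execute([$idSite]);
}

// Site 2 was deleted: remove its five visits, two rows at a time.
$removed = deleteSiteRowsInChunks($db, 2, 2);
$remaining = (int) $db->query('SELECT COUNT(*) FROM log_visit')->fetchColumn();
```

A scheduled task could call such a helper once per run with a capped number of chunks, so a very large site is purged over several runs rather than in one long transaction.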

I reckon we need to address this soon so moving tentatively to 3.6.0 milestone.

Refs #12368 ("Soft delete" a website in the site table)

@mattab mattab added c: Security For issues that make Matomo more secure. Please report issues through HackerOne and not in Github. c: Privacy For issues that impact or improve the privacy. labels May 14, 2018
@mattab mattab added this to the 3.6.0 milestone May 14, 2018
@sgiehl
Member

sgiehl commented Jun 25, 2018

So what would be the final state we'd like to achieve?
Should it only be possible to soft delete a website in the UI, and we would have an additional command to completely truncate a soft deleted website (which will also remove the raw data)?

@mattab
Member Author

mattab commented Jun 25, 2018

Ideally we would:

  1. "soft delete" the websites (covered in #12368, 'When a website is deleted, it should be "soft deleted" rather than completely deleted')
  2. have a scheduled task that deletes all RAW data that refers to invalid site IDs (ie. deleted sites)

This issue is only about 2. where we need a new scheduled task that would somehow automatically delete all data for invalid/deleted websites. Hopefully this SQL query (which will have some JOIN on the websites table) can be made fast on very large log_ tables.
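As a rough sketch of the query described above, the anti-join below finds raw rows whose `idsite` no longer exists in the sites table and then deletes them. The unprefixed table names and the reduced columns are assumptions for illustration. SQLite (used here so the example is self-contained) has no multi-table `DELETE`, so the delete is written with `NOT EXISTS`; on MySQL the same anti-join could likely be expressed as a `DELETE ... LEFT JOIN ... WHERE site.idsite IS NULL`.

```php
<?php
// In-memory SQLite stands in for the real database; schema is simplified.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE site (idsite INTEGER PRIMARY KEY)');
$db->exec('CREATE TABLE log_visit (idvisit INTEGER PRIMARY KEY, idsite INTEGER)');

// Sites 1 and 3 still exist; site 2 has been deleted.
$db->exec('INSERT INTO site (idsite) VALUES (1), (3)');
$db->exec('INSERT INTO log_visit (idvisit, idsite) VALUES (1, 1), (2, 2), (3, 2), (4, 3)');

// Anti-join: site IDs referenced by raw data but missing from the site table.
$orphanedSiteIds = array_map('intval', $db->query(
    'SELECT DISTINCT v.idsite
       FROM log_visit v
       LEFT JOIN site s ON s.idsite = v.idsite
      WHERE s.idsite IS NULL'
)->fetchAll(PDO::FETCH_COLUMN));

// Delete every raw row whose site no longer exists.
$deleted = $db->exec(
    'DELETE FROM log_visit
      WHERE NOT EXISTS (SELECT 1 FROM site WHERE site.idsite = log_visit.idsite)'
);

$remainingSiteIds = array_map('intval', $db->query(
    'SELECT idsite FROM log_visit ORDER BY idvisit'
)->fetchAll(PDO::FETCH_COLUMN));
```

Whether this stays fast on very large tables depends on an index covering `idsite`, which this sketch does not model.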

Edit: It would be good if we can also do #12368 at the same time as this issue, if possible, but it's not required.

@mattab mattab added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Jun 28, 2018
@sgiehl sgiehl self-assigned this Jul 9, 2018
@diosmosis
Member

Note: when deleting a site there should be a message clarifying that the site's data will be permanently deleted.

@sgiehl
Member

sgiehl commented Jul 25, 2018

To simply remove all data from all tables that refer to an invalid site something like this could be used:

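use Piwik\Db;
use Piwik\DbHelper;
use Piwik\Plugin\ConsoleCommand;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class PurgeRemovedSites extends ConsoleCommand
{
    protected function configure()
    {
        $this->setName('core:purge-removed-sites');
        $this->setDescription('Purges raw and log data for already removed sites.');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $allTables = DbHelper::getTablesInstalled();

        $sitesModel = new \Piwik\Plugins\SitesManager\Model();

        // Cast to int so the IDs are safe to inline into the SQL string.
        $existingIdSites = array_map('intval', array_column($sitesModel->getAllSites(), 'idsite'));

        if (empty($existingIdSites)) {
            // "NOT IN ()" is invalid SQL, so bail out when no sites exist.
            return;
        }

        foreach ($allTables as $table) {
            $columns = array_column(Db::fetchAll("SHOW COLUMNS FROM " . $table), 'Field');

            if (in_array('idsite', $columns)) {
                // Only prints the statement for now; nothing is deleted yet.
                $output->writeln('DELETE FROM ' . $table . ' WHERE idsite NOT IN (' . implode(', ', $existingIdSites) . ')');
            }
        }
    }
}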

But that won't remove any data that is related at a deeper level, such as rows in log_action whose idaction might no longer be in use.
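To illustrate the deeper cleanup, here is a minimal sketch that removes action rows no longer referenced once a deleted site's rows are gone. It assumes a heavily simplified schema in which only one column, `idaction_url`, points at `log_action`. In the real schema several columns across several tables reference `idaction`, and all of them would have to be checked before a row could be treated as unused.

```php
<?php
// In-memory SQLite stands in for the real database; schema is simplified.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE log_action (idaction INTEGER PRIMARY KEY, name TEXT)');
$db->exec(
    'CREATE TABLE log_link_visit_action (
        idlink_va INTEGER PRIMARY KEY, idsite INTEGER, idaction_url INTEGER
    )'
);

$db->exec("INSERT INTO log_action (idaction, name)
           VALUES (1, '/home'), (2, '/pricing'), (3, '/old-site-only')");
// Action 2 is shared by both sites; action 3 is used by site 2 only.
$db->exec('INSERT INTO log_link_visit_action (idlink_va, idsite, idaction_url)
           VALUES (1, 1, 1), (2, 1, 2), (3, 2, 2), (4, 2, 3)');

// Step 1: site 2 was deleted, so remove the rows that carry its idsite.
$db->exec('DELETE FROM log_link_visit_action WHERE idsite = 2');

// Step 2: remove actions that no remaining row references.
// (Assumption: idaction_url is the only referencing column in this sketch.)
$removedActions = $db->exec(
    'DELETE FROM log_action
      WHERE idaction NOT IN (
          SELECT idaction_url FROM log_link_visit_action
           WHERE idaction_url IS NOT NULL
      )'
);

$remainingActions = $db->query(
    'SELECT name FROM log_action ORDER BY idaction'
)->fetchAll(PDO::FETCH_COLUMN);
```

The shared action survives because site 1 still references it, which is why this step cannot simply filter on `idsite`.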

@sgiehl sgiehl removed their assignment Jul 26, 2018
@tsteur
Member

tsteur commented Jul 26, 2018

FYI: Not sure if it helps, but in https://github.com/matomo-org/matomo/blob/3.5.1/plugins/PrivacyManager/Model/DataSubjects.php#L36-L89 we have some logic for log tables that also deletes related data that may not have an idsite column. I presume similar logic could be used here.
