Deleting log data of deleted sites can cause a huge spike in connections and CPU usage. #20221

Open
tsteur opened this issue Jan 16, 2023 · 4 comments
Labels
Stability For issues that make Matomo more stable and reliable to run for sys admins.
Milestone
5.2.0

Comments

@tsteur
Member

tsteur commented Jan 16, 2023

Recently, we've seen a huge number of connections to our database for a short time. This was likely caused by the PrivacyManager task that deletes logs for a deleted site.

We've seen quite a few DELETE log_visit FROM log_visit log_visit WHERE log_visit.idsite = ? and similar queries during that short time. While most of these queries are covered by an index, if there's a lot of data to be deleted they may still cause a lot of queueing, or specific log tables may cause locks. It's actually a recurring event: every Monday evening we see higher DB connections and CPU usage for a short time.

Can we look at solutions for this, like deleting data in chunks of, say, 100,000 records at a time, so that a few other requests can be processed in between?
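For illustration, a chunked delete could look roughly like the sketch below. This is not the actual implementation; the table and column names are the ones from the query above, 123 stands for the idsite placeholder, and 100,000 is just the suggested chunk size.

```sql
-- Delete one deleted site's log data in bounded chunks instead of a single
-- huge statement, so other queries get a chance to run between chunks.
-- MySQL allows ORDER BY and LIMIT on a single-table DELETE.
DELETE FROM log_visit
WHERE idsite = 123
ORDER BY idvisit
LIMIT 100000;
```

Repeating the statement until it affects 0 rows (ideally with a short pause between chunks) would spread the load instead of holding locks for the whole deletion.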

@tsteur tsteur added Stability For issues that make Matomo more stable and reliable to run for sys admins. To Triage An issue awaiting triage by a Matomo core team member labels Jan 16, 2023
@mattab mattab modified the milestones: Impact Backlog, 5.2.0 Jan 16, 2023
@bx80
Contributor

bx80 commented Jan 16, 2023

Other PrivacyManager scheduled tasks, such as deleteReportData(), already use Db::deleteAllRows(), which breaks the delete command up into chunks of 100,000 records by default.

So the PrivacyManager deleteLogDataForDeletedSites() task that is causing this issue could be adjusted to use the same standard method (probably in DataSubjects->deleteLogDataFrom()); see the sketch below.
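For context, that task first needs the deleted sites that still have log rows, and the chunked delete from the previous comment could then be applied per site and per log table. The query below is purely illustrative (table prefix omitted; the real code may determine deleted sites differently):

```sql
-- Site ids that still have log data but no longer exist in the site table.
SELECT DISTINCT log_visit.idsite
FROM log_visit
LEFT JOIN site ON site.idsite = log_visit.idsite
WHERE site.idsite IS NULL;
```

Each of those site ids would then be handed to the chunked, 100,000-rows-at-a-time delete rather than to one unbounded DELETE per table.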

@bx80 bx80 removed the To Triage An issue awaiting triage by a Matomo core team member label Jan 16, 2023
@tsteur
Member Author

tsteur commented Mar 6, 2023

We had an issue with this one again today.

@tsteur
Member Author

tsteur commented Mar 6, 2023

FYI, I was reading the info below in the MySQL docs, which may be why this wasn't implemented:

You can specify multiple tables in a DELETE statement to delete rows from one or more tables depending on the condition in the WHERE clause. You cannot use ORDER BY or LIMIT in a multiple-table DELETE. The table_references clause lists the tables involved in the join, as described in Section 13.2.13.2, “JOIN Clause”.
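In other words, the query shape from the first comment appears to use the multiple-table DELETE syntax, so a LIMIT cannot simply be added to it. A minimal illustration (123 stands for the site id placeholder):

```sql
-- Multiple-table DELETE syntax: ORDER BY / LIMIT are not accepted here,
-- so this statement cannot be chunked as-is.
DELETE log_visit FROM log_visit log_visit
WHERE log_visit.idsite = 123
LIMIT 100000;              -- rejected by MySQL

-- Single-table DELETE syntax: ORDER BY and LIMIT are allowed,
-- so the statement can be repeated chunk by chunk.
DELETE FROM log_visit
WHERE idsite = 123
ORDER BY idvisit
LIMIT 100000;
```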

@tsteur
Member Author

tsteur commented Mar 10, 2023

FYI, we have made a first improvement in #20424.

It may now delete some data in chunks. However, this is currently only applied to the log_visit table.

When we delete by idsite, or in other cases by idvisit and idsite, there's no need to use a multi-table DELETE statement, as most log tables actually have these columns.

Meaning we should detect which log tables have e.g. the idsite column, and if a table does have that column, delete its data in chunks without joining another table (see the sketch below). This would speed up those deletions significantly and let us delete in chunks. The same applies when we delete e.g. by idvisit.
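A rough sketch of that detection in plain SQL, purely for illustration; the real implementation would presumably inspect the schema through Matomo's DB layer rather than raw queries, and log_link_visit_action / 123 are just example values:

```sql
-- Does this log table have an idsite column? (check once per log table)
SELECT COUNT(*) AS has_idsite
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME   = 'log_link_visit_action'
  AND COLUMN_NAME  = 'idsite';

-- If it does: delete directly by idsite in chunks, no join needed.
DELETE FROM log_link_visit_action
WHERE idsite = 123
LIMIT 100000;

-- If it does not: keep the existing join-based (multi-table) delete for that
-- table, which cannot use LIMIT and therefore cannot be chunked.
```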
