Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New index for speeding up anonymizations in large instances? #16856

Closed
leaf-node opened this issue Dec 2, 2020 · 4 comments
Closed

New index for speeding up anonymizations in large instances? #16856

leaf-node opened this issue Dec 2, 2020 · 4 comments
Labels
c: Performance For when we could improve the performance / speed of Matomo.

Comments

@leaf-node
Copy link

Hi, thanks for Matomo. It's really great software : )

So I tried to anonymize past data in my site for the largest time frame (back through 2010), and that caused the site to become unresponsive, which is understandable given that we have 14M rows on one of the tables queried by that process. According to MySQL, one of the queries for that was in a process of "Sorting result" for quite a long time.

While I will likely be re-trying with shorter time periods, it would make good sense if the table has an index that covers the queries being performed. Given the way that the query is written, with an offset and a limit, I'm not sure whether the beginning of successively larger chunks of the DB will have to be sorted for each chunk of processing, even with the index.

The query that was taking a long time was:

| 651902 | matomo | localhost:58828 | matomo | Execute |  800 | Sorting result               | SELECT idsite, idvisit, location_ip, user_id, location_longitude, location_latitude, location_city, location_region, location_country FROM matomo_log_visit WHERE idsite in (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18) and visit_last_action_time >= '2010-01-01 00:00:00' and visit_last_action_time <= '2020-12-03 23:59:59' ORDER BY idsite, visit_last_action_time, idvisit LIMIT 10000 OFFSET 0

And the index I created was the following:

MariaDB [matomo]> create index sudoman_fix on matomo_log_visit (idsite, visit_last_action_time, idvisit);
Query OK, 14199958 rows affected (13 min 29.64 sec)    ne
Records: 14199958  Duplicates: 0  Warnings: 0

As you can see, it took 13 minutes to add it to a table with 14M rows, during which the site was unresponsive.

I don't know if there are other tables that could use additional indexes, but I imagine that this one could speed up anonymization of batches in shorter time periods too.

As an another note, I originally killed the long running query, but the Web interface lists that job as "In Progress". I'm not sure if that means that it will try again, or if it just doesn't know that it failed.

Anyhow, thanks a bunch. : )

@leaf-node
Copy link
Author

I'm using Matomo 4.0.3, and PHP 7.2.

@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label Dec 2, 2020
@tsteur tsteur added this to the Priority Backlog (Help wanted) milestone Dec 2, 2020
@tsteur
Copy link
Member

tsteur commented Dec 2, 2020

Thanks for letting us know about this @sudoman

Very appreciated the investigation you already did. I reckon just for this use case we likely won't add an index since the feature is rarely used but would make things slower on the tracking side of things etc. And there is already an index on idsite, visit_last_action_time. The problem being the idvisit though.

Generally it be of course faster with smaller date ranges and maybe the anonymiser could be instead made faster to iterate only from month to month or so. Of course things could still be a bit slow if most of the visits were in the most recent months but it should overall speed things up quite a bit I suppose.

@leaf-node
Copy link
Author

Okay, going without this index sounds fair. I do wonder if that section of the site isn't used much simply because an organization may make a decision about changing their anonymization policies once, then only convert old data that one time. That would still make it a useful feature in Matomo.

As for the details about the index itself, I don't know much about SQL indexes, but I believe that specifying multiple columns means that a single index covers all of them. That would be important when sorting based on three column names. Otherwise, the table would be quickly sorted by the first index, then the rest of the sort would be slower.

I decided to go back to the early history in our site and to start doing them in one year chunks. I watched the MySQL process list, and the relevant query was completing very fast, as in a few seconds or less with the proposed index. We'll see how things work out for more recent years that have more data.

Thanks : )

@mattab
Copy link
Member

mattab commented Dec 11, 2023

Thanks for contributing to this issue. As it has been a few months since the last activity and we believe this is likely not an issue anymore, we will now close this. If that's not the case, please do feel free to either reopen this issue or open a new one. We will gladly take a look again!

@mattab mattab closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Performance For when we could improve the performance / speed of Matomo.
Projects
None yet
Development

No branches or pull requests

3 participants