@leaf-node opened this Issue on December 2nd 2020

Hi, thanks for Matomo. It's really great software : )

So I tried to anonymize past data in my site for the largest time frame (back through 2010), and that caused the site to become unresponsive, which is understandable given that we have 14M rows on one of the tables queried by that process. According to MySQL, one of the queries for that was in a process of "Sorting result" for quite a long time.

While I will likely be re-trying with shorter time periods, it would make good sense if the table has an index that covers the queries being performed. Given the way that the query is written, with an offset and a limit, I'm not sure whether the beginning of successively larger chunks of the DB will have to be sorted for each chunk of processing, even with the index.

The query that was taking a long time was:

| 651902 | matomo | localhost:58828 | matomo | Execute |  800 | Sorting result               | SELECT idsite, idvisit, location_ip, user_id, location_longitude, location_latitude, location_city, location_region, location_country FROM matomo_log_visit WHERE idsite in (1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18) and visit_last_action_time >= '2010-01-01 00:00:00' and visit_last_action_time <= '2020-12-03 23:59:59' ORDER BY idsite, visit_last_action_time, idvisit LIMIT 10000 OFFSET 0

And the index I created was the following:

MariaDB [matomo]> create index sudoman_fix on matomo_log_visit (idsite, visit_last_action_time, idvisit);
Query OK, 14199958 rows affected (13 min 29.64 sec)    ne
Records: 14199958  Duplicates: 0  Warnings: 0

As you can see, it took 13 minutes to add it to a table with 14M rows, during which the site was unresponsive.

I don't know if there are other tables that could use additional indexes, but I imagine that this one could speed up anonymization of batches in shorter time periods too.

As an another note, I originally killed the long running query, but the Web interface lists that job as "In Progress". I'm not sure if that means that it will try again, or if it just doesn't know that it failed.

Anyhow, thanks a bunch. : )

@leaf-node commented on December 2nd 2020

I'm using Matomo 4.0.3, and PHP 7.2.

@tsteur commented on December 2nd 2020 Member

Thanks for letting us know about this @sudoman

Very appreciated the investigation you already did. I reckon just for this use case we likely won't add an index since the feature is rarely used but would make things slower on the tracking side of things etc. And there is already an index on idsite, visit_last_action_time. The problem being the idvisit though.

Generally it be of course faster with smaller date ranges and maybe the anonymiser could be instead made faster to iterate only from month to month or so. Of course things could still be a bit slow if most of the visits were in the most recent months but it should overall speed things up quite a bit I suppose.

@leaf-node commented on December 2nd 2020

Okay, going without this index sounds fair. I do wonder if that section of the site isn't used much simply because an organization may make a decision about changing their anonymization policies once, then only convert old data that one time. That would still make it a useful feature in Matomo.

As for the details about the index itself, I don't know much about SQL indexes, but I believe that specifying multiple columns means that a single index covers all of them. That would be important when sorting based on three column names. Otherwise, the table would be quickly sorted by the first index, then the rest of the sort would be slower.

I decided to go back to the early history in our site and to start doing them in one year chunks. I watched the MySQL process list, and the relevant query was completing very fast, as in a few seconds or less with the proposed index. We'll see how things work out for more recent years that have more data.

Thanks : )

Powered by GitHub Issue Mirror