@tsteur opened this Issue on October 21st 2022 Member

Given a segment like this:

pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Ftest.matomo.org%252F;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fevent;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fsearch;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Forga;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Ffoo;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fbar;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Ftesdlweke;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fhello-world;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fvideo;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fblog;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252F...;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fabout;pageUrl!<a class='mention' href='https://github.com/https'>@https</a>%253A%252F%252Fmatomo.org%252Fcart

The resulting query takes many hours to execute even on a smaller data set (on Aurora). In this case there are around:

  • 75K log_action entries
  • 20K log_visit entries
  • 1M log_link_visit_action entries

It's not too much data and wouldn't be expected that this query takes that long.

The resulting where clause looks roughly like this:

archive

See https://github.com/matomo-org/matomo/issues/19581 and https://github.com/matomo-org/matomo/issues/8850 and https://github.com/matomo-org/matomo/issues/8867 we know that action name queries are slow. However, we could see a big improvement for these kind of queries if we may be able to merge them into one queries instead of multiple full table scans.

Maybe we can do something like where idaction_url in (select idaction from log_action where type = 1 and (name like '%...%' or name like '%...%' or name like '%...%'...)) or if the segment definition was different like where idaction_url in (select idaction from log_action where type = 1 and (name not like '%...%' and name not like '%...%' and name not like '%...%'...)).

The goal would be to minimise the amount of full table scans as much as possible and group the queries together as much as possible so that the query completes a lot faster.

Above is running slow on a custom report with a segment filter attached but the same should be happening if this was a regular segment. Let me know if there are any questions or if anything is unclear.

@bx80 commented on October 26th 2022 Contributor

Hi @tsteur, thanks for the explanation, that makes sense. Would you see this approach as preferable to implementing a segment link table or some sort of persistent segment <-> action cache as discussed in #19581? Or as a short-medium term workaround?

@tsteur commented on October 26th 2022 Member

It depends on the effort. If we could improve this query fairly "easily", then I would implement this one first as it would prevent the worst case where we have not just one slow log_action.name contains but multiple

Powered by GitHub Issue Mirror