@tsteur opened this Issue on July 28th 2022 Member

refs https://github.com/matomo-org/matomo/issues/8850 and https://github.com/matomo-org/matomo/issues/8867

When someone creates a segment like page_url contains '%...%' or page_title contains '%...%' or event_name contains '%...%' then these kind of queries are very slow as it requires a full table scan on log_action. They result in a where clause for example like this ( ( log_link_visit_action.idaction_url IN (SELECT idaction FROM log_action WHERE ( name LIKE CONCAT('%', 'njw', '%') AND type = 10 )) ) ). With a lot of data in log_action table these queries can be extremely slow, cause a lot of CPU usage etc.

As it's one of Matomo's biggest bottleneck, we have investigated a few times in the past whether we can optimise this query but couldn't find any solution.

In https://github.com/matomo-org/matomo/issues/8867 there was a config enable_segments_subquery_cache created but this doesn't help all that much as the cache might become outdated (it caches the data for 1 hour) and the cache re-usage would be limited during that hour as we already use temporary tables when archiving to only needing to execute this query once.

I'm now creating this issue again in case there are other ways to make this faster. Like maybe we can maintain MySQL tables for each such segment definition so that we not only write log_action entries in log_action but also eg in a table log_action_pageurl_contains_foo and we reuse the same log_action IDs etc in that table. This way we wouldn't need to run the contains query at all but could change it to join that table instead. This of course might cause more writes during tracking and not sure this is beneficial or worth it. There might be also other ideas.

@bx80 commented on August 1st 2022 Contributor

It might not be too hard to test the performance benefits of using a segment <-> action link table, something like the following on a large dataset with a complex segment could be tested with and without using the link table to compare performance. The trade off would be efficiently matching new actions created by the tracker to existing segments. Perhaps if we could do this in code then it might just be a single read (get the segment definitions) and a single write (bulk insert into the link table).

-- Migration
CREATE TABLE log_link_segment_action (
  idaction    bigint(10) UNSIGNED,
  idsegment   bigint(10) UNSIGNED,
  PRIMARY KEY (idaction, idsegment)
-- Migration or when a new segment is added (the where clause is the segment definition)
INSERT INTO log_link_segment_action 
(SELECT idaction, 1 AS idsegement FROM log_action WHERE ( name LIKE CONCAT('%', 'njw', '%') AND type = 10 ));
// When a new log_action record is added by the tracker
$segments = Piwik\Plugins\SegmentEditor\AAPI\API::getInstance()->getAll(1);
foreach ($segments as $s) {
    if (matchSegmentInCode($s['definition], $action){
        // INSERT INTO log_link_segment_action (idaction, idsegment) VALUES ($s[id], $idaction);
-- Analytic queries using join instead of expensive contains query
SELECT * FROM log_link_visit_action va
LEFT JOIN log_link_segment_action sa ON sa.idaction = va.idaction_url
WHERE sa.idsegment = 1;
Powered by GitHub Issue Mirror