@sgiehl opened this Pull Request on July 2nd 2020 Member

This PR aims to fix the behaviour for segments using "not equals" or "not contains" on an action dimension as described in #13785.

@tsteur I tried to implement it like you suggested in https://github.com/matomo-org/matomo/issues/13785#issuecomment-652023031, but that actually caused a lot problems with segments that contain a union of segments.

When a segment is initialized the union of segments would by default be splitted up into the union of segments. But those segment union needs to be in a single subquery as it would have an incorrect results otherwise. For now I've put the code to generate that subquery into the segment initialize, which requires the period to be available. Not an optimal solution as a segment should not require a period to be initialized, but guess we would need to refactor the whole usage of the Segment class otherwise, as we need to do that stuff while parsing, but the parsing is also needed for methods that are independent from the period...

At least it seems to work the way it's currently implemented. A segment like actionUrl!<a class='mention' href='https://github.com/piwik'>@piwik</a>;browserName!<a class='mention' href='https://github.com/Chrome'>@Chrome</a> generates a query for the visitor log like:

/* idSegments = [5] */
SELECT log_visit.*
FROM matomo_log_visit AS log_visit
WHERE (log_visit.idsite IN (?)
       AND log_visit.visit_last_action_time >= ?
       AND log_visit.visit_last_action_time <= ?)
  AND ((log_visit.idvisit NOT IN
          (SELECT /* 2020-06-29,2020-07-01 sites 1 segmenthash 4e6fb538924981bfe0f0bd7e9440df3b */ log_inner.idvisit
           FROM
             (SELECT log_visit.idvisit
              FROM matomo_log_visit AS log_visit
              LEFT JOIN matomo_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
              WHERE (log_visit.visit_last_action_time >= ?
                     AND log_visit.visit_last_action_time <= ?
                     AND log_visit.idsite IN (?))
                AND (((log_link_visit_action.idaction_url IN
                         (SELECT idaction
                          FROM matomo_log_action
                          WHERE (name LIKE CONCAT('%', ?, '%')
                                 AND TYPE = 1)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 3)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 2)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 10)))))
              GROUP BY log_visit.idvisit
              ORDER BY NULL) AS log_inner
           GROUP BY log_inner.idvisit))
       AND (log_visit.config_browser_name IS NULL
            OR log_visit.config_browser_name NOT LIKE ?))
GROUP BY log_visit.idvisit
ORDER BY log_visit.idsite DESC,
         log_visit.visit_last_action_time DESC
LIMIT 0,
      12
@sgiehl commented on July 6th 2020 Member

@tsteur I've done some adjustments and removed the usage of LogAggregator in that case. Also added some tests. Maybe you could have another look and check if there might be some cases that might not work anymore...

@tsteur commented on July 7th 2020 Member

@sgiehl

it generated for me a query like this one:

/* idSegments = [61] */

            SELECT
                log_visit.*
            FROM
                matomo_log_visit AS log_visit
            WHERE
                ( log_visit.idsite in (?) 
                AND log_visit.visit_last_action_time >= ? )
                AND
                ( ( log_visit.idvisit NOT IN (/* idSegments = [56] */

            SELECT
                log_inner.idvisit
            FROM

        (

            SELECT
                log_visit.idvisit
            FROM
                matomo_log_visit AS log_visit LEFT JOIN matomo_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
            WHERE
                ( log_visit.idsite IN (?) AND log_visit.visit_last_action_time >= ? )
                AND
                ( ( log_link_visit_action.idaction_url IN (SELECT idaction FROM matomo_log_action WHERE ( name LIKE CONCAT('%', ?, '%')  AND type = 1 )) ) )
            GROUP BY
                log_visit.idvisit
            ORDER BY
                NULL
        ) AS log_inner) ) )
            GROUP BY
                log_visit.idvisit
            ORDER BY
                log_visit.idsite DESC, log_visit.visit_last_action_time DESC LIMIT 0, 12

I reckon we could remove one of the inner subquery similar to how we do it in the LogAggregator:

            $logQueryBuilder = StaticContainer::get('Piwik\DataAccess\LogQueryBuilder');
            $forceGroupByBackup = $logQueryBuilder->getForcedInnerGroupBySubselect();
            $logQueryBuilder->forceInnerGroupBySubselect(LogQueryBuilder::FORCE_INNER_GROUP_BY_NO_SUBSELECT);
            $query = $segmentObj->getSelectQuery($select, $from, $where, $bind); // could also catch this
            $logQueryBuilder->forceInnerGroupBySubselect($forceGroupByBackup);

This should get rid of one subquery.

Otherwise looks good to me so far @sgiehl

@sgiehl commented on July 7th 2020 Member

@tsteur updated the code & tests. Should now be ready for a final review

@tsteur commented on July 8th 2020 Member

Looks generally good.

Is the startDate/endDate maybe missing from core/Archive/ArchiveQueryFactory.php::getSegmentFromQueryParam?

Or can we maybe not return true for $this->doesSegmentNeedSubquery() when no startDate/endDate is given? I just applied a segment and it generated this where it would basically result in a full table scan if this was actually executed:

image

@sgiehl commented on July 8th 2020 Member

Is the startDate/endDate maybe missing from core/Archive/ArchiveQueryFactory.php::getSegmentFromQueryParam?

Actually when looking through the code for me it seems the segment query shouldn't be used when loading an Archive. But can set those parameters to be sure

@tsteur commented on July 9th 2020 Member

@sgiehl can we still add $this->doesSegmentNeedSubquery to return true only when a start and end date is set? I wan't to prevent by accident it doing a full table scan which could be basically the death for any DB. Or alternatively we need to require a start and end date to be always set.

@sgiehl commented on July 9th 2020 Member

@tsteur I've adjusted the code, so it will no longer use a subquery if no start date is given. If that is the case and a segment query is built, that would require a subquery a warning will be logged.
Maybe we could remove the logging after beta cycle or with a later release, but it might be helpful to check if we have any segment class usages that would require setting a date...

This Pull Request was closed on July 9th 2020
Powered by GitHub Issue Mirror