Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix behaviour for segments using "not equals" or "not contains" on an action dimension #16172

Merged
merged 19 commits into from Jul 9, 2020

Conversation

sgiehl
Copy link
Member

@sgiehl sgiehl commented Jul 2, 2020

This PR aims to fix the behaviour for segments using "not equals" or "not contains" on an action dimension as described in #13785.

@tsteur I tried to implement it like you suggested in #13785 (comment), but that actually caused a lot problems with segments that contain a union of segments.

When a segment is initialized the union of segments would by default be splitted up into the union of segments. But those segment union needs to be in a single subquery as it would have an incorrect results otherwise. For now I've put the code to generate that subquery into the segment initialize, which requires the period to be available. Not an optimal solution as a segment should not require a period to be initialized, but guess we would need to refactor the whole usage of the Segment class otherwise, as we need to do that stuff while parsing, but the parsing is also needed for methods that are independent from the period...

At least it seems to work the way it's currently implemented. A segment like actionUrl!@piwik;browserName!@Chrome generates a query for the visitor log like:

/* idSegments = [5] */
SELECT log_visit.*
FROM matomo_log_visit AS log_visit
WHERE (log_visit.idsite IN (?)
       AND log_visit.visit_last_action_time >= ?
       AND log_visit.visit_last_action_time <= ?)
  AND ((log_visit.idvisit NOT IN
          (SELECT /* 2020-06-29,2020-07-01 sites 1 segmenthash 4e6fb538924981bfe0f0bd7e9440df3b */ log_inner.idvisit
           FROM
             (SELECT log_visit.idvisit
              FROM matomo_log_visit AS log_visit
              LEFT JOIN matomo_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
              WHERE (log_visit.visit_last_action_time >= ?
                     AND log_visit.visit_last_action_time <= ?
                     AND log_visit.idsite IN (?))
                AND (((log_link_visit_action.idaction_url IN
                         (SELECT idaction
                          FROM matomo_log_action
                          WHERE (name LIKE CONCAT('%', ?, '%')
                                 AND TYPE = 1)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 3)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 2)))
                      OR (log_link_visit_action.idaction_url IN
                            (SELECT idaction
                             FROM matomo_log_action
                             WHERE (name LIKE CONCAT('%', ?, '%')
                                    AND TYPE = 10)))))
              GROUP BY log_visit.idvisit
              ORDER BY NULL) AS log_inner
           GROUP BY log_inner.idvisit))
       AND (log_visit.config_browser_name IS NULL
            OR log_visit.config_browser_name NOT LIKE ?))
GROUP BY log_visit.idvisit
ORDER BY log_visit.idsite DESC,
         log_visit.visit_last_action_time DESC
LIMIT 0,
      12

@sgiehl sgiehl added the Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. label Jul 2, 2020
@sgiehl sgiehl added this to the 4.0.0 milestone Jul 2, 2020
@sgiehl sgiehl requested a review from tsteur July 2, 2020 15:52
@@ -466,7 +466,7 @@ public function makeLogVisitsQueryString($idSite, $startDate, $endDate, $segment
$filterSortOrder = 'DESC';
}

$segment = new Segment($segment, $idSite);
$segment = new Segment($segment, $idSite, Period\Factory::build('day', implode(',', array_filter([$startDate,$endDate]))));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: That actually does not seem to create the correct period, as the bound parameter for the query when choosing the date range 2020-06-15,2020-06-21 are this:

1 = "2020-06-14 22:00:00"
2 = "2020-06-20 21:59:59"
3 = "2020-06-13 22:00:00"
4 = "2020-06-20 21:59:59"

So the inner query actually has other date values. Need to check why...

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiehl I reckon this goes in the right direction looking good so far 👍

core/Segment.php Outdated Show resolved Hide resolved
core/Segment.php Outdated Show resolved Hide resolved
core/Segment.php Outdated Show resolved Hide resolved
core/Segment.php Outdated Show resolved Hide resolved
core/Segment.php Show resolved Hide resolved
@sgiehl sgiehl force-pushed the notcontainsegments branch 2 times, most recently from e7b545f to b3ad5d9 Compare July 6, 2020 14:52
@sgiehl sgiehl changed the title [POC] Fix behaviour for segments using "not equals" or "not contains" on an action dimension Fix behaviour for segments using "not equals" or "not contains" on an action dimension Jul 6, 2020
@sgiehl
Copy link
Member Author

sgiehl commented Jul 6, 2020

@tsteur I've done some adjustments and removed the usage of LogAggregator in that case. Also added some tests. Maybe you could have another look and check if there might be some cases that might not work anymore...

@tsteur
Copy link
Member

tsteur commented Jul 7, 2020

@sgiehl

it generated for me a query like this one:

/* idSegments = [61] */

			SELECT
				log_visit.*
			FROM
				matomo_log_visit AS log_visit
			WHERE
				( log_visit.idsite in (?) 
				AND log_visit.visit_last_action_time >= ? )
                AND
                ( ( log_visit.idvisit NOT IN (/* idSegments = [56] */

			SELECT
				log_inner.idvisit
			FROM
				
        (
            
			SELECT
				log_visit.idvisit
			FROM
				matomo_log_visit AS log_visit LEFT JOIN matomo_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
			WHERE
				( log_visit.idsite IN (?) AND log_visit.visit_last_action_time >= ? )
                AND
                ( ( log_link_visit_action.idaction_url IN (SELECT idaction FROM matomo_log_action WHERE ( name LIKE CONCAT('%', ?, '%')  AND type = 1 )) ) )
			GROUP BY
				log_visit.idvisit
			ORDER BY
				NULL
        ) AS log_inner) ) )
			GROUP BY
				log_visit.idvisit
			ORDER BY
				log_visit.idsite DESC, log_visit.visit_last_action_time DESC LIMIT 0, 12

I reckon we could remove one of the inner subquery similar to how we do it in the LogAggregator:

            $logQueryBuilder = StaticContainer::get('Piwik\DataAccess\LogQueryBuilder');
            $forceGroupByBackup = $logQueryBuilder->getForcedInnerGroupBySubselect();
            $logQueryBuilder->forceInnerGroupBySubselect(LogQueryBuilder::FORCE_INNER_GROUP_BY_NO_SUBSELECT);
            $query = $segmentObj->getSelectQuery($select, $from, $where, $bind); // could also catch this
            $logQueryBuilder->forceInnerGroupBySubselect($forceGroupByBackup);

This should get rid of one subquery.

Otherwise looks good to me so far @sgiehl

@sgiehl sgiehl added Needs Review PRs that need a code review and removed Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. labels Jul 7, 2020
@sgiehl
Copy link
Member Author

sgiehl commented Jul 7, 2020

@tsteur updated the code & tests. Should now be ready for a final review

@sgiehl sgiehl marked this pull request as ready for review July 7, 2020 09:17
@tsteur
Copy link
Member

tsteur commented Jul 8, 2020

Looks generally good.

Is the startDate/endDate maybe missing from core/Archive/ArchiveQueryFactory.php::getSegmentFromQueryParam?

Or can we maybe not return true for $this->doesSegmentNeedSubquery() when no startDate/endDate is given? I just applied a segment and it generated this where it would basically result in a full table scan if this was actually executed:

image

@sgiehl
Copy link
Member Author

sgiehl commented Jul 8, 2020

Is the startDate/endDate maybe missing from core/Archive/ArchiveQueryFactory.php::getSegmentFromQueryParam?

Actually when looking through the code for me it seems the segment query shouldn't be used when loading an Archive. But can set those parameters to be sure

@tsteur
Copy link
Member

tsteur commented Jul 9, 2020

@sgiehl can we still add $this->doesSegmentNeedSubquery to return true only when a start and end date is set? I wan't to prevent by accident it doing a full table scan which could be basically the death for any DB. Or alternatively we need to require a start and end date to be always set.

@sgiehl
Copy link
Member Author

sgiehl commented Jul 9, 2020

@tsteur I've adjusted the code, so it will no longer use a subquery if no start date is given. If that is the case and a segment query is built, that would require a subquery a warning will be logged.
Maybe we could remove the logging after beta cycle or with a later release, but it might be helpful to check if we have any segment class usages that would require setting a date...


if ($requiresSubQuery && empty($this->startDate)) {
$e = new Exception();
Log::warning("Avoiding segment subquery due to missing start date. Please ensure a start date is set when initializing a segment if it's used to build a query. Stacktrace:\n" . $e->getTraceAsString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean when start and end date is empty?

created follow up issue so we don't forget to act on it #16200

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry just realise shouldn't have merged this yet as it needs to also check if endDate is empty @sgiehl ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created #16201

@tsteur tsteur merged commit d72b8bf into 4.x-dev Jul 9, 2020
@tsteur tsteur deleted the notcontainsegments branch July 9, 2020 22:41
tsteur added a commit that referenced this pull request Jul 10, 2020
sgiehl added a commit that referenced this pull request Jul 10, 2020
…set (#16201)

* Follow up change only triggering error when no start and end date is set

refs #16172

fyi @sgiehl does this make sense?

* update comment

Co-authored-by: Stefan Giehl <stefan@matomo.org>
@mattab mattab added the Bug For errors / faults / flaws / inconsistencies etc. label Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. Needs Review PRs that need a code review
Projects
None yet
3 participants