Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segments using "Event category is not X" or "Page URL does not contain" etc. will list visits with these actions in the Visits log #13785

Closed
NanorPiwik opened this issue Dec 1, 2018 · 20 comments · Fixed by #16172
Assignees
Labels
Bug For errors / faults / flaws / inconsistencies etc.
Milestone

Comments

@NanorPiwik
Copy link

Hi guys,
I am using Matomo 3.7.0
when I create a segment such as, Event Category is not 'my-value' the visitor log report keep showing me visits which include this event.
Is that a known behavior?

@tsteur tsteur added the Bug For errors / faults / flaws / inconsistencies etc. label Dec 1, 2018
@tsteur
Copy link
Member

tsteur commented Dec 1, 2018

this is because of d89a08b#diff-3a1e3a40187c4a2123a617b97506e102R183
which was supposed to fix #3932

No idea why we would match null values when using not equals and not contains in https://github.com/matomo-org/matomo/blob/3.8.0-b2/core/Segment/SegmentExpression.php#L233

maybe @mattab remembers

@mattab
Copy link
Member

mattab commented Dec 3, 2018

No idea why we would match null values when using not equals and not contains in https://github.com/matomo-org/matomo/blob/3.8.0-b2/core/Segment/SegmentExpression.php#L233

@tsteur I think the idea is:

  • for not equals eg. "City is not equal to London" then we expect to return all visits that had another city set, or those that didn't have a city set/geo-located.
  • for not contains eg. "Keyword does not contain X" should return entries that had any keyword set, and also no keyword set.

@tsteur
Copy link
Member

tsteur commented Dec 3, 2018

I see... didn't even know MySQL wouldn't return those results... just double tested it locally with a null value

select config_device_type from piwik_log_visit where idvisit = 5281842 and config_device_type <> 'foo' where config_device_type is null... the entry was not returned...

not sure how to fix that issue then . I reckon #11900 might fix it, but hard to tell...

@mattab mattab added this to the 3.10.0 milestone Dec 11, 2018
@tsteur
Copy link
Member

tsteur commented Jul 14, 2019

Moving this to Matomo 4 as #11900 is in this milestone as well

@tsteur tsteur modified the milestones: 3.11.0, 4.0.0 Jul 14, 2019
@mattab
Copy link
Member

mattab commented Oct 22, 2019

fyi: was hoping this was maybe fixed along #11900 but still seeing some visits that have events that should be excluded in the segment. For example at this url in this instance there are visits with the MediaVideo event category still.

@tsteur
Copy link
Member

tsteur commented Oct 22, 2019

Visitor Log is a totally different story... and this issue in particular anyway see previous comments

@mattab mattab changed the title Segments are not working when applied to events Segments using "Event category is not X" lists visits with event X in Visits log Oct 22, 2019
@mattab mattab changed the title Segments using "Event category is not X" lists visits with event X in Visits log Segments using "Event category is not X" or "Page URL does not contain" etc. will list visits with these actions in the Visits log Mar 19, 2020
@mattab
Copy link
Member

mattab commented Mar 19, 2020

Same issue with "Page URL does not contain" was reported in #15708

image

@mattab
Copy link
Member

mattab commented Mar 23, 2020

maybe a solution to this issue could be to introduce a new Segment "All Page URLs in the visit". As opposed to just "Page URL" this new segment would ensure that all pages in the visit match the condition. for example:

  • "All Page URLs in the visit" is not https://host/page -> would return only visits where none of the pages urls are https://host/page
  • "All Page URLs in the visit" does not contain STRING -> would return only visits where none of the pages urls contain STRING

Wondering if this would be feasible technically and how much effort that would be? cc @tsteur

Edit: one challenge is that we'd need to duplicate all action segments to have their "All .... in the visit" segment. Would be a lot of new segments.

@tsteur
Copy link
Member

tsteur commented Mar 24, 2020

I reckon this be more work and even a more hacky solution cause somewhere in the segmentation code it would require some IF segmentName = XYZ then change behaviour of does not contain. We need to instead change the way the does not contain filter works in general otherwise you have the issue for other segments as well and things get very messy and unpredictable and untestable.

@sgiehl
Copy link
Member

sgiehl commented Apr 22, 2020

@tsteur @mattab how to proceed with this issue? What exactly should be changed?
Btw. is it still important for first Matomo 4 RC or should it be postponed?

@tsteur
Copy link
Member

tsteur commented Apr 22, 2020

I reckon it be good to fix this as part of Matomo 4 as early as possible just in case it needs some bigger changes eg how tables are joined etc. This allows us to have this feature tested from the beginning of the first beta and increases the chances to notice regressions.

@tsteur
Copy link
Member

tsteur commented Apr 22, 2020

AFAIK the goal be as shown in #13785 (comment)

image

The visit isn't expected to be shown. Whether we'll merge it in the end depends on the fix whether it causes big performance regressions etc.

@sgiehl
Copy link
Member

sgiehl commented Jun 3, 2020

Had some thoughts on that issue and tried to check which parts of the code would need to be adjusted. I guess that might be a tough one.
Querying the visitorlog for a segment like pageurl*=diving would currently result in a query like this:

SELECT log_visit.*
FROM matomo_log_visit AS log_visit
LEFT JOIN matomo_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
WHERE (log_visit.idsite IN (?) AND log_visit.visit_last_action_time >= ? AND log_visit.visit_last_action_time <= ?) AND ((
(log_link_visit_action.idaction_url NOT IN ( 
SELECT idaction
FROM matomo_log_action
WHERE (name NOT LIKE CONCAT('%', ?, '%') AND TYPE = 1))) OR 
(log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name NOT LIKE CONCAT('%', ?, '%') AND TYPE = 3))) OR 
(log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name NOT LIKE CONCAT('%', ?, '%') AND TYPE = 2))) OR 
(log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name NOT LIKE CONCAT('%', ?, '%') AND TYPE = 10)))
))
GROUP BY log_visit.idvisit
ORDER BY log_visit.idsite DESC, log_visit.visit_last_action_time DESC
LIMIT 0, 12

The only way I can think of how to adjust the query so it does not return any visits that had an action that contained that url is to kind of invert the query. That means fetching the visitid that had an action with that url and use that result to fetch the visits that are not included in that list.
So some query like this:

SELECT log_visit.*
FROM matomo_log_visit AS log_visit
WHERE (log_visit.idsite IN (?) AND log_visit.visit_last_action_time >= ? AND log_visit.visit_last_action_time <= ?) AND 
(log_visit.idvisit NOT IN (
SELECT log_link_visit_action.idvisit
FROM matomo_log_link_visit_action AS log_link_visit_action
WHERE 
(log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 1))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 3))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 2))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 10)))
))
ORDER BY log_visit.idsite DESC, log_visit.visit_last_action_time DESC
LIMIT 0, 12

But guess it would be hard to adjust the code so it does that query instead and the subquery for all visits that contains something might have a lot results, which might be a performance issue.

Or does anyone see another possibility to adjust the query so it has the result we want to have?

@tsteur
Copy link
Member

tsteur commented Jun 4, 2020

Can't think of any other solution. I can see how that would be very difficult to implement plus how it may be a performance issue. Personally, would maybe simply keep existing behaviour as it's not necessarily wrong but of course also not ideal.

Also I reckon this is maybe not something that has to be done in Matomo 4 maybe as it's not breaking API.

Note that any change here would also apply to regular reports I suppose and not just visits log. So it will also make these reports slower.

Any thoughts @mattab ?

@mattab
Copy link
Member

mattab commented Jun 12, 2020

if i understand correctly it would make the segments that use "Does not contain" slower, but would not affect other reports or segments. because currently it does not work as expected which creates confusion and also frustration (when you actually need to select visits that didn't view a particular event), it would be great to fix it soon-ish. but clearly we don't have to do it in 4.0 though and it would be fine to do it in 4.x but if possible, it would be valuable to fix this issue this year..

@sgiehl
Copy link
Member

sgiehl commented Jun 16, 2020

Ok. Shall we move the issue to 4.1 or 4.2 then?

@tsteur
Copy link
Member

tsteur commented Jun 16, 2020

Sure can move it to 4.1.

Although I'm wondering now if we do it anyway we might want to do this in 4.0 since this might have implication on some core logic that could maybe affect plugins, not sure. At least there will be a longer beta/RC period so better chance to find errors because of this change and what it impacts and there would be some time to revert this change again if needed etc. Let's do it then in 4.0

@mattab
Copy link
Member

mattab commented Jun 26, 2020

Yes, agreed that it would be better and safer to do it in 4.0. Otherwise It would be a bit risky/stressful to push it later without a long beta cycle

@sgiehl sgiehl self-assigned this Jun 30, 2020
@sgiehl
Copy link
Member

sgiehl commented Jun 30, 2020

I had some more thoughts on that now, but actually I'm not sure how we could solve that without a big performance decrease.

I still have no idea how we can solve that with a nice query. So there is only the query I mentioned above:

SELECT log_visit.*
FROM matomo_log_visit AS log_visit
WHERE (log_visit.idsite IN (?) AND log_visit.visit_last_action_time >= ? AND log_visit.visit_last_action_time <= ?) AND 
(log_visit.idvisit NOT IN (
SELECT log_link_visit_action.idvisit
FROM matomo_log_link_visit_action AS log_link_visit_action
WHERE 
(log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 1))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 3))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 2))) OR (log_link_visit_action.idaction_url IN (
SELECT idaction
FROM matomo_log_action
WHERE (name LIKE CONCAT('%', ?, '%') AND TYPE = 10)))
))
ORDER BY log_visit.idsite DESC, log_visit.visit_last_action_time DESC
LIMIT 0, 12

But with the way we are currently building queries it won't be possible to build a query like that. We can't simply concatenate all single segment query expressions like it is currently done. We kind of need to combine them by table so we are able to only have one subquery as above. Otherwise we might would have multiple subqueries with each action type.

But even with only one subquery that query might become very slow, as the subquery might have too many results (it might even select all idvisit entries from log_link_visit_action). In addition such a query might explode a bit if someone combines multiple segments in that way.

Also I guess such subqueries shouldn't be implemented for all cases, but only for those where it's needed. But that will actually make the code complexity for that even worse.

@tsteur @diosmosis Has anyone a idea how we could solve that a better way? Otherwise I'll give it a try and will start implementing a POC tomorrow.

@tsteur
Copy link
Member

tsteur commented Jun 30, 2020

I guess the subquery or multiple subqueries (if multiple NOT IN are used) is not avoidable and it will make visits log and archiving slow.

The idea be to somewhere use

        $segment = '// part of the defined segment that should be matched in NOT IN eg eventCategory==="my-value"';
        $segment = new Segment($segment, array($idSite));
        $params = new ArchiveProcessor\Parameters($site, $period, $segment);
        $logAggregator = new LogAggregator($params);
        $select = 'log_visit.idvisit';
        $from = 'log_visit';
        $where = $logAggregator->getWhereStatement('log_visit', 'visit_last_action_time');
        $query = $logAggregator->generateQuery($select, $from, $where, 'log_visit.idvisit', '');

This gives us the inner query... eg if the segment is eventCategory does not contain "myValue" then we'd need to invert the segment expression to "event category contains "my-value".

Then the actual segment expression needs to be rewritten to use AND log_visit.idvisit NOT IN ('..').

From a first look this can be probably done in LogQueryBuilder::getSelectQueryString() when the segmentExpression is not empty where we already merge in the bind etc. Although the problem is likely that we won't have the idSite and date limits so this will be the challenge... In segment expression we don't have any of the archive parameters either. We'd need to refactor to have the archive parameters with idSite, startDate, endDate anywhere where we could maybe rewrite this query. Maybe this is what you meant? Currently the subquery you mentioned would do a full table scan if I see this right.

I reckon the best place to make this work might be Segment::getSelectQuery() which is used by LogAggregator::generateQuery() and the visits log. There we have idSites, startDate, endDate, etc. We'd basically need to pass this info of idSite, startDate, endDate into SegmentExpression where we can then change the segment definition for NOT IN as mentioned above within SegmentExpression. It will likely break/change the getSelectQuery so it's good to have this in Matomo 4 since we're using this in plugins.

Not sure if that helps?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc.
Projects
None yet
4 participants