Referrer Websites report should group same label (HTTP vs. HTTPS) #13882

Closed
GreenReaper opened this issue Dec 19, 2018 · 7 comments · Fixed by #14358

@GreenReaper

GreenReaper commented Dec 19, 2018

While reviewing visits from Tumblr, I expanded referrals and clicked "open row evolution" on the first sub-row:

[Screenshot: tumblrevolution]

To my surprise it showed a very sparse graph with only a few views each day. I found that whichever of the two rows I clicked evolution on, it was giving me the data for the second row. Hovering over the 'index' links to the left, I found that the first row was for referrals from the HTTPS url, and the second for HTTP.

The parameters encoded within the graph's request URL are:

date=today
apiMethod=Referrers.getWebsites
label=t.umblr.com+>+@Index
disableLink=1
module=CoreHome
action=getRowEvolutionPopover
colors={"backgroundColor":"#ffffff","lineColor":"#162c4a","minPointColor":"#ff7f7f","maxPointColor":"#75bf7c","lastPointColor":"#55aaff","fillColor":"#ffffff"}
flat=0
idSite=1
period=day

These requests are the same if I click on either row evolution icon - which makes sense, the path just says "index". But when processing the request, it picked the less-used option - not the option I selected. If it's going to separate them by scheme/URL, duplicating the label, it should use scheme/URL for the evolution.
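To illustrate why an identical request for both rows cannot recover the row that was clicked, here is a minimal Python sketch. It is not Matomo's actual lookup code: the row structure, the visit counts, and the `find_by_label` helper are all hypothetical, and the "last row wins" behaviour is only an assumption that happens to match what was observed.

```python
# Hypothetical rows: both are displayed with the label "index" under t.umblr.com.
# The visit counts are made up for illustration.
rows = [
    {"label": "index", "url": "https://t.umblr.com/", "visits": 40},
    {"label": "index", "url": "http://t.umblr.com/", "visits": 3},
]


def find_by_label(rows, label):
    """Look a row up by label only, as a label-based request has to."""
    by_label = {}
    for row in rows:
        # A later row with the same label silently replaces the earlier one.
        by_label[row["label"]] = row
    return by_label.get(label)


selected = find_by_label(rows, "index")
print(selected["url"])  # http://t.umblr.com/ regardless of which row was clicked
```

Whichever row the user picks, the label alone carries no scheme, so the lookup can only ever return one of the two rows.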

"Metrics for Website" on the popup itself showed the website domain name as well as the path, but not the access scheme. Perhaps if it had shown "http:" or "https:" as well, this issue would have been more visible.

There seems to be plenty of information within the row's attributes to make this distinction, e.g.:
<tr data-segment-filter="referrerUrl==https%3A%2F%2Ft.umblr.com%2F" data-url-label="https://t.umblr.com/" data-row-metadata="{&quot;segment&quot;:&quot;referrerUrl==https%3A%2F%2Ft.umblr.com%2F&quot;,&quot;url&quot;:&quot;https:\/\/t.umblr.com\/&quot;}" class=" ">
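As a rough sketch of how the scheme could be recovered from that attribute, the Python below decodes the `data-segment-filter` value shown above. The `scheme_from_segment` helper is hypothetical and only uses the standard library; it is not part of Matomo.

```python
from urllib.parse import unquote, urlsplit


def scheme_from_segment(segment_filter):
    """Extract the URL scheme from a 'referrerUrl==<encoded url>' segment string."""
    prefix = "referrerUrl=="
    if not segment_filter.startswith(prefix):
        raise ValueError("not a referrerUrl segment: " + segment_filter)
    url = unquote(segment_filter[len(prefix):])  # percent-decode the URL part
    return urlsplit(url).scheme


scheme = scheme_from_segment("referrerUrl==https%3A%2F%2Ft.umblr.com%2F")
print(scheme)  # https
```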

In this particular case it doesn't matter because I can just use the top-level graph which does work as it combines views from all paths (and here, there is only one path). However, in the general case it is a problem:

multipleurl - HTTP or HTTPS? Can't tell at a glance. Row evolution for both picks the second.

I'm using Matomo 3.8.0-b5 on nginx with PHP 7.3 and MariaDB 10.1.37 on Debian Stretch.


Even more generally (maybe this has an issue already): HTTPS and HTTP requests are not combined. While technically these may be completely different sites, in practice they're almost certainly not and would be more useful together. One of our main referrers is, by default, HTTP, but may be configured to force a user's requests to HTTPS. So some visitors from it will be using HTTP and others HTTPS, resulting in two entries.

This makes it difficult to track unique requests from that site (you have to either open them twice or try to remember which complicated numerical URLs you've seen that day) and also decreases the number of useful samples in the top-50 AJAX report available via the dashboard.

Perhaps there should be an option - and I'd argue that it should be the default - to use 'group by' or similar in the database requests to merge entries which will end up with the same label. Of course the same option would have to be used throughout where requests are made, like row evolution. Or the merger could be done in PHP but this would be less efficient and I imagine it wouldn't end up with 50 items in top-50 etc.
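The merge I have in mind could look roughly like the Python sketch below. It only illustrates the idea of grouping by the label that ends up displayed; the `label_for` and `group_by_label` helpers, the example rows, and the visit counts are hypothetical, and the real change would live in Matomo's PHP or SQL.

```python
from urllib.parse import urlsplit


def label_for(url):
    """Build a scheme-less label (host + path), so HTTP and HTTPS collapse together."""
    parts = urlsplit(url)
    return parts.netloc + parts.path


def group_by_label(rows):
    """Sum visits for all rows that end up with the same label."""
    merged = {}
    for row in rows:
        label = label_for(row["url"])
        merged[label] = merged.get(label, 0) + row["visits"]
    return merged


# Hypothetical input rows; the counts are made up.
rows = [
    {"url": "https://t.umblr.com/", "visits": 40},
    {"url": "http://t.umblr.com/", "visits": 3},
    {"url": "https://example.org/a", "visits": 5},
]
merged = group_by_label(rows)
print(merged)  # {'t.umblr.com/': 43, 'example.org/a': 5}
```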

tsteur added a commit that referenced this issue Dec 19, 2018
I just debugged and can confirm behaviour described in #13882 
The label includes a URL like `http://demo.matomo.org/foo/bar` and the label is then removed in the `ColumnCallbackReplace` filter meaning there could be several entries with the same label and a group by is needed.
@tsteur
Member

tsteur commented Dec 19, 2018

Thanks for that 👍

Any chance you could try this change? https://github.com/matomo-org/matomo/pull/13884/files

I'll also change the title so it's more clear for us what to do.

@tsteur tsteur added this to the 3.10.0 milestone Dec 19, 2018
@tsteur tsteur added the Bug For errors / faults / flaws / inconsistencies etc. label Dec 19, 2018
@tsteur tsteur changed the title Row evolution graph can pick wrong data for popover (HTTP vs. HTTPS) Referrer Websites report should group same label (HTTP vs. HTTPS) Dec 19, 2018
@GreenReaper
Author

GreenReaper commented Dec 19, 2018

The change does appear to group the graph. So the line graph now shows more visits than either.
If you click the row evolution it will then show, for example, 7 unique visitors (summing 4 and 3).
So far all well and good. However...

It has a more interesting impact on the referrer path list. Entries are not always grouped. So for example:

  • the 2nd referrer is: https://www.furaffinity.net/view/29795460/ and has 4 unique visitors
  • the 10th referrer is: http://www.furaffinity.net/view/29795460/ and has 3 unique visitors

There are eight referrer rows on the first page which shows as "1-10". I am guessing that on that page, there were two pairs that were merged. So merging between pairs is only done if they are on the same page.

Seemingly the merge has taken place after the "1-10" list is determined. In SQL it would be doing something like (select ... offset 0 limit 10) group by 'label'. To get consistent results, what needs to happen is for the group by to be within the subquery, not in the display filter outside it. That way it will always merge mergeable rows and always show the right number for each page of the results list.
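The difference between the two orderings can be sketched in Python as follows. This is an illustration only, not the code in the PR: the `group` and `page` helpers are hypothetical, the page size is shrunk to 2, and the two `example` rows are invented. The furaffinity.net rows and their counts are the ones listed above, and visitors are summed as the row evolution does.

```python
from urllib.parse import urlsplit


def label_for(url):
    """Scheme-less label: host plus path."""
    parts = urlsplit(url)
    return parts.netloc + parts.path


def group(rows):
    """Merge rows sharing a label, then sort by visitors, highest first."""
    totals = {}
    for url, visitors in rows:
        label = label_for(url)
        totals[label] = totals.get(label, 0) + visitors
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def page(items, offset, limit):
    """Return one page of results."""
    return items[offset:offset + limit]


rows = [
    ("https://www.furaffinity.net/view/29795460/", 4),
    ("https://b.example/y", 3),  # invented row
    ("http://www.furaffinity.net/view/29795460/", 3),
    ("https://c.example/z", 1),  # invented row
]

page_then_group = group(page(rows, 0, 2))  # merge only sees the current page
group_then_page = page(group(rows), 0, 2)  # merge sees every row first

print(page_then_group)  # furaffinity entry shows 4; its HTTP twin stays on page 2
print(group_then_page)  # furaffinity entry shows 7
```

Paging first leaves the HTTP row stranded on the next page, while grouping first gives a full page with the combined count.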

There is also the question of whether https: or http: is preferred for the link on the label. I'd tend towards https: but that might misrepresent how many people visit with https and overall it doesn't matter a huge amount - it might be inefficient to indicate a preference, especially if it has to count which has the most hits.

@tsteur
Member

tsteur commented Dec 19, 2018

Cheers makes sense 👍 Could you do me a favour and test the updated PR? https://github.com/matomo-org/matomo/pull/13884/files

@GreenReaper
Author

GreenReaper commented Dec 19, 2018

The updated PR does not change the above situation. Here is what I am seeing, in case it helps:

[Screenshots: views, page1, page2]

I'd expect there to be ten entries on the first list page, not eight, and for the unique visitors for the single entry for that path to be 7. The marked path should not be on the second page because it has been grouped into the entry on the first page.

This might be computationally expensive since it has to perform the group on the entire list, rather than just ten selected entries within it, but I'm not sure if it would be significantly more than it already is.

@GreenReaper
Author

Oh, never mind, I messed up and left the last filter as queueFilter. 😅
I think it is working properly now, I see a 7, but I will test a bit more...

@GreenReaper
Author

GreenReaper commented Dec 19, 2018

So, it does appear to be grouping! In fact I can tell the actual number in the table has been impacted because in this case the dashboard view for this website is showing 1-41 rather than 1-50.

This is very helpful for me - reducing the duplicates makes that list far more useful.

I should caution that this might not be the desired behaviour if someone particularly wanted to tell whether visitors referred via https vs. http had some difference in actions/average time/bounce rate, etc. However I'm struggling to think of a case where this would be as useful as having them grouped.

Similarly I'm not sure that the proportion of https vs. http is represented in any way - albeit maybe it impacts the chance for it to be selected as the link URL? I figure that'd be up to MySQL in such a situation, the result not being deterministic (the URL is equivalent to 'B' in their example); but if more referrers are HTTPS, it is more likely to be the one picked to represent the group, and so end up as the link for the label.

This GROUP BY usage is non-portable to other DBMS because of this non-determinism as covered in #5124.
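One way to avoid leaving the choice to the database would be an explicit rule for the representative URL. The Python sketch below is only one possible rule, not what the PR does: it picks the URL with the most visitors, prefers https on a tie, and falls back to string order so the result is always the same. The `representative_url` helper is hypothetical.

```python
def representative_url(candidates):
    """Pick the link URL for a merged row from (url, visitors) pairs.

    Ordering: most visitors first, then https over http, then string order
    as a final tie-breaker so the choice is fully deterministic.
    """
    best = max(
        candidates,
        key=lambda c: (c[1], c[0].startswith("https://"), c[0]),
    )
    return best[0]


chosen = representative_url([
    ("http://www.furaffinity.net/view/29795460/", 3),
    ("https://www.furaffinity.net/view/29795460/", 4),
])
print(chosen)  # https://www.furaffinity.net/view/29795460/
```

As noted earlier, counting which scheme has the most hits has a cost, so a simpler rule (always prefer https, for instance) may be good enough.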

@tsteur
Member

tsteur commented Dec 19, 2018

Same, I think it should definitely be grouped 👍 If needed, one could possibly work around this with segments.
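For anyone who does need the HTTP/HTTPS split, a segment per scheme could be built along these lines. This is a minimal Python sketch that reproduces the `referrerUrl==...` format seen in the row's `data-segment-filter` attribute earlier in the thread; the `referrer_url_segment` helper is hypothetical.

```python
from urllib.parse import quote


def referrer_url_segment(url):
    """Build a 'referrerUrl==<percent-encoded url>' segment string."""
    return "referrerUrl==" + quote(url, safe="")


https_segment = referrer_url_segment("https://t.umblr.com/")
http_segment = referrer_url_segment("http://t.umblr.com/")
print(https_segment)  # referrerUrl==https%3A%2F%2Ft.umblr.com%2F
print(http_segment)   # referrerUrl==http%3A%2F%2Ft.umblr.com%2F
```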

@mattab mattab modified the milestones: 3.11.0, 3.10.0 Mar 19, 2019
@katebutler katebutler self-assigned this Mar 27, 2019
diosmosis pushed a commit that referenced this issue May 20, 2019
* Make sure to group the label in Referrer Websites report

I just debugged and can confirm behaviour described in #13882 
The label includes a URL like `http://demo.matomo.org/foo/bar` and the label is then removed in the `ColumnCallbackReplace` filter meaning there could be several entries with the same label and a group by is needed.

* make sure to apply filter before paging

* Add test + slight optimization.