@GreenReaper opened this Issue on December 19th 2018

While reviewing visits from Tumblr, I expanded referrals and clicked "open row evolution" on the first sub-row:

[Screenshot: tumblrevolution]

To my surprise it showed a very sparse graph with only a few views each day. I found that whichever of the two rows I clicked evolution on, it was giving me the data for the second row. Hovering over the 'index' links to the left, I found that the first row was for referrals from the HTTPS url, and the second for HTTP.

The parameters encoded within the graph's request URL are:

date=today
apiMethod=Referrers.getWebsites
label=t.umblr.com+>+@index
disableLink=1
module=CoreHome
action=getRowEvolutionPopover
colors={"backgroundColor":"#ffffff","lineColor":"#162c4a","minPointColor":"#ff7f7f","maxPointColor":"#75bf7c","lastPointColor":"#55aaff","fillColor":"#ffffff"}
flat=0
idSite=1
period=day

These requests are the same if I click on either row evolution icon - which makes sense, the path just says "index". But when processing the request, it picked the less-used option - not the option I selected. If it's going to separate them by scheme/URL, duplicating the label, it should use scheme/URL for the evolution.
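A minimal sketch of why a label-only lookup cannot tell the two rows apart. This is not Matomo's actual lookup code: the row structure, the visit counts and both helper names are hypothetical and only illustrate the collision described above.

```python
# Hypothetical report rows: both sub-rows share the label "index"
# but differ in the scheme of the referrer URL. Visit counts are made up.
rows = [
    {"label": "index", "url": "https://t.umblr.com/", "visits": 40},
    {"label": "index", "url": "http://t.umblr.com/", "visits": 3},
]


def find_by_label(rows, label):
    """Look a row up by label only; a later duplicate overwrites an earlier one."""
    by_label = {}
    for row in rows:
        by_label[row["label"]] = row
    return by_label.get(label)


def find_by_url(rows, url):
    """Look a row up by its full referrer URL, scheme included."""
    return next((row for row in rows if row["url"] == url), None)


# Whichever icon was clicked, the request only carries the label.
picked = find_by_label(rows, "index")
print(picked["url"])  # http://t.umblr.com/

# Keying on the URL keeps the two rows distinct.
exact = find_by_url(rows, "https://t.umblr.com/")
print(exact["visits"])  # 40
```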

"Metrics for Website" on the popup itself showed the website domain name as well as the path, but not the access scheme. Perhaps if it had shown "http:" or "https:" as well, this issue would have been more visible.

There seems to be plenty of information within the row's markup to tell the two apart:

<tr data-segment-filter="referrerUrl==https%3A%2F%2Ft.umblr.com%2F" data-url-label="https://t.umblr.com/" data-row-metadata="{&quot;segment&quot;:&quot;referrerUrl==https%3A%2F%2Ft.umblr.com%2F&quot;,&quot;url&quot;:&quot;https:\/\/t.umblr.com\/&quot;}" class=" ">
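As a rough illustration, the scheme can be recovered from the attributes quoted above. The sketch below uses only the Python standard library and the `<tr>` markup exactly as shown; the class name `RowAttrs` is made up for this example, and nothing here reflects how Matomo itself reads the row.

```python
import json
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit

# The row markup as quoted in the report (raw string keeps the JSON backslashes).
ROW_HTML = r'''<tr data-segment-filter="referrerUrl==https%3A%2F%2Ft.umblr.com%2F" data-url-label="https://t.umblr.com/" data-row-metadata="{&quot;segment&quot;:&quot;referrerUrl==https%3A%2F%2Ft.umblr.com%2F&quot;,&quot;url&quot;:&quot;https:\/\/t.umblr.com\/&quot;}" class=" ">'''


class RowAttrs(HTMLParser):
    """Collect the attributes of the first <tr> start tag encountered."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and not self.attrs:
            # Attribute values arrive with entities such as &quot; already decoded.
            self.attrs = dict(attrs)


parser = RowAttrs()
parser.feed(ROW_HTML)
parser.close()

metadata = json.loads(parser.attrs["data-row-metadata"])
scheme = urlsplit(metadata["url"]).scheme
segment = unquote(metadata["segment"])

print(scheme)   # https
print(segment)  # referrerUrl==https://t.umblr.com/
```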

In this particular case it doesn't matter because I can just use the top-level graph which does work as it combines views from all paths (and here, there is only one path). However, in the general case it is a problem:

[Screenshot: multipleurl] HTTP or HTTPS? Can't tell at a glance. Row evolution for both picks the second.

I'm using Matomo 3.8.0-b5 on nginx with PHP 7.3 and MariaDB 10.1.37 on Debian Stretch.


Even more generally (maybe this has an issue already): HTTPS and HTTP requests are not combined. While technically these may be completely different sites, in practice they're almost certainly not and would be more useful together. One of our main referrers is, by default, HTTP, but may be configured to force a user's requests to HTTPS. So some visitors from it will be using HTTP and others HTTPS, resulting in two entries.

This makes it difficult to track unique requests from that site (you have to either open them twice or try to remember which complicated numerical URLs you've seen that day) and also decreases the number of useful samples in the top-50 AJAX report available via the dashboard.

Perhaps there should be an option - and I'd argue that it should be the default - to use 'group by' or similar in the database requests to merge entries which will end up with the same label. Of course the same option would have to be used throughout where requests are made, like row evolution. Or the merger could be done in PHP but this would be less efficient and I imagine it wouldn't end up with 50 items in top-50 etc.
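The merge being proposed could look roughly like the sketch below: strip the scheme to get a shared key, then sum the entries that end up with the same key. The helper names, the sample rows and their counts are all hypothetical, and this is plain Python rather than the database-side `GROUP BY` suggested above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def scheme_less(url):
    """Drop the scheme so http:// and https:// variants share one key."""
    parts = urlsplit(url)
    key = parts.netloc + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def merge_by_label(rows):
    """Sum visits for rows whose scheme-less URL is identical."""
    totals = defaultdict(int)
    for url, visits in rows:
        totals[scheme_less(url)] += visits
    # Highest total first; ties are broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample data.
rows = [
    ("https://t.umblr.com/", 40),
    ("http://t.umblr.com/", 3),
    ("https://example.org/page", 5),
]

merged = merge_by_label(rows)
print(merged)  # [('t.umblr.com/', 43), ('example.org/page', 5)]
```

Note that summing is only exact for additive metrics such as visits; a visitor who arrived once via each scheme would be counted twice in a summed unique-visitor figure.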

@tsteur commented on December 19th 2018 Member

Thanks for that 👍

Any chance you could try this change? https://github.com/matomo-org/matomo/pull/13884/files

I'll also change the title so it's more clear for us what to do.

@GreenReaper commented on December 19th 2018

The change does appear to group the graph. So the line graph now shows more visits than either.
If you click the row evolution it will then show, for example, 7 unique visitors (summing 4 and 3).
So far all well and good. However...

It has a more interesting impact on the referrer path list. Entries are not always grouped. So for example:

  • the 2nd referrer is: https://www.furaffinity.net/view/29795460/ and has 4 unique visitors
  • the 10th referrer is: http://www.furaffinity.net/view/29795460/ and has 3 unique visitors

There are eight referrer rows on the first page which shows as "1-10". I am guessing that on that page, there were two pairs that were merged. So merging between pairs is only done if they are on the same page.

Seemingly the merge has taken place after the "1-10" list is determined. In SQL it would be doing something like (select ... offset 0 limit 10) group by 'label'. To get consistent results, what needs to happen is for the group by to be within the subquery, not in the display filter outside it. That way it will always merge mergeable rows and always show the right number for each page of the results list.
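The difference between grouping after and before the page is cut can be reproduced with a small in-memory SQLite table. This is only a sketch of the ordering problem: the table layout, the scheme-less `label` column, the `example.org` rows and the page size of 3 are assumptions made for the demonstration and say nothing about Matomo's real schema or queries. Only the two furaffinity.net rows and their counts come from the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE referrers (label TEXT, url TEXT, visitors INTEGER)")
conn.executemany(
    "INSERT INTO referrers VALUES (?, ?, ?)",
    [
        ("example.org/a", "https://example.org/a", 9),
        ("example.org/b", "https://example.org/b", 5),
        ("www.furaffinity.net/view/29795460/",
         "https://www.furaffinity.net/view/29795460/", 4),
        ("www.furaffinity.net/view/29795460/",
         "http://www.furaffinity.net/view/29795460/", 3),
        ("example.org/c", "https://example.org/c", 2),
    ],
)

# Grouping applied AFTER the page is cut: only rows that happen to land
# on the same page can be merged.
late = conn.execute(
    """
    SELECT label, SUM(visitors) AS total FROM (
        SELECT label, visitors FROM referrers
        ORDER BY visitors DESC, url LIMIT 3 OFFSET 0
    )
    GROUP BY label ORDER BY total DESC, label
    """
).fetchall()

# Grouping applied BEFORE the page is cut: every pair is merged first,
# then the page is taken from the merged list.
early = conn.execute(
    """
    SELECT label, SUM(visitors) AS total FROM referrers
    GROUP BY label ORDER BY total DESC, label LIMIT 3 OFFSET 0
    """
).fetchall()

print(late)   # furaffinity row shows 4; its http twin is left for page two
print(early)  # furaffinity row shows 7 and moves up to second place
conn.close()
```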

There is also the question of whether https: or http: is preferred for the link on the label. I'd tend towards https: but that might misrepresent how many people visit with https and overall it doesn't matter a huge amount - it might be inefficient to indicate a preference, especially if it has to count which has the most hits.

@tsteur commented on December 19th 2018 Member

Cheers makes sense 👍 Could you do me a favour and test the updated PR? https://github.com/matomo-org/matomo/pull/13884/files

@GreenReaper commented on December 19th 2018

The updated PR does not change the above situation. Here is what I am seeing, in case it helps:

[Screenshots: views, page1, page2]

I'd expect there to be ten entries on the first list page, not eight, and for the unique visitors for the single entry for that path to be 7. The marked path should not be on the second page because it has been grouped into the entry on the first page.

This might be computationally expensive since it has to perform the group on the entire list, rather than just ten selected entries within it, but I'm not sure if it would be significantly more than it already is.

@GreenReaper commented on December 19th 2018

Oh, never mind, I messed up and left the last filter as queueFilter. 😅
I think it is working properly now, I see a 7, but I will test a bit more...

@GreenReaper commented on December 19th 2018

So, it does appear to be grouping! In fact I can tell the actual number in the table has been impacted because in this case the dashboard view for this website is showing 1-41 rather than 1-50.

This is very helpful for me - reducing the duplicates makes that list far more useful.

I should caution that this might not be the desired behaviour if someone particularly wanted to tell whether visitors referred via https vs. http had some difference in actions/average time/bounce rate, etc. However, I'm struggling to think of a case where this would be as useful as having them grouped.

Similarly I'm not sure that the proportion of https vs. http is represented in any way - albeit maybe it impacts the chance for it to be selected as the link URL? I figure that'd be up to MySQL in such a situation, the result not being deterministic (the URL is equivalent to 'B' in their example); but if more referrers are HTTPS, it is more likely to be the one picked to represent the group, and so end up as the link for the label.

This GROUP BY usage is non-portable to other DBMS because of this non-determinism as covered in #5124.
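If a deterministic choice of link URL were wanted, one option would be to pick the representative explicitly instead of leaving it to the database. The sketch below is one possible rule, not something the PR does: the function name is hypothetical, and the rule (most visitors wins, ties go to https) is an assumption chosen for illustration.

```python
def pick_representative(variants):
    """Choose one URL to stand for a merged group.

    variants is a list of (url, visitors) pairs. The URL with the most
    visitors wins; on a tie https is preferred, then plain string order,
    so the result never depends on row order.
    """
    best = min(
        variants,
        key=lambda v: (-v[1], not v[0].startswith("https://"), v[0]),
    )
    return best[0]


# The pair from the report above: 4 visitors over https, 3 over http.
group = [
    ("http://www.furaffinity.net/view/29795460/", 3),
    ("https://www.furaffinity.net/view/29795460/", 4),
]
chosen = pick_representative(group)
print(chosen)  # https://www.furaffinity.net/view/29795460/
```

As noted earlier in the thread, counting which variant has the most hits adds work, so a fixed preference (for example always https when present) would be the cheaper variant of the same idea.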

@tsteur commented on December 19th 2018 Member

Same, I think it should definitely be grouped 👍 If needed, one could possibly work around this with segments.
