Normalize referrer domains #4033

Open
gka opened this issue Jul 4, 2013 · 10 comments
Labels
c: Usability For issues that let users achieve a defined goal more effectively or efficiently. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.

Comments

@gka
Contributor

gka commented Jul 4, 2013

The list of referrer websites can be significantly improved by normalizing the domain names. Currently, subdomains such as "www7" are treated as separate websites. Here's an example of such a referrer list, in which lemonde.fr is listed several times:

[[Image(http://new.tinygrab.com/f3aa221edeba52ea05e91e20b51690a2c38c508b47.png)]]

Of course this is not trivial, as some sub-domains are pointing to separate websites while others are only mirrors or mobile variants of the same site.

To help with this, Mozilla maintains a list of "effective" TLD names. This list includes domains such as bl0gsp0t.com and dyndns.org, because X.dyndns.org should be treated as a separate website.

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Using this list, it is easy to normalize the domains, or in other words, to extract the "effective" websites. The list is not perfect (for instance, tumblr.com is missing), but it should solve 95% of the problem.
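To make the idea concrete, here is a minimal sketch of suffix-based normalization. It assumes a tiny hand-picked subset of suffixes (`SKETCH_SUFFIXES`) instead of the full Mozilla list, and the function name `normalizeReferrerHost` is hypothetical. It also ignores the wildcard and exception rules that the real list contains, so treat it as an illustration rather than a complete implementation.

```php
<?php
// Hypothetical subset of "effective TLDs"; the real list is much longer
// and also contains wildcard and exception rules that are ignored here.
const SKETCH_SUFFIXES = ['com', 'fr', 'org', 'uk', 'co.uk', 'dyndns.org'];

/**
 * Reduce a hostname to "one label + longest matching suffix".
 * Hosts whose suffix is not in the list are returned unchanged (lowercased).
 */
function normalizeReferrerHost(string $host, array $suffixes = SKETCH_SUFFIXES): string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so that "co.uk" wins over "uk".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // Keep one label to the left of the matched suffix, if there is one.
            $start = max($i - 1, 0);
            return implode('.', array_slice($labels, $start));
        }
    }

    return implode('.', $labels);
}

echo normalizeReferrerHost('www7.lemonde.fr'), "\n";
echo normalizeReferrerHost('X.dyndns.org'), "\n";
echo normalizeReferrerHost('www.guardian.co.uk'), "\n";
```

Because `dyndns.org` is itself in the suffix list, each `X.dyndns.org` host stays a separate website, while the `www7`-style subdomains of lemonde.fr collapse into one entry.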

@mattab
Member

mattab commented Jul 14, 2013

Good idea to use a list to improve the referrer website.
For the lemonde example, though, I feel that listing all the subdomains brings value, as it helps to see which sub-sites bring more traffic. lemonde is not in the list, so it makes sense.

We could also implement this as a plugin in the upcoming marketplace at: http://plugins.piwik.org/

@gka
Contributor Author

gka commented Jul 16, 2013

Another very smart solution would be to just group the visits by domain and subdomain. This seems easier, as we don't need to maintain the effective TLD list at all. The result could look like this:

||= Website =||= Visits =||
|| guardian.co.uk || 503108||
|| lemonde.fr || 303471||
|| - www.lemonde.fr || 177113||
|| - decodeurs.blog.lemonde.fr || 83375||
|| - emploi.blog.lemonde.fr || 30323||
|| - abonnes.lemonde.fr || 7412||
|| - mobile.lemonde.fr || 2652||
|| - alicedsl.lemonde.fr || 2596||
|| derstandard.at || 58850||

OK, we might still need to maintain a shorter list of effective TLDs containing some country-specific ones, such as co.uk, but we don't need to cover company-specific TLDs such as blogsp0t.com, as users can easily unfold the domain to see which blogs are linking most.
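The grouping described above could be sketched roughly as follows. The short suffix list (`COUNTRY_SUFFIXES`) and the function names `parentDomain` and `groupByDomain` are assumptions made for illustration; the input is a plain array of hostname to visit count, not an actual Piwik data structure.

```php
<?php
// Assumed short list of multi-label country suffixes (not exhaustive).
const COUNTRY_SUFFIXES = ['co.uk', 'com.br'];

// Return the domain a host should be grouped under:
// the last three labels for hosts under a listed suffix, otherwise the last two.
function parentDomain(string $host): string
{
    $labels = explode('.', strtolower($host));
    if (count($labels) <= 2) {
        return implode('.', $labels);
    }
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, COUNTRY_SUFFIXES, true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

// Build a two-level structure: domain => total visits plus per-subdomain visits.
function groupByDomain(array $visitsByHost): array
{
    $groups = [];
    foreach ($visitsByHost as $host => $visits) {
        $host = strtolower($host);
        $domain = parentDomain($host);
        if (!isset($groups[$domain])) {
            $groups[$domain] = ['visits' => 0, 'subdomains' => []];
        }
        $groups[$domain]['visits'] += $visits;
        if ($host !== $domain) {
            $groups[$domain]['subdomains'][$host] = $visits;
        }
    }

    // Sort subdomains and then domains by visits, highest first.
    foreach ($groups as &$group) {
        arsort($group['subdomains']);
    }
    unset($group);
    uasort($groups, function ($a, $b) {
        return $b['visits'] <=> $a['visits'];
    });

    return $groups;
}

$report = groupByDomain([
    'www.lemonde.fr'            => 177113,
    'decodeurs.blog.lemonde.fr' => 83375,
    'mobile.lemonde.fr'         => 2652,
    'derstandard.at'            => 58850,
]);

foreach ($report as $domain => $group) {
    echo $domain, ' ', $group['visits'], "\n";
    foreach ($group['subdomains'] as $host => $visits) {
        echo ' - ', $host, ' ', $visits, "\n";
    }
}
```

Keeping the per-subdomain counts alongside the total is what allows the report to be unfolded, or flattened again, without re-reading the raw data.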

(btw I hate this comment system which always blacklists my comments just because I include blogsp0t.com. silly!)

@mattab
Member

mattab commented Jul 16, 2013

Great idea to add a new "view" of the report with subtables showing subdomains.

Maybe we could show this new report as a new Related Report footer link, "Websites by Domain", under the "Websites" report.

Or maybe as a "COG" dropdown option.

@gka
Contributor Author

gka commented Jul 22, 2013

I would prefer making the hierarchical view the new default and then let the user "make it flat" as we are doing with the Pages report.

Anyone thinking that the flat view is better than grouping by domain?

@mattab
Member

mattab commented Jan 13, 2014

Nice idea for a plugin which could filter out the Referrers dataTable to make the grouping as explained here!

@gka gka added this to the Future releases milestone Jul 8, 2014
@gka
Contributor Author

gka commented Oct 17, 2014

As a first step toward this, I worked on a PHP implementation for extracting the "effective" domain name of a hostname.

Usage is very simple:

> include('EffectiveDomainName.php');

> print EffectiveDomainName::get('mobile.nytimes.com') . "\n";
nytimes.com

> print EffectiveDomainName::get('flightjs.github.io') . "\n";
flightjs.github.io

> print EffectiveDomainName::get('www.google.com.br') . "\n";
google.com.br

https://github.com/gka/effective-domain-name

@mattab mattab reopened this Nov 27, 2014
@mattab
Member

mattab commented Nov 27, 2014

@gka Thanks for the tip.

Weird that this issue got closed, I don't think I closed it unless it was by mistake...

It would be relatively easy to create a plugin that either modifies the existing getWebsites report or adds a new related report, in which we call a GroupBy filter that groups rows by "effective domain".

@mattab mattab removed the worksforme The issue cannot be reproduced and things work as intended. label Nov 27, 2014
@mattab mattab modified the milestones: Mid term, Long term Nov 27, 2014
@mattab mattab added the c: Usability For issues that let users achieve a defined goal more effectively or efficiently. label Nov 27, 2014
@mattab
Member

mattab commented Nov 27, 2014

Would you also group t.co under twitter.com ?

and maybe group m.facebook.com and lm.facebook.com under facebook.com ?

@gka
Contributor Author

gka commented Nov 27, 2014

Since facebook.com is not listed as effective TLD (aka "public suffix"), any subdomain *.facebook.com will indeed be "normalized" to facebook.com. However, t.co is not being "grouped" with twitter.com, as both are entirely different domains.

@mattab mattab modified the milestones: Short term, Mid term Dec 1, 2014
@mattab
Member

mattab commented Dec 1, 2014

Hi @gka alright

maybe we could use your list and then customise it with all known social networks domains for example.
I'm setting to Short term as it's quite easy to build this at least in a plugin on the Marketplace

we'd simply apply the normalisation function in a custom filter, that would GroupBy the labels by the normalisation function. it would ideally be possible to disable it in the Cog icon menu.
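A rough standalone sketch of that idea, outside of any plugin: rows are grouped by whatever label the normalisation function returns, and a small alias map stands in for the "known social networks" customisation. The alias map `SOCIAL_ALIASES`, the function `groupRowsByLabel`, and the row shape (`label`, `nb_visits`) are assumptions for illustration, not the actual plugin or DataTable API.

```php
<?php
// Hypothetical alias map for short links and mobile hosts of social networks.
const SOCIAL_ALIASES = [
    't.co'            => 'twitter.com',
    'm.facebook.com'  => 'facebook.com',
    'lm.facebook.com' => 'facebook.com',
];

// Group rows by the normalised label and sum their visits.
function groupRowsByLabel(array $rows, callable $normalize): array
{
    $grouped = [];
    foreach ($rows as $row) {
        $label = $normalize($row['label']);
        if (!isset($grouped[$label])) {
            $grouped[$label] = ['label' => $label, 'nb_visits' => 0];
        }
        $grouped[$label]['nb_visits'] += $row['nb_visits'];
    }
    return array_values($grouped);
}

// Normalisation used here: alias lookup only. A suffix-list based
// normaliser could be chained in before or after the lookup.
$normalize = function (string $host): string {
    $host = strtolower($host);
    return SOCIAL_ALIASES[$host] ?? $host;
};

$rows = [
    ['label' => 't.co',           'nb_visits' => 10],
    ['label' => 'twitter.com',    'nb_visits' => 5],
    ['label' => 'm.facebook.com', 'nb_visits' => 7],
];

print_r(groupRowsByLabel($rows, $normalize));
```

Passing the normaliser in as a callable keeps the grouping step independent of the list being used, which would also make it straightforward to switch the grouping off from a report setting.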

@mattab mattab modified the milestones: Short term, Mid term Apr 7, 2015
@mattab mattab removed this from the Long term milestone Dec 5, 2016