@tsteur opened this Issue on May 4th 2020 Member

Seeing referrer urls like

https://l.instagram.com/?u=http://www.example.com/&e=BTPctSgKzf_-3AEudknfwySo4YW52B5wk1evGWzkDvpRCQYf5Kid7P6mQVKbEhsNvwC0WY37vj19ZKQb&s=1

This &e= parameter seems to be unique to Instagram, as I see it in many different URLs. See also https://webmasters.stackexchange.com/a/114165

The e parameter most likely stands for external or event. In the worst case this identifies a user and represents personal data?

We do not want to store this parameter in the DB and want to remove that string completely. Potentially we also want to remove the s parameter. Not sure what it stands for.

In some cases this might produce millions of different URLs, resulting in millions of different rows when archiving.

@mattab commented on May 4th 2020 Member

+1 to remove the e parameter

@tsteur commented on May 4th 2020 Member

Seems the s parameter can be removed as well. Here's another example https://l.instagram.com/?u=https%3A%2F%2Fexample.com%2Fexample.com&e=BTPcuqWixl6Mf5hgYPp6wXIlstuaEdJssdYEvT9s8-6yme_lb275lY2Bwc-YvE-fZNtSKux4QB-v8xNk&s=1

However, the &s=1 is not present in every URL. I'm not quite sure what the data in the referrer URL dimension looks like. We might need to match & as well as &amp; (the code below is untested):

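        // Strip Instagram's "e" parameter; the separator may be "&" or an escaped "&amp;".
        if (!empty($information['referer_url']) && $information['referer_name'] === 'Instagram') {
            // [^&]* stops at the next separator, so later parameters are left intact.
            $information['referer_url'] = preg_replace('/(?:&amp;|&)e=[^&]*/', '', $information['referer_url']);
        }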
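
The same idea as a self-contained sketch that can be run outside Matomo. The function name `stripInstagramParams` is hypothetical, the `e` values in the sample URLs are shortened, and removing `s` together with `e` is an assumption based on the comment above. Because it works on the raw string, it would also strip an unencoded `&e=` or `&s=` that belongs to the target URL in `u`.

```php
<?php
// Hypothetical helper, not part of Matomo: removes the "e" and "s" query
// parameters from an Instagram link-shim referrer URL.
function stripInstagramParams(string $url): string
{
    // The separator may be a raw "&" or an HTML-escaped "&amp;".
    // [^&]* consumes the value and stops at the next separator.
    return preg_replace('/(?:&amp;|&)(?:e|s)=[^&]*/', '', $url) ?? $url;
}

// Raw and HTML-escaped separators give the same cleaned URL.
echo stripInstagramParams('https://l.instagram.com/?u=http://www.example.com/&e=BTPctSgK&s=1'), "\n";
echo stripInstagramParams('https://l.instagram.com/?u=http://www.example.com/&amp;e=BTPctSgK&amp;s=1'), "\n";
```

Both calls print `https://l.instagram.com/?u=http://www.example.com/`.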

@tsteur commented on May 4th 2020 Member

@sgiehl it looks like the same might apply to Facebook, e.g.

http://l.facebook.com/l.php?u=http://www.example.com.com/&h=BL0RXrrUUyk_ZbqijDe_mVGBi3ZsyVxJEvOfIhjlUEiRy4zkKwYMDUWbuoICNzhC6pKm6zbGCPAJQP4s8e2psymaokRV3dhp7FPx4Zk6B4x0fBbYTi54xynmBsoBRFB7f5t

For these URLs we can remove the h parameter

@tsteur commented on May 5th 2020 Member

Also for lm.facebook.com (and m.facebook.com/l.php):

http://lm.facebook.com/l.php?u=http://example.com/foobar&h=BT2Dh3r3VDLoabL3Rb1lpmN-_s0lFtReSGzBED3kfUGnaO5fPF-x8LspJAfJN9kkee5ptpybYgyIx68yzgo9kPAN6snSZL_eNcmgu5xhuUcLXJukNKvi0XMOY78Ca9NKexnpJKxKUDeVApPcfB

Also the one below. I'm not sure, but it looks like it contains personal data. I wonder if it makes sense to store only https://googleads.g.doubleclick.net/pagead/ads? or https://googleads.g.doubleclick.net/pagead/ads?url=https://example.com/foobar.php?
fyi @mattab

https://googleads.g.doubleclick.net/pagead/ads?client=ca-pub-1086157892257495&output=html&h=300&slotname=9664612556&adk=3030006362&adf=4109546823&w=360&lmt=2588583301&rafmt=1&psa=1&guci=2.2.0.0.2.2.0.0&format=360x300&url=https://example.com/foobar.php?fid=3241&flash=0&fwr=1&rpe=1&resp_fmts=3&sfro=1&wgl=1&dt=1548583300834&bpp=63&bdt=168&idt=174&shv=r30200428&cbv=r24140131&ptt=9&saldr=aa&abxe=1&cookie=ID=7b741e56705a5595:T=158853181:S=A5NI_MZyqUUO8pNhi4diWmFjQk4H8Y-hJA&crv=1&correlator=8391657028466&frm=20&pv=2&ga_vid=1350038101.1588583682&ga_sid=1585583301&ga_hid=135474036&ga_fc=1&ga_wpids=UA-3641475-2&iag=0&icsg=8362&dssz=13&mdo=0&mso=0&u_tz=420&u_his=1&u_java=0&u_h=780&u_w=360&u_ah=780&u_aw=360&u_cd=24&u_nplug=0&u_nmime=0&adx=0&ady=80&biw=360&bih=648&scr_x=0&scr_y=0&eid=21065451,23465474,4471896&oid=3&pvsid=164627556405179&pem=809&ref=https://example.com/foobar.php?aff_sub3=ID-rdr-2&utm_campaign=38350&utm_source=IK-xd3&utm_medium=rdr&rx=0&eae=0&fc=644&brdim=0,0,0,0,360,0,360,648,360,650&vis=1&rsz=||leE|&abl=CS&pfx=0&fu=8334&bc=31&ifi=1&uci=b!1&fsb=1&xpc=iZjViRq6KJ&p=https://example.com&dtd=230

Maybe also convert the URL below to https://main.exoclick.com?

https://main.exoclick.com/click.php?data=H4sIAAAAAALAA1WQT2sCMRDFv8pe9ugyfzLJpLdKkYKHHktPstnsYlCrWKke5sM3brFQJoeZvPd.JNMYq3deokFHaNvL5fTV8nNLq3puh313Hvv9pRzGRSq5G46Hej3sy7BreVVaftm.Lz7KehmXr7qBlvx4O26Gkqvy4JqBGYqq.EDs7Lvsyul4_rzDDDV2KF10XTU5ZNUAaOyVUNEEtEYpGhlixTiuNLS7AW7egcnY55T6XgRK0o6RWNBLzUeB4MSISIJlnkggU558jpBSkkCUAXRiHMep0iv837tgrua.jqfmer02D7WpamNz4Lf5SUDhufMgyPOHUe3PUSd7W1uK09S7wYfolCUF6KeBIXDW5F2Y9Af8Es1PkgEAAA--&wpn=NzQ2MDUxM3xCUkF8ODEwMDE6ODc5fDB8MjIyNTd8Mjl8MHitMXwzNDU3MTUzOmVkOTA3ZTM4N2Y3MTU2OGFiMzQ1MjcwN2Y1IWUzYzM5

Also, for Google, remove the ust and usg parameters?

https://www.google.com/url?q=https://example.com/foo&sa=D&ust=1689581471834000&usg=BCQjCNFw5f1S7rLgPNephpTW_4-i2KnAGA

The one below could then be shortened to maybe https://www.mgid.com?ts=com.google.android.googlequicksearchbox or only https://www.mgid.com?

https://www.mgid.com/ghits/5633098/i/113152/0/pp/6/1?h=BbYLxfMqnKZaGQ2xAYIHcOAQPcWj4dWQv_DGpWVIRdhZieoJnL-LlQSwR7epXrq5&rid=60ckde11-0dfd-11ea-8948-d194662c24f7&ts=com.google.android.googlequicksearchbox&tt=Organic&cpm=1&gbpp=1&k=784880fcib-45T5E4fIWfXH.hQZjfXH.hbQFfbD:fr;fx!fW~f=f4:faI:fV=fO:ffx!fQf.faHR0cHM6Ly9lY39ub215Lm9rZXpv5bmUuY29tL3JlYWQvMjAyMC8wN$8wMy8zMjAvMjI=fYW5kcm9pZC1hcHA6Ly9jb20uZ39vZ2xlLmFuZHJvaWQuZ29vZ2xlcXVpY2tzZWFyY2g=fK45vL2Nvb$5nb29nbGUuYW5kcm9pZC5nb29nbGVxdWlja3NlYXJjaG5veC9odHRwcy9529vZ2xlf*fMzQ2*DQxN5cx*DQwOTc=rMHwxMXw6MXwzNA==nMHwwf!fyfNjI2*Dt2MHww*DE4Mg==ft!fLQfXH.hR.Df!fTW96aWxsY$81LjAgrExpbnV4OyBBbmRyb4lkIDk7IFJlZG1pIDZBK$BBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvK$BDaHJ4bWUvNzguMC4zOTA0LjEwOCBNb2JpbGUgU2FmYXJpLzUzNy4zNg==ffMHwzfTGludXgyYXJtdjdsfNDIwfMHw2NQ==fMzYw*Dcy5A==fY2Vsb5VsYXJ8NGd8MA==f!f!fQf+f*f*&muid=kb3UBFHwRcc3

Then shorten the one below to https://www.youtube.com/live_chat

https://www.youtube.com/live_chat?continuation=0ofMyANmGlBDamdLRFFvTFYxTkNaMDVWZDJ4U2F4RXFKk29ZVlVOS1gzTXhNVXgwUVdkVE5qbDBXV2t6Y2xwTWFUQlJFZ3RYVTBKblRsVjNiRkpwVVN6ggECCASIAQGgAbfk5bjnmekC

For these YouTube URLs, remove the redir_token and html_redirect URL parameters:

https://www.youtube.com/redirect?html_redirect=1&q=https://foo.ly/4e5aedN&redir_token=BAp5R4CDGTNGSSwGxxKIwSNtirZ8MTU4ODYyNjg5MkAxNTg4NTQwNDk7&event=live_chat
https://www.youtube.com/redirect?q=https://foo.ly/4e5aedN&event=live_chat&html_redirect=1&redir_token=1XVoeG470Kc50MbkWHP23KvTaJp8MTU7Nzg3NDM5OUAxNTg3Nzg3OTk6

Could we also remove the gclid parameter in general? E.g.

https://foo.bar.com/baz?gclid=e4334-3434434343434343

For Bing, we could remove the URL parameters cvid, refig, cc, elv and plvar:

https://www.bing.com/search?q=foo+bar&form=EDGTCK&qs=AB&cvid=ff8399e313a74fb592b0ca1d91c42224&refig=4540178a841b46ce8de1664920449112&cc=BE&setlang=4k-NL&elv=AXXfrEiqqD9r3GuelwApuloWthKnH5oOVtTkjmeLPBeagbGxe4rwyaaV!5HJFcbCTxaO4q5w7QqvI8XbCTXyJKn1N4PzqCvVFSdBSr*sdwlB&plvar=0

@sgiehl the most important ones for now would be the Instagram and Facebook URLs. We can also create a new issue for all the other ones.
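
To make the per-host suggestions above concrete, here is a rough, table-driven sketch. The constant `REFERRER_PARAMS_TO_REMOVE` and the function `cleanReferrerUrl` are hypothetical names, the table only collects the parameters named in this thread, and removing `gclid` for every host is still an open question above. It handles only raw `&` separators and does not cover the cases where the whole URL should be shortened (doubleclick, exoclick, mgid, live_chat).

```php
<?php
// Hypothetical rule table: referrer host => query parameters to drop.
// Only the parameters mentioned in this thread are listed.
const REFERRER_PARAMS_TO_REMOVE = [
    'l.instagram.com' => ['e', 's'],
    'l.facebook.com'  => ['h'],
    'lm.facebook.com' => ['h'],
    'm.facebook.com'  => ['h'],
    'www.google.com'  => ['ust', 'usg'],
    'www.youtube.com' => ['redir_token', 'html_redirect'],
    'www.bing.com'    => ['cvid', 'refig', 'cc', 'elv', 'plvar'],
];

function cleanReferrerUrl(string $url): string
{
    $host  = parse_url($url, PHP_URL_HOST);
    $names = is_string($host) ? (REFERRER_PARAMS_TO_REMOVE[$host] ?? []) : [];
    $names[] = 'gclid'; // assumption: dropped for every host

    foreach ($names as $name) {
        // Match "name=value" plus a trailing "&", but only right after "?" or "&".
        $pattern = '/(?<=[?&])' . preg_quote($name, '/') . '=[^&]*&?/';
        $url = preg_replace($pattern, '', $url) ?? $url;
    }

    // Drop a separator left dangling at the end of the URL.
    return rtrim($url, '?&');
}

echo cleanReferrerUrl('https://www.google.com/url?q=https://example.com/foo&sa=D&ust=1689581471834000&usg=BCQjCNFw5f1S7rLgPNephpTW_4-i2KnAGA'), "\n";
echo cleanReferrerUrl('https://foo.bar.com/baz?gclid=e4334-3434434343434343'), "\n";
```

The two calls print `https://www.google.com/url?q=https://example.com/foo&sa=D` and `https://foo.bar.com/baz`. Keeping the rules in one table would let new hosts be added without touching the matching logic.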

@tsteur commented on May 5th 2020 Member

Maybe could also remove any configured url_query_parameter_to_exclude_from_url parameter? (seeing these in quite a few referrers)
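
A minimal sketch of what that could look like, assuming the configured names are already available as a plain array. `removeExcludedParams` is a hypothetical name, and how Matomo actually loads `url_query_parameter_to_exclude_from_url` is not shown. It splits the query on `&` and keeps the remaining pairs in their original order and encoding; fragments and `&amp;` separators are not handled.

```php
<?php
// Hypothetical helper: $excluded stands in for the configured
// url_query_parameter_to_exclude_from_url list.
function removeExcludedParams(string $url, array $excluded): string
{
    $parts = explode('?', $url, 2);
    if (count($parts) < 2) {
        return $url; // no query string, nothing to remove
    }
    [$base, $query] = $parts;

    $kept = [];
    foreach (explode('&', $query) as $pair) {
        // The parameter name is everything before the first "=".
        $name = explode('=', $pair, 2)[0];
        if (!in_array($name, $excluded, true)) {
            $kept[] = $pair;
        }
    }

    // Re-attach the query only if something is left.
    return $kept ? $base . '?' . implode('&', $kept) : $base;
}

echo removeExcludedParams('https://foo.bar.com/baz?gclid=e4334-3434434343434343&page=2', ['gclid']), "\n";
```

This prints `https://foo.bar.com/baz?page=2`.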

@mattab commented on May 5th 2020 Member

> Maybe could also remove any configured url_query_parameter_to_exclude_from_url parameter? (seeing these in quite a few referrers)

Yes, that would be great. Ideally we'd also need the feature from https://github.com/matomo-org/matomo/issues/15426 "Full" in case people really need this data for some reason (no BC break)?

@sgiehl commented on May 11th 2020 Member

@tsteur can we close that one, or are there still some excludes open that should be added?

@tsteur commented on May 11th 2020 Member

Yes, thanks @sgiehl

This Issue was closed on May 11th 2020