Instagram generates heaps of different referrer urls causing out of memory issues #15902

Closed
tsteur opened this issue May 4, 2020 · 8 comments · Fixed by #15905
Labels: c: Performance, c: Privacy


tsteur commented May 4, 2020

We are seeing referrer URLs like:

https://l.instagram.com/?u=http://www.example.com/&e=BTPctSgKzf_-3AEudknfwySo4YW52B5wk1evGWzkDvpRCQYf5Kid7P6mQVKbEhsNvwC0WY37vj19ZKQb&s=1

This &e= parameter seems to be unique to Instagram, as I see it in many different URLs. See also https://webmasters.stackexchange.com/a/114165

The e parameter most likely stands for external or event. In the worst case it identifies a user and therefore represents personal data?

We do not want to store this parameter in the DB and want to remove that string completely. Potentially we also want to remove the s parameter. Not sure what it stands for.

In some cases this can produce millions of different URLs, which in turn produce millions of different rows when archiving.

@tsteur tsteur added the c: Performance For when we could improve the performance / speed of Matomo. label May 4, 2020
@tsteur tsteur added this to the 3.13.6 milestone May 4, 2020

mattab commented May 4, 2020

+1 to remove the e parameter


tsteur commented May 4, 2020

Seems the s parameter can be removed as well. Here's another example https://l.instagram.com/?u=https%3A%2F%2Fexample.com%2Fexample.com&e=BTPcuqWixl6Mf5hgYPp6wXIlstuaEdJssdYEvT9s8-6yme_lb275lY2Bwc-YvE-fZNtSKux4QB-v8xNk&s=1

However, &s=1 is also missing from some URLs. I'm not quite sure what the data in the referrer URL dimension looks like. We might need to match & as well as &amp;, something like this (I haven't tested the code below):

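        if (!empty($information['referer_url']) && $information['referer_name'] === 'Instagram') {
            // Drop the e parameter together with its leading "&" or "&amp;" separator.
            // Does not cover the case where e is the first parameter after "?".
            $information['referer_url'] = preg_replace('/(?:&amp;|&)e=[^&]*/', '', $information['referer_url']);
        }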
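The snippet above edits the URL with a regex. As an alternative, here is a minimal sketch that splits the raw query string into pairs and drops the unwanted ones. `removeQueryParams` is a hypothetical helper, not an existing Matomo function, and it assumes the referrer has no `#fragment` after the query. It deliberately avoids decoding and re-encoding the query, so the un-encoded `u=http://...` value stays as it was.

```php
<?php
// Hypothetical helper (not part of Matomo): removes the named query
// parameters from a URL without decoding the remaining parameters.
function removeQueryParams(string $url, array $paramsToRemove): string
{
    $queryStart = strpos($url, '?');
    if ($queryStart === false) {
        return $url; // no query string, nothing to strip
    }

    $base  = substr($url, 0, $queryStart);
    $query = substr($url, $queryStart + 1);

    // Split on "&amp;" as well as "&" so HTML-escaped referrers are handled.
    // Kept pairs are re-joined with a plain "&".
    $pairs = preg_split('/&amp;|&/', $query);

    $kept = array_filter($pairs, function (string $pair) use ($paramsToRemove): bool {
        $name = explode('=', $pair, 2)[0];
        return $pair !== '' && !in_array($name, $paramsToRemove, true);
    });

    return $kept ? $base . '?' . implode('&', $kept) : $base;
}

// Shortened e value, for illustration only.
echo removeQueryParams('https://l.instagram.com/?u=http://www.example.com/&e=BTPct&s=1', ['e', 's']);
// prints https://l.instagram.com/?u=http://www.example.com/
```

With e and s gone, every click on the same link yields the same referrer URL, so archiving sees one row instead of one row per click.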

@mattab mattab added the c: Privacy For issues that impact or improve the privacy. label May 4, 2020
@sgiehl sgiehl self-assigned this May 4, 2020

tsteur commented May 4, 2020

@sgiehl it looks like the same might apply to Facebook, e.g.

http://l.facebook.com/l.php?u=http://www.example.com.com/&h=BL0RXrrUUyk_ZbqijDe_mVGBi3ZsyVxJEvOfIhjlUEiRy4zkKwYMDUWbuoICNzhC6pKm6zbGCPAJQP4s8e2psymaokRV3dhp7FPx4Zk6B4x0fBbYTi54xynmBsoBRFB7f5t

For these URLs we can remove the h parameter.


tsteur commented May 5, 2020

The same applies to lm.facebook.com (and also m.facebook.com/l.php):

http://lm.facebook.com/l.php?u=http://example.com/foobar&h=BT2Dh3r3VDLoabL3Rb1lpmN-_s0lFtReSGzBED3kfUGnaO5fPF-x8LspJAfJN9kkee5ptpybYgyIx68yzgo9kPAN6snSZL_eNcmgu5xhuUcLXJukNKvi0XMOY78Ca9NKexnpJKxKUDeVApPcfB
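To make the Facebook case concrete, here is a rough sketch that only touches `/l.php` links on the three hosts mentioned so far. The function and constant names are made up for illustration, the host list is probably incomplete, and it assumes h never appears as the first parameter after `?`.

```php
<?php
// Hosts taken from the examples in this thread; likely not exhaustive.
const FACEBOOK_REDIRECT_HOSTS = ['l.facebook.com', 'lm.facebook.com', 'm.facebook.com'];

// Hypothetical helper: strips the h parameter from Facebook redirect links
// and returns every other URL unchanged.
function stripFacebookHash(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH);

    $isRedirect = is_string($host)
        && in_array(strtolower($host), FACEBOOK_REDIRECT_HOSTS, true)
        && $path === '/l.php';
    if (!$isRedirect) {
        return $url;
    }

    // Remove "&h=..." or "&amp;h=..." up to the next "&" or the end of the URL.
    return preg_replace('/(?:&amp;|&)h=[^&]*/', '', $url) ?? $url;
}
```

Checking host and path first keeps an unrelated site's own h parameter (for example a height value) from being stripped by accident.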

Also the one below... I'm not sure, but it looks like it contains personal data. I wonder if it makes sense to store only https://googleads.g.doubleclick.net/pagead/ads? or https://googleads.g.doubleclick.net/pagead/ads?url=https://example.com/foobar.php?
fyi @mattab

https://googleads.g.doubleclick.net/pagead/ads?client=ca-pub-1086157892257495&output=html&h=300&slotname=9664612556&adk=3030006362&adf=4109546823&w=360&lmt=2588583301&rafmt=1&psa=1&guci=2.2.0.0.2.2.0.0&format=360x300&url=https://example.com/foobar.php?fid=3241&flash=0&fwr=1&rpe=1&resp_fmts=3&sfro=1&wgl=1&dt=1548583300834&bpp=63&bdt=168&idt=174&shv=r30200428&cbv=r24140131&ptt=9&saldr=aa&abxe=1&cookie=ID=7b741e56705a5595:T=158853181:S=A5NI_MZyqUUO8pNhi4diWmFjQk4H8Y-hJA&crv=1&correlator=8391657028466&frm=20&pv=2&ga_vid=1350038101.1588583682&ga_sid=1585583301&ga_hid=135474036&ga_fc=1&ga_wpids=UA-3641475-2&iag=0&icsg=8362&dssz=13&mdo=0&mso=0&u_tz=420&u_his=1&u_java=0&u_h=780&u_w=360&u_ah=780&u_aw=360&u_cd=24&u_nplug=0&u_nmime=0&adx=0&ady=80&biw=360&bih=648&scr_x=0&scr_y=0&eid=21065451,23465474,4471896&oid=3&pvsid=164627556405179&pem=809&ref=https://example.com/foobar.php?aff_sub3=ID-rdr-2&utm_campaign=38350&utm_source=IK-xd3&utm_medium=rdr&rx=0&eae=0&fc=644&brdim=0,0,0,0,360,0,360,648,360,650&vis=1&rsz=||leE|&abl=CS&pfx=0&fu=8334&bc=31&ifi=1&uci=b!1&fsb=1&xpc=iZjViRq6KJ&p=https://example.com&dtd=230

Also, maybe convert the one below to https://main.exoclick.com?

https://main.exoclick.com/click.php?data=H4sIAAAAAALAA1WQT2sCMRDFv8pe9ugyfzLJpLdKkYKHHktPstnsYlCrWKke5sM3brFQJoeZvPd.JNMYq3deokFHaNvL5fTV8nNLq3puh313Hvv9pRzGRSq5G46Hej3sy7BreVVaftm.Lz7KehmXr7qBlvx4O26Gkqvy4JqBGYqq.EDs7Lvsyul4_rzDDDV2KF10XTU5ZNUAaOyVUNEEtEYpGhlixTiuNLS7AW7egcnY55T6XgRK0o6RWNBLzUeB4MSISIJlnkggU558jpBSkkCUAXRiHMep0iv837tgrua.jqfmer02D7WpamNz4Lf5SUDhufMgyPOHUe3PUSd7W1uK09S7wYfolCUF6KeBIXDW5F2Y9Af8Es1PkgEAAA--&wpn=NzQ2MDUxM3xCUkF8ODEwMDE6ODc5fDB8MjIyNTd8Mjl8MHitMXwzNDU3MTUzOmVkOTA3ZTM4N2Y3MTU2OGFiMzQ1MjcwN2Y1IWUzYzM5

Also, for Google, remove the ust and usg parameters?

https://www.google.com/url?q=https://example.com/foo&sa=D&ust=1689581471834000&usg=BCQjCNFw5f1S7rLgPNephpTW_4-i2KnAGA

Then the one below could be shortened to maybe https://www.mgid.com?ts=com.google.android.googlequicksearchbox or only https://www.mgid.com?

https://www.mgid.com/ghits/5633098/i/113152/0/pp/6/1?h=BbYLxfMqnKZaGQ2xAYIHcOAQPcWj4dWQv_DGpWVIRdhZieoJnL-LlQSwR7epXrq5&rid=60ckde11-0dfd-11ea-8948-d194662c24f7&ts=com.google.android.googlequicksearchbox&tt=Organic&cpm=1&gbpp=1&k=784880fcib-45T5E4fIWfXH.hQZjfXH.hbQFfbD:fr;fx!fW~f=f4:faI:fV=fO:ffx!fQf.faHR0cHM6Ly9lY39ub215Lm9rZXpv5bmUuY29tL3JlYWQvMjAyMC8wN$8wMy8zMjAvMjI=fYW5kcm9pZC1hcHA6Ly9jb20uZ39vZ2xlLmFuZHJvaWQuZ29vZ2xlcXVpY2tzZWFyY2g=fK45vL2Nvb$5nb29nbGUuYW5kcm9pZC5nb29nbGVxdWlja3NlYXJjaG5veC9odHRwcy9529vZ2xlf*fMzQ2*DQxN5cx*DQwOTc=rMHwxMXw6MXwzNA==nMHwwf!fyfNjI2*Dt2MHww*DE4Mg==ft!fLQfXH.hR.Df!fTW96aWxsY$81LjAgrExpbnV4OyBBbmRyb4lkIDk7IFJlZG1pIDZBK$BBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvK$BDaHJ4bWUvNzguMC4zOTA0LjEwOCBNb2JpbGUgU2FmYXJpLzUzNy4zNg==ffMHwzfTGludXgyYXJtdjdsfNDIwfMHw2NQ==fMzYw*Dcy5A==fY2Vsb5VsYXJ8NGd8MA==f!f!fQf+f*f*&muid=kb3UBFHwRcc3

Then shorten the one below to https://www.youtube.com/live_chat

https://www.youtube.com/live_chat?continuation=0ofMyANmGlBDamdLRFFvTFYxTkNaMDVWZDJ4U2F4RXFKk29ZVlVOS1gzTXhNVXgwUVdkVE5qbDBXV2t6Y2xwTWFUQlJFZ3RYVTBKblRsVjNiRkpwVVN6ggECCASIAQGgAbfk5bjnmekC
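For referrers like this one, where nothing in the query is worth keeping, the shortening could be as simple as cutting the URL at the first `?` or `#`. A small sketch, with a made-up function name:

```php
<?php
// Hypothetical helper: keeps scheme, host and path, drops query and fragment.
function stripQueryAndFragment(string $url): string
{
    // strcspn() returns the length of the leading part of $url that
    // contains neither "?" nor "#".
    return substr($url, 0, strcspn($url, '?#'));
}

// Shortened continuation value, for illustration only.
echo stripQueryAndFragment('https://www.youtube.com/live_chat?continuation=0ofMyANmGl');
// prints https://www.youtube.com/live_chat
```

This would need to be limited to specific hosts and paths, since for most referrers the query string still carries useful information such as the search keyword.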

For these YouTube URLs, remove the redir_token and html_redirect URL parameters:

https://www.youtube.com/redirect?html_redirect=1&q=https://foo.ly/4e5aedN&redir_token=BAp5R4CDGTNGSSwGxxKIwSNtirZ8MTU4ODYyNjg5MkAxNTg4NTQwNDk7&event=live_chat
https://www.youtube.com/redirect?q=https://foo.ly/4e5aedN&event=live_chat&html_redirect=1&redir_token=1XVoeG470Kc50MbkWHP23KvTaJp8MTU7Nzg3NDM5OUAxNTg3Nzg3OTk6

Could we also remove the gclid parameter in general? E.g.

https://foo.bar.com/baz?gclid=e4334-3434434343434343

For Bing, we could remove the URL parameters cvid, refig, cc, elv, plvar:

https://www.bing.com/search?q=foo+bar&form=EDGTCK&qs=AB&cvid=ff8399e313a74fb592b0ca1d91c42224&refig=4540178a841b46ce8de1664920449112&cc=BE&setlang=4k-NL&elv=AXXfrEiqqD9r3GuelwApuloWthKnH5oOVtTkjmeLPBeagbGxe4rwyaaV!5HJFcbCTxaO4q5w7QqvI8XbCTXyJKn1N4PzqCvVFSdBSr*sdwlB&plvar=0
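Since the Google, YouTube and Bing cases all boil down to "for host X, drop parameters Y", they could be driven by one lookup table. The sketch below is only an illustration of that idea: the constant and function names are hypothetical, the table contains just the hosts and parameters proposed in this thread, and hosts are matched exactly (so other Google or Bing domains are not covered).

```php
<?php
// Host => parameters proposed for removal in this thread.
const REFERRER_PARAM_RULES = [
    'www.google.com'  => ['ust', 'usg'],
    'www.youtube.com' => ['redir_token', 'html_redirect'],
    'www.bing.com'    => ['cvid', 'refig', 'cc', 'elv', 'plvar'],
];

// Hypothetical helper: applies the table above to a referrer URL.
function cleanReferrerUrl(string $url): string
{
    $host  = strtolower((string) parse_url($url, PHP_URL_HOST));
    $parts = explode('?', $url, 2); // [everything before "?", raw query]

    if (!array_key_exists($host, REFERRER_PARAM_RULES) || count($parts) < 2) {
        return $url; // unknown host or no query string
    }

    $kept = [];
    foreach (explode('&', $parts[1]) as $pair) {
        $name = explode('=', $pair, 2)[0];
        if (!in_array($name, REFERRER_PARAM_RULES[$host], true)) {
            $kept[] = $pair; // keep the pair exactly as it appeared
        }
    }

    return $kept ? $parts[0] . '?' . implode('&', $kept) : $parts[0];
}
```

Note that the YouTube rule here applies to every www.youtube.com URL, not only to /redirect; a stricter version would also check the path.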

@sgiehl the most important ones for now are the Instagram and Facebook URLs. We can also create a new issue for all the other ones.


tsteur commented May 5, 2020

Maybe we could also remove any configured url_query_parameter_to_exclude_from_url parameter? (I'm seeing these in quite a few referrers.)
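A rough sketch of what that could look like, assuming the setting is available as a comma-separated string. The sample value and both function names are hypothetical, and nothing here reads the real Matomo config. One caveat: because it matches after any `?` or `&`, it would also strip matching parameters from an un-encoded URL nested inside another parameter.

```php
<?php
// Hypothetical stand-in for the configured value; Matomo is not queried here.
$configuredValue = 'gclid, fbclid, utm_source';

// Builds one case-insensitive regex from the comma-separated list,
// or returns null when the list is empty.
function buildExclusionPattern(string $commaSeparated): ?string
{
    $names = array_filter(array_map('trim', explode(',', $commaSeparated)), 'strlen');
    if (!$names) {
        return null;
    }
    $quoted = array_map(function (string $name): string {
        return preg_quote($name, '/');
    }, $names);

    // Match "name=value" directly after "?" or "&", plus one trailing "&" if present.
    return '/(?<=[?&])(?:' . implode('|', $quoted) . ')=[^&]*&?/i';
}

function stripConfiguredParams(string $url, string $commaSeparated): string
{
    $pattern = buildExclusionPattern($commaSeparated);
    if ($pattern === null) {
        return $url;
    }
    $cleaned = preg_replace($pattern, '', $url) ?? $url;

    // Tidy up a dangling "?" or "&" left at the end of the URL.
    return rtrim($cleaned, '?&');
}

echo stripConfiguredParams('https://foo.bar.com/baz?gclid=e4334-3434434343434343', $configuredValue);
// prints https://foo.bar.com/baz
```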

@sgiehl sgiehl linked a pull request May 5, 2020 that will close this issue

mattab commented May 5, 2020

> Maybe we could also remove any configured url_query_parameter_to_exclude_from_url parameter? (I'm seeing these in quite a few referrers.)

Yes, that would be great. Ideally we'd also need the feature from #15426 "Full" in case people really need this data for some reason (no BC break)?


sgiehl commented May 11, 2020

@tsteur can we close that one, or are there still some excludes open that should be added?


tsteur commented May 11, 2020

Yes, thanks @sgiehl

@tsteur tsteur closed this as completed May 11, 2020