@mattab opened this Pull Request on November 25th 2020 Member

According to one user, this has helped work around the issues with the Google Ads submission.

Similar to #6552 and #15273, but as a robots.txt file in addition to the meta tags.

Review

  • [ ] Functional review done
  • [ ] Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • [ ] Security review done see checklist
  • [ ] Code review done
  • [ ] Tests were added if useful/possible
  • [ ] Reviewed for breaking changes
  • [ ] Developer changelog updated if needed
  • [ ] Documentation added if needed
  • [ ] Existing documentation updated if needed
@Findus23 commented on November 25th 2020 Member

Just keep in mind that this means if people use wget to e.g. download CSV reports, their scripts will break.

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

@tsteur commented on November 25th 2020 Member

I was going to move this to 4.1 as it can break things, and such a change shouldn't be in a patch release while we're trying to make the Matomo 4 upgrade stable. But I then moved it back to 4.0.1 as we have only rolled out Matomo 4 to a few users so far. Nonetheless, it's a bit risky to put it into 4.0.1 without any notice etc. @mattab it would be great to mention this in the Matomo 4 changelog right away in the initial list of things.

Can you also add a developer changelog entry?

@tsteur commented on November 25th 2020 Member

@mattab

maybe the below would do as well (I'm not sure we can define multiple rules in one group)?

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
and https://developers.google.com/search/docs/advanced/robots/create-robots-txt
and http://www.robotstxt.org/db.html
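As a quick sanity check that consecutive User-agent lines do form one group (the question raised above), the proposed rules can be fed through Python's stdlib robots.txt parser; the user agents and paths below just mirror the snippet above:

```python
from urllib.robotparser import RobotFileParser

# The rules proposed in this comment, verbatim.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Both Google bots in the first group are blocked everywhere...
print(rp.can_fetch("Googlebot", "/index.php"))   # False
print(rp.can_fetch("AdsBot-Google", "/"))        # False

# ...while other agents (e.g. wget fetching CSV reports) are only
# blocked from the tracker endpoints, not the rest of the install.
print(rp.can_fetch("Wget/1.21", "/matomo.php"))  # False
print(rp.can_fetch("Wget/1.21", "/index.php"))   # True
```

So at least as far as this parser is concerned, multiple `User-agent` lines per group behave as intended and the BC concern for wget-style report downloads is limited to the two tracker endpoints.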

@mattab commented on November 29th 2020 Member

I didn't realise it could potentially break BC. Then maybe we could instead try setting the robots.txt to only:

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

We'd have no guarantee it helps with the Google Ads malware mis-identification issue, but at least it wouldn't break BC?

@tsteur commented on November 29th 2020 Member

I have no big preference. I suppose we could always try and see if it helps? We could also add more Google bots if needed see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

@mattab commented on November 29th 2020 Member

I asked the 6 people who had experienced the issue, and we will see if adding the simple robots.txt in this PR helps them.

If they confirm this workaround works for them, we could:

  1. consider breaking BC (risky / not great)
  2. or instead try to list in robots.txt all the google bots from the link below (so wget and any other user agents can still fetch reports), and hope it works then (or ask them again to test it, if they're willing)

We could also add more Google bots if needed see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

@tsteur commented on November 30th 2020 Member

FYI, I tested this on the demo using https://technicalseo.com/tools/robots-txt/ and things should work like that.

Added more crawlers to the list @mattab

@mattab commented on November 30th 2020 Member

LGTM

@MichaIng commented on January 8th 2021

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

That is true; such a warning shows up for each web page in Google Search Console, since matomo.php/piwik.php cannot be accessed: https://support.google.com/webmasters/answer/6352293#blocked-resources
And it then breaks tracking of bots in general, I guess, which in turn renders the Bot Tracker plugin obsolete: https://plugins.matomo.org/BotTracker

It is probably reasonable to not index the Matomo web UI, but it could also be interesting to track bots 🤔.

@tsteur commented on January 10th 2021 Member

@MichaIng I suppose an easy fix would be for the BotTracker plugin to delete the robots.txt regularly. E.g. it could do this on plugin activation and on plugin and core updates; it could even do it in a scheduled task, say hourly or daily. This would pretty much ensure the file never exists and bots can be tracked, except if the file was not writable (deletable). The plugin could probably also mark this file to be ignored by the "file integrity check", so it won't complain if the file doesn't exist. Happy to give more hints if someone is keen on implementing this.
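For illustration only (this is not BotTracker or Matomo code, and the paths are stand-ins for the real install directory), the recurring "delete robots.txt" task described above boils down to something like this sketch:

```python
import os
import tempfile

# Stand-in for the Matomo root directory (assumption for the demo;
# a real plugin would use Matomo's own path constants instead).
matomo_root = tempfile.mkdtemp()
robots = os.path.join(matomo_root, "robots.txt")

# Pretend core (re-)created the file, e.g. after an update...
with open(robots, "w") as f:
    f.write("User-agent: *\nDisallow: /matomo.php\n")

# ...and the scheduled task removes it again so bots stay trackable.
try:
    os.remove(robots)
except FileNotFoundError:
    pass  # already gone; nothing to do

print(os.path.exists(robots))  # False
```

The try/except matters in practice: the task runs repeatedly, so "file already absent" must be treated as success, and a non-writable directory is the one failure mode mentioned above that this cannot paper over.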

@MichaIng commented on January 10th 2021

Yes, this is what I did, but it feels more like a workaround than a good solution. It would probably be cleaner to solve it via an X-Robots-Tag header set within PHP, plus an option in Matomo (allow/disallow bot crawling, which implies allowing/disallowing crawlers being tracked by Matomo) that could then be switched when installing the BotTracker plugin. I guess making it more fine-grained and blocking only the files that are not required for loading the tracking JS wouldn't block much that isn't already blocked by .htaccess/webserver rules or authentication anyway, right?

But even though I removed the robots.txt and made sure that the Google crawler is able to check and index piwik.php and matomo.php, it still fails to load them with a tracking query string when crawling other files:

Calling the exact same URL + query string manually works and is successfully tracked in Matomo, and the cases where Google fails are a directory index and a simple HTML page without any CSS or JavaScript, aside from what Cloudflare injects to load the Piwik app into all pages.

I'm not sure how to debug this; we'd probably need a few other cases to confirm it is a general issue and not limited to our and/or similar setups, or e.g. to the Cloudflare app (although the query string is perfectly fine, so from that point on I'm not sure how it could have any effect). And if it is a general issue, we'd need to ask the Google community, I think.

Btw, good to know that wget respects robots.txt, I would have never guessed that 😄!

@tsteur commented on January 11th 2021 Member

I guess making it more fine grained and block only files that are not required for loading the tracking js by default wouldn't block much that is not blocked by .htaccess/webserver or authentication anyway, right?

We could maybe only block index.php, matomo.php and piwik.php, but that would basically be the same as now.

But even that I removed the robots.txt and assured that Google crawler is able to check and index piwik.php and matomo.php, it still fails load it with tracking query string when crawling other files:

Sorry, I'm not quite understanding this part. So you removed robots.txt, but the Google crawler still fails to access it when there is a tracking query string? As you are using Cloudflare, it might be good to check whether caching for this endpoint is disabled. I'm not too familiar with Cloudflare unfortunately. Maybe it has the robots.txt cached?

@MichaIng commented on January 11th 2021

We could maybe only block index.php, matomo.php and piwik.php, but that would basically be the same as now.

Since matomo.php and piwik.php are required for tracking, those would need to be allowed. index.php could be blocked.

Sorry not quite understanding this part. So you removed robots.txt but then Google crawler still fails to access it when there is a tracking query string?

Exactly, the crawler loads piwik.js successfully but fails to call piwik.php, in exactly the same way as when blocked by robots.txt. Caching was my first idea as well, but live-testing piwik.php directly succeeds, which would fail as well if robots.txt or X-Robots-Tag blocked it. But probably I've overlooked something, or Google uses multiple robots.txt caches depending on how a resource is accessed; I'll keep an eye on it.

This Pull Request was closed on November 30th 2020