Create robots.txt to prevent bots from indexing Matomo app #16795
Conversation
Shouldn't hurt to add that I guess
Just keep in mind that this means if people use wget to e.g. download CSV reports, their scripts will break. And Google shows an annoying warning in Search Console that the website can't be checked completely because some resources (Matomo) are blocked (at least it did years ago).
I was going to move this to 4.1, as it can break things and such a change should not be in a patch release where we're trying to make the Matomo 4 upgrade stable. But then I moved it back to 4.0.1, as we have only rolled out Matomo 4 to a few users so far. Nonetheless it's a bit risky to put it into 4.0.1 without any notice. @mattab it would be great to mention this in the Matomo 4 changelog right away, in the initial list of things. Can you also add a developer changelog entry?
Maybe the below would do as well (not sure we can define multiple rules)?
See https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
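For what it's worth, robots.txt does support multiple rule groups, one per `User-agent` line. A rough sketch (crawler names taken from the Google crawlers overview linked above; the exact set of bots to list is an open question):

```
# Each "User-agent" line starts a new rule group
User-agent: Googlebot
Disallow: /

# AdsBot ignores the "User-agent: *" group and must be named explicitly
User-agent: AdsBot-Google
Disallow: /

# Fallback for all other crawlers
User-agent: *
Disallow: /
```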
I didn't realise it could potentially break BC. Then maybe we could instead only set the robots.txt to:
We'd have no guarantee it helps with the Google Ads malware mis-identification issue, but at least it wouldn't break BC?
I have no big preference. I suppose we could always try it and see if it helps? We could also add more Google bots if needed, see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
Asked the 6 people who had experienced the issue; will see if it helps them to add the simple robots.txt in this PR. If they confirm this workaround works for them, we could
FYI: tested this on the demo using https://technicalseo.com/tools/robots-txt/ and things should work like that. Added more crawlers to the list. @mattab
LGTM |
That is true, such a warning shows up for each web page in Google Search Console: https://support.google.com/webmasters/answer/6352293#blocked-resources It is probably reasonable not to index the Matomo web UI, but it could also be interesting to track bots 🤔.
@MichaIng I suppose an easy fix would be for the BotTracker plugin to delete the robots.txt regularly. E.g. it could do this on plugin activation and on plugin and core updates; it could even do it in a scheduled task, say hourly or daily. This would pretty much ensure the file never exists and bots can be tracked, except if the file is not writable (deletable). The plugin could probably also mark this file to be ignored by the "file integrity check", so it won't complain if the file doesn't exist. Happy to give more hints if someone is keen on implementing this.
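If someone wants to pick this up, a rough sketch of such a scheduled task, assuming Matomo's plugin `Tasks` API and the `PIWIK_DOCUMENT_ROOT` constant (the method name here is hypothetical):

```php
<?php
namespace Piwik\Plugins\BotTracker;

class Tasks extends \Piwik\Plugin\Tasks
{
    public function schedule()
    {
        // Re-check every hour so a recreated robots.txt never survives long
        $this->hourly('deleteRobotsTxt');
    }

    public function deleteRobotsTxt()
    {
        $path = PIWIK_DOCUMENT_ROOT . '/robots.txt';
        if (file_exists($path)) {
            // Fails silently if the document root is not writable
            @unlink($path);
        }
    }
}
```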
Yes, this is what I did, but it sounds more like a workaround than a good solution. Cleaner would probably be to solve it via an X-Robots-Tag header set within PHP, plus an option in Matomo (allow/disallow bot crawling, which implies allowing/disallowing crawlers being tracked by Matomo) that could then be switched when installing the BotTracker plugin.

I guess making it more fine-grained and blocking by default only the files that are not required for loading the tracking JS wouldn't block much that is not blocked by
But even that I removed the
Calling the exact same URL + query string manually works and is successfully tracked in Matomo, and the cases where Google fails are a directory index and a simple HTML page without any CSS or JavaScript, aside from what Cloudflare injects to load the Piwik app into all pages. I'm not sure how to debug this; probably we'd need a few other cases to assure it is a general issue and not limited to our and/or similar setups, or e.g. the Cloudflare app (although the query string is perfectly fine, so from that point on I'm not sure how it could have any effect). And if it is a general issue, we'd need to ask the Google community, I think.

Btw, good to know that
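For reference, the header-based approach mentioned above would be a one-liner in PHP. Note the semantics differ from robots.txt: X-Robots-Tag does not block crawling, it only asks crawlers not to index the response:

```php
<?php
// Equivalent to <meta name="robots" content="noindex, nofollow">,
// but also works for non-HTML responses such as CSV reports.
header('X-Robots-Tag: noindex, nofollow');
```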
We could maybe only block index.php, matomo.php and piwik.php, but that would be basically the same as now, pretty much.
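Such a minimal robots.txt, covering only the entry points named above, could look like this (`Disallow` rules are prefix matches, so any query string on these paths is covered too):

```
User-agent: *
Disallow: /index.php
Disallow: /matomo.php
Disallow: /piwik.php
```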
Sorry, not quite understanding this part. So you removed
Since
Exactly, the crawler loads
EDIT: The issue persists, and neither Matomo nor the webserver nor PHP reports any error. Probably that mobile-friendly test tool or the Google crawler itself simply refuses to access resources with such a long query string; not sure, it doesn't give any information anywhere. I tried to enter the whole URL with query string into the Search Console URL inspection, but it won't run the test, and it doesn't show any reason why either; neither do the related help/doc pages give any hint.
@MichaIng I ran into the same issue. It started when I updated to Matomo 4. It is a Drupal website with the Matomo Drupal module (if that information helps).
I've opened a topic in the forum, hoping for more reports to compare with there: https://forum.matomo.org/t/google-and-bing-crawlers-fail-to-call-the-tracker-endpoint/41760?u=michaing
Was it still working with Matomo v3 until just recently? That would be a good hint that it is related to a change in Matomo v4 rather than a change in the search engines' crawlers.
I can't be sure it is related to Matomo 4; I just noticed it since the update. Sorry.
Are you using the Drupal module? Could this be related?
Let's go on with this here: #17572 |
According to one user, this has helped work around the issue with the Google Ads submission.
Similar to #6552 and #15273, but as a robots.txt file as well as the meta tags.