
Create robots.txt to prevent bots from indexing Matomo app #16795

Merged
merged 4 commits into 4.x-dev from robots.txt on Nov 30, 2020

Conversation

mattab
Member

@mattab mattab commented Nov 25, 2020

According to one user, this has helped work around the issues with the Google Ads submission.

Similar to #6552 and #15273, but using a robots.txt file in addition to the meta tags.
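For context, the meta tags referred to are presumably the noindex robots meta tags the Matomo UI already emits (per #6552/#15273), and a blanket robots.txt as proposed here would look roughly like the sketch below (an illustration; the shipped file may differ):

User-agent: *
Disallow: /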

Review

  • Functional review done
  • Usability review done (is anything unclear, or could anything cause people to reach out to support?)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

@mattab mattab added this to the 4.0.1 milestone Nov 25, 2020
Member

@sgiehl sgiehl left a comment


Shouldn't hurt to add that I guess

@Findus23
Member

Just keep in mind that this means if people use wget to e.g. download CSV reports, their scripts will break.
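As an aside, GNU wget's documentation describes robots.txt handling for recursive retrievals (-r / --mirror), and a recursive job can be told to ignore the file with the documented robots setting. A sketch with an assumed example host:

wget -e robots=off -r https://matomo.example.org/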

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

@tsteur tsteur modified the milestones: 4.0.1, 4.1.0 Nov 25, 2020
@tsteur
Member

tsteur commented Nov 25, 2020

I was going to move this to 4.1, as it can break things and such a change should not be in a patch release while we're trying to make the Matomo 4 upgrade stable, but I then moved it back to 4.0.1 as we have only rolled out Matomo 4 to a few users so far. Nonetheless, it's a bit risky to put it into 4.0.1 without any notice etc. @mattab it would be great to mention this in the Matomo 4 changelog right away, in the initial list of things.

Can you also add a developer changelog entry?

@tsteur
Member

tsteur commented Nov 25, 2020

@mattab

maybe below would do as well (not sure we can define multiple rules)?

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
and https://developers.google.com/search/docs/advanced/robots/create-robots-txt
and http://www.robotstxt.org/db.html

@tsteur tsteur modified the milestones: 4.0.1, 4.0.2, 4.0.3 Nov 26, 2020
@mattab
Member Author

mattab commented Nov 29, 2020

I didn't realise it could potentially break BC. Then maybe we could instead try setting the robots.txt to only:

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

We'd have no guarantee it helps with the Google Ads malware mis-identification issue, but at least it wouldn't break BC?

@tsteur
Member

tsteur commented Nov 29, 2020

I have no big preference. I suppose we could always try and see if it helps? We could also add more Google bots if needed, see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

@mattab
Member Author

mattab commented Nov 29, 2020

I asked the 6 people who had experienced the issue and will see whether adding the simple robots.txt in this PR helps them.

If they confirm this workaround works for them, we could:

  1. consider breaking BC (risky / not great)
  2. or instead try listing in robots.txt all the Google bots from the link below (so wget and any other user agents can still fetch reports), and hope it works then, or ask them again to test it if they're willing (see the sketch after the quoted line below)

We could also add more Google bots if needed, see https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers
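A sketch of what option 2 could look like, using a few of the crawler tokens documented on that page (the exact list to be shipped may differ):

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: AdsBot-Google
User-agent: AdsBot-Google-Mobile
User-agent: Mediapartners-Google
Disallow: /

User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php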

@tsteur
Member

tsteur commented Nov 30, 2020

FYI, I tested this on the demo using https://technicalseo.com/tools/robots-txt/ and things should work like that.

Added more crawlers to the list @mattab

@mattab
Member Author

mattab commented Nov 30, 2020

LGTM

@tsteur tsteur merged commit 23739ca into 4.x-dev Nov 30, 2020
@tsteur tsteur deleted the robots.txt branch November 30, 2020 02:33
@MichaIng
Contributor

MichaIng commented Jan 8, 2021

And Google shows an annoying warning in the search console that the website accessed can't be checked completely as some resources (Matomo) are blocked (at least it did years ago).

That is true, such a warning shows up for each web page in Google Search Console, since matomo.php/piwik.php cannot be accessed: https://support.google.com/webmasters/answer/6352293#blocked-resources
It also breaks tracking of bots in general then, I guess, which in turn renders the Bot Tracker plugin obsolete: https://plugins.matomo.org/BotTracker

It is probably reasonable to not index the Matomo web UI, but it could also be interesting to track bots 🤔.

@tsteur
Member

tsteur commented Jan 10, 2021

@MichaIng I suppose an easy fix would be for the BotTracker plugin to delete the robots.txt regularly. E.g. it could do this on plugin activation and on plugin and core updates; it could even do it in a regular task, say every hour or daily. This would pretty much ensure the file never exists and bots can be tracked, except if the file is not writable (deletable). The plugin could probably also mark this file to be ignored by the "file integrity check" so it won't complain if the file doesn't exist. Happy to give more hints if someone is keen on implementing this. A rough sketch of such a task follows below.
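A minimal sketch of that idea, assuming Matomo's scheduled tasks API (Piwik\Plugin\Tasks) and the PIWIK_DOCUMENT_ROOT constant; class and method names are illustrative and not part of the actual BotTracker plugin:

<?php
namespace Piwik\Plugins\BotTracker;

// Hypothetical task: delete robots.txt regularly so crawlers keep hitting the tracker.
class Tasks extends \Piwik\Plugin\Tasks
{
    public function schedule()
    {
        // Run once per hour; daily('deleteRobotsTxt') would work as well.
        $this->hourly('deleteRobotsTxt');
    }

    public function deleteRobotsTxt()
    {
        $file = PIWIK_DOCUMENT_ROOT . '/robots.txt';
        if (file_exists($file) && is_writable($file)) {
            unlink($file);
        }
    }
}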

@MichaIng
Contributor

MichaIng commented Jan 10, 2021

Yes, this is what I did, but it sounds more like a workaround than a good solution. Cleaner would probably be to solve it via an X-Robots-Tag header set within PHP, plus an option in Matomo (allow/disallow bot crawling, which implies allowing/disallowing crawlers being tracked by Matomo) that could then be switched when installing the BotTracker plugin. I guess making it more fine grained and block only files that are not required for loading the tracking js by default wouldn't block much that is not blocked by .htaccess/webserver or authentication anyway, right?
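A minimal sketch of that header-based approach, assuming it would be emitted for UI responses only and gated by a hypothetical "allow bot crawling" setting (not an existing Matomo option):

<?php
// Hypothetical setting; in Matomo this could come from a system setting or config value.
$allowBotCrawling = false;

if (!$allowBotCrawling) {
    // Tells compliant crawlers not to index this response, without blocking the request itself,
    // so the tracker endpoints stay reachable and bots can still be tracked.
    header('X-Robots-Tag: noindex, nofollow');
}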

But even though I removed the robots.txt and ensured that the Google crawler is able to check and index piwik.php and matomo.php, it still fails to load them with the tracking query string when crawling other files:

Calling the exact same URL + query string manually works and is successfully tracked in Matomo. The cases where Google fails are a directory index and a simple HTML page without any CSS or JavaScript, aside from what Cloudflare injects to load the Piwik app into all pages.

I'm not sure how to debug this. We'd probably need a few other cases to confirm it is a general issue and not limited to our and/or similar setups, or e.g. to the Cloudflare app (although the query string is perfectly fine, so from that point on I'm not sure how it could have any effect). And if it is a general issue, we'd need to ask the Google community, I think.

Btw, good to know that wget respects robots.txt, I would have never guessed that 😄!

@tsteur
Member

tsteur commented Jan 11, 2021

I guess making it more fine grained and block only files that are not required for loading the tracking js by default wouldn't block much that is not blocked by .htaccess/webserver or authentication anyway, right?

We could maybe block only index.php, matomo.php and piwik.php, but that would be basically the same as now, pretty much.

But even though I removed the robots.txt and ensured that the Google crawler is able to check and index piwik.php and matomo.php, it still fails to load them with the tracking query string when crawling other files:

Sorry, I'm not quite understanding this part. So you removed robots.txt, but the Google crawler still fails to access the tracker when there is a tracking query string? As you are using Cloudflare, it might be good to check whether caching for this endpoint is disabled. I'm not too much into Cloudflare, unfortunately. Maybe it has the robots.txt cached?

@MichaIng
Contributor

MichaIng commented Jan 11, 2021

We could maybe block only index.php, matomo.php and piwik.php, but that would be basically the same as now, pretty much.

Since matomo.php and piwik.php are required for tracking, those would need to be allowed. index.php could be blocked.
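A hedged sketch of that variant: keep the tracker endpoints crawlable and block only the UI entry point (whether this is enough for the Google Ads case is untested):

User-agent: *
Disallow: /index.php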

Sorry, I'm not quite understanding this part. So you removed robots.txt, but the Google crawler still fails to access the tracker when there is a tracking query string?

Exactly, the crawler loads piwik.js successfully but fails to call piwik.php, in exactly the same way as when blocked by robots.txt. Caching was my first idea as well, but live-testing piwik.php directly succeeds, which would fail as well if robots.txt or X-Robots-Tag blocked it. But probably I've overlooked something, or Google uses multiple robots.txt caches depending on how a resource is accessed; I'll keep an eye on it.


EDIT: The issue persists, and neither Matomo nor the webserver nor PHP reports any error. Probably the mobile-friendly test tool or the Google crawler itself simply refuses to access resources with such a long query string; I'm not sure, as it doesn't give any information anywhere. I tried to enter the whole URL with query string into the Search Console URL inspection, but it won't run the test and doesn't show any reason why; neither do the related help/doc pages give any hint.

@remz-otw

@MichaIng I ran into the same issue.
Both the Search Console and the mobile-friendly test tool fail to load resources from matomo.domain.com/matomo.php?...
Matomo reports only 20 robot requests with the Bot Tracker module, while at the same time my server-side statistics report more than 300 requests from Googlebot.

This issue started when I updated to Matomo 4.
I've now updated to the 4.3 release, with robots.txt updated to allow the tracking files.

It's a Drupal website with the Matomo Drupal module (if that information helps).
I've searched all around and found nothing anywhere; it seems few people run into this issue. Don't hesitate to ask for more information about my config.

@MichaIng
Contributor

I've opened a topic in the forum, hoping for more reports to compare with there: https://forum.matomo.org/t/google-and-bing-crawlers-fail-to-call-the-tracker-endpoint/41760?u=michaing
Not sure whether it's better to open a new issue here on GitHub?

Did it work with Matomo v3 until just recently? That would be a good hint that it is related to a change in Matomo v4 and not a change in search engine crawlers.

@remz-otw

I can't be sure it is related to Matomo 4; it's just that I've noticed it since the update. Sorry.

@remz-otw

Are you using the Drupal module? Could this be related?

@MichaIng
Contributor

MichaIng commented May 17, 2021

Let's go on with this here: #17572
