Suggestion: Unblock crawls by robots.txt #18175

soratako · 2021-10-18T15:52:03Z

Hi!

I received an email notification from the Google Search Console. The email says "Indexed, though blocked by robots.txt".

So I checked the Matomo source code. It seems that crawling is blocked in robots.txt, in addition to the "robots noindex" meta tag in the head of the page.

According to this help page, we only need to add tags to block indexing. The help page also says "Don't use a robots.txt file as a means to hide your web pages from Google search results."

So I suggest removing the crawl blocking specification from robots.txt.

These changes were made by #16795. It seems they were made on the understanding that Google's crawlers would then be unable to crawl the site. This method does prevent Google from crawling, but the site can still show up in Google search results.

tsteur · 2021-10-18T18:36:00Z

Hi @soratako we currently don't have any plans to change this. Is there any particular reason you would want your Matomo to be crawled by search engines? Generally it shouldn't appear in the search results anyway.

soratako · 2021-10-18T22:29:41Z

@tsteur Is that true? So why? After I received the email, I searched for my domain (a subdomain for matomo like aaa.example.com) and it appeared in Google search results. However, it is only the title of the page and does not display the contents of the page. If you're right, this may be due to other causes. Anyway, I don't want it to appear in search results. Is there any other way to solve this?

tsteur · 2021-10-18T23:09:10Z

Not sure why it would appear, @soratako. The robots.txt says to not index and, as you noticed, the meta tag <meta name="robots" content="noindex,nofollow"> is set too.

Do you think anything else is missing? Or do you mean the robots.txt shouldn't be there so it can see the meta tag?

soratako · 2021-10-19T16:53:01Z

@tsteur Hmm. As far as I understand the help page, robots.txt tells the crawler not to access the site. It doesn't say "don't index the site". Because of robots.txt, Google's crawler cannot access the page and so cannot read the meta tag <meta name="robots" content="noindex,nofollow">. The noindex directive is therefore never seen, and the site can still be indexed. If we just want the site not to appear in search results, we don't need to use robots.txt.
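The mechanism described above can be illustrated with a minimal Python sketch using the standard library's `urllib.robotparser`. It is not Matomo code and does not model Google's actual crawler; the URL, the page HTML, and the helper name `crawler_sees_noindex` are assumptions made for illustration. It shows only that a crawler which obeys a blanket `Disallow: /` never fetches the page, so it never gets a chance to read the noindex meta tag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical blanket-disallow robots.txt, as discussed in this thread.
BLOCKING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical page markup carrying the noindex meta tag.
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex,nofollow"></head>'
    "<body>Sign in</body></html>"
)

# Placeholder URL, following the aaa.example.com example in the thread.
PAGE_URL = "https://aaa.example.com/index.php"


def crawler_sees_noindex(robots_txt: str, url: str, html: str) -> bool:
    """Return True only if a rule-abiding crawler may fetch the page
    and the fetched markup contains a noindex directive."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch("Googlebot", url):
        # The page is never requested, so its meta tags are never read.
        return False
    return "noindex" in html


blocked = crawler_sees_noindex(BLOCKING_ROBOTS_TXT, PAGE_URL, PAGE_HTML)
unblocked = crawler_sees_noindex("", PAGE_URL, PAGE_HTML)
print(blocked, unblocked)
```

With the blanket disallow the directive is never seen; with an empty robots.txt the crawler may fetch the page and can honor the tag.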

soratako · 2021-10-19T17:25:27Z

For example, if you search for demo.matomo.cloud on Google, the Matomo site is displayed in the search results. https://www.google.com/search?q=demo.matomo.cloud This site also has a robots.txt and the meta tag <meta name="robots" content="noindex,nofollow">.

tsteur · 2021-10-20T03:51:52Z

OK thanks for this. @sgiehl do you have any thoughts there?

#16795 was generally added to work around a Google Ads issue. It's not clear whether removing the robots.txt could regress something or not. CC @mattab

Generally the problem seems to be that robots.txt doesn't prevent the site from being listed in search results: the crawler cannot find the noindex meta tag because it is blocked from requesting the site.

sgiehl · 2021-10-20T14:48:51Z

I'm not able to say anything about possible side effects with Google Ads.
I guess we shouldn't remove the robots.txt completely, as it still should block access to static files that can't be "protected" otherwise.
Also if we allow access to files like matomo.js or matomo.php I doubt a domain would be removed from the search index anyway.
For content sent by our software we could also consider sending a header like X-Robots-Tag: noindex, nofollow (see https://developers.google.com/search/docs/advanced/robots/robots_meta_tag?hl=en)
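As a rough illustration of the header idea above, here is a minimal Python WSGI sketch that attaches `X-Robots-Tag: noindex, nofollow` to a response. Matomo itself is a PHP application, so this is not how Matomo would implement it; the `app` and `call_app` names and the response body are hypothetical. Note that, like the meta tag, the header only reaches a crawler that robots.txt allows to request the URL.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Hypothetical WSGI app that marks every response as noindex, nofollow."""
    body = b"<html><head><title>Analytics</title></head><body>Sign in</body></html>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header-based equivalent of <meta name="robots" content="noindex,nofollow">.
        ("X-Robots-Tag", "noindex, nofollow"),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path="/"):
    """Invoke the app in-process (no network) and return status, headers, body."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app("/index.php")
print(status, headers["X-Robots-Tag"])
```

A header has one practical advantage over the meta tag: it can also be attached to non-HTML responses, which have no head element to carry a meta tag.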

Cyriuz · 2022-09-01T11:19:01Z

I have the same issue, did anyone try removing the robots.txt?

verybigelephants · 2023-08-03T08:12:16Z

please excuse me for resurrecting such an old topic, but i have just recently noticed that Matomo comes with a robots.txt that allows indexing matomo.php and some js files

why?

shouldn't it be just this?

User-agent: *
Disallow: /

sgiehl · 2023-08-03T08:16:08Z

Matomo can be configured to track bots as well. By default the tracker will discard requests from bots, though.
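The difference between the blanket rule proposed above and a file that keeps the tracker endpoints reachable can be sketched with Python's `urllib.robotparser`. The robots.txt content below is a hypothetical example, not the file Matomo ships, and the host name and `allowed` helper are placeholders. Python's parser applies the first matching rule, so the `Allow` lines are listed before `Disallow: /`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: tracker endpoints stay reachable, everything else is blocked.
SELECTIVE_ROBOTS_TXT = """\
User-agent: *
Allow: /matomo.php
Allow: /matomo.js
Disallow: /
"""

# The blanket variant suggested earlier in the thread.
BLANKET_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return whether a rule-abiding bot may request the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://aaa.example.com" + path)


for path in ("/matomo.php", "/matomo.js", "/index.php"):
    print(path, allowed(SELECTIVE_ROBOTS_TXT, path), allowed(BLANKET_ROBOTS_TXT, path))
```

Under the selective file a bot that loads a tracked page can still fetch the tracker script and send tracking requests, which matters only when Matomo is configured to track bots; under the blanket file it cannot.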

verybigelephants · 2023-08-03T08:17:55Z

oh ok! thank you for the explanation

soratako added the Potential Bug label Oct 18, 2021

sgiehl added the Enhancement label and removed the Potential Bug label Nov 10, 2022

sgiehl added this to the For Prioritization milestone Nov 10, 2022
