@MichaIng opened this Issue on April 24th 2021


With #16795 a robots.txt has been implemented which breaks tracking of bots. Tracking bots to learn which pages search engine crawlers visit, and how often, is valuable information, hence bots should by default be tracked like any other visitor. There is even a plugin available to track bots separately based on user agent, which entirely lost its purpose with the robots.txt. I did not fully understand the reason for #16795, but it seems to be based on a single user having issues with Google Ads submission. In the case of a single user, I'd say a custom solution makes more sense than breaking a generally useful feature for all users with a file that is re-created on every Matomo update, and which hence cannot be removed without regular manual intervention.

What does make sense is preventing the Matomo login page from being indexed, as it is likely never meant to be public. But that is already the case via a meta tag, which is generally the better way to do it (it saves crawlers from reading an additional file every time): https://github.com/matomo-org/matomo/blob/4.x-dev/plugins/Login/templates/loginLayout.twig#L3-L5
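For reference, the standard form of such a noindex directive in the page head (a generic example, not a verbatim copy of the linked template) is:

```html
<!-- Keeps the page out of search indexes and stops link following,
     while bots can still request the page itself. -->
<meta name="robots" content="noindex,nofollow">
```

Unlike a robots.txt rule, this does not block bots from fetching the page at all, so tracker requests made from it are unaffected.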

In the case of Google and Bing, crawlers are still not tracked after removing the robots.txt; they fail on the tracker PHP request. But I didn't investigate this much further without first having confirmation that tracking bots is generally a wanted feature, which in turn means removing the robots.txt that breaks it.

@tsteur commented on April 26th 2021 Member

Hi @MichaIng it did affect actually quite a few users which is why it was added. Generally though, there's a wildcard configured for the tracking endpoints: https://github.com/matomo-org/matomo/blob/4.3.0-b3/robots.txt#L30-L32 and things should work (in theory). Is it maybe that you use a different endpoint for tracking? I see we could for example also add js/index.php and js/tracker.php.

@MichaIng commented on April 26th 2021

The wildcard + Disallow means that all clients who respect the robots.txt will not read the endpoints.

The idea of the robots.txt was that the tracking endpoints shall be blocked for all bots (hence the wildcard), while only known search engine crawlers shall be kept from crawling any of the contained files. This was required as otherwise wget fails to access anything (e.g. reports) as well.
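The two-part structure described here would look roughly like this (a sketch, not the exact shipped file; Googlebot and Bingbot stand in for the long list of known crawlers):

```txt
# All clients: keep the tracking endpoints out of reach.
User-agent: *
Disallow: /matomo.php
Disallow: /piwik.php

# Known search engine crawlers: block everything.
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /
```

Since a client follows only the most specific group matching its user agent, tools like wget fall under the wildcard group and keep access to everything else (e.g. report downloads), while the named crawlers are blocked entirely.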

The following would be required to allow access to the tracking endpoints while blocking search engine crawlers from everything else.

User-agent: *
Allow: /matomo.php
Allow: /piwik.php

But that would still not be sufficient, as in the first place piwik.js/matomo.js need to be allowed, along with all other files that might be indirectly invoked by those scripts:

User-agent: *
Allow: /matomo.js
Allow: /matomo.php
Allow: /piwik.js
Allow: /piwik.php

@tsteur commented on April 26th 2021 Member

Thanks for the reply @MichaIng. I was reading the htaccess wrong and it is blocking all bots. I'll try to follow up on this internally next week or the week after and get back to you. We may need to have some kind of option for this and re-evaluate whether it actually helped with the Google Ads submission or not.

@tsteur commented on April 26th 2021 Member

Moving this for now into 4.3 milestone but there's nothing to be done just yet. We need to discuss what could be options to make this work again.

@tsteur commented on May 9th 2021 Member

We'll remove the rules for matomo.php and piwik.php for now and see if it causes issues again re Google.

@MichaIng commented on May 10th 2021

Many thanks. matomo.js and piwik.js need to be allowed as well, AFAIK. Will test it.

@tsteur commented on May 10th 2021 Member

@MichaIng there shouldn't be a need for the bots to access these

@mattab commented on May 10th 2021 Member

@tsteur I think the .js files are needed for the bots that execute JavaScript if you want them to also execute Matomo JavaScript and track them...

@MichaIng commented on May 10th 2021

But the JavaScript is what is added to the document via tracking code. And the JavaScript then performs the actual tracker call:

<!-- Matomo -->
<script type="text/javascript">
  var _paq = window._paq = window._paq || [];
  /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
  (function() {
    var u="https://domain.com/matomo/";
    _paq.push(['setTrackerUrl', u+'matomo.php']);
    _paq.push(['setSiteId', '1']);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
    g.type='text/javascript'; g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<!-- End Matomo Code -->

Results in:

<script async src="https://domain.com/matomo/matomo.js"></script>

Which performs the matomo.php call, or do I misunderstand the logic?
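To illustrate the logic in question, here is a simplified model of the command-queue pattern the snippet relies on (an assumption-level sketch, not Matomo's actual code; the `tracker` object is a hypothetical stand-in for what matomo.js creates):

```javascript
// Page snippet: queue configuration commands while matomo.js is still loading.
// _paq is just a plain array at this point.
var _paq = [];
_paq.push(['setTrackerUrl', 'https://domain.com/matomo/matomo.php']);
_paq.push(['setSiteId', '1']);
_paq.push(['trackPageView']);

// Minimal stand-in for the tracker object matomo.js would create on load.
var tracker = {
  trackerUrl: null,
  siteId: null,
  requests: [],
  setTrackerUrl: function (url) { this.trackerUrl = url; },
  setSiteId: function (id) { this.siteId = id; },
  trackPageView: function () {
    // In the real script this becomes an HTTP request with a long query string.
    this.requests.push(this.trackerUrl + '?idsite=' + this.siteId + '&rec=1');
  }
};

// What matomo.js conceptually does once loaded: drain the queue in order.
_paq.forEach(function (cmd) {
  tracker[cmd[0]].apply(tracker, cmd.slice(1));
});

console.log(tracker.requests[0]);
// → https://domain.com/matomo/matomo.php?idsite=1&rec=1
```

So the matomo.php call only happens after matomo.js has loaded and replayed the queued commands, which is consistent with the crawler failing on matomo.js first when it is blocked.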

Now I wonder why the short code snippet does not do the matomo.php call directly, like it is for non-JS clients:

<noscript><p><img src="https://domain.com/matomo/matomo.php?idsite=1&amp;rec=1" style="border:0;" alt="" /></p></noscript>

But at least the Google crawler loads scripts. If blocked by the robots.txt, the mobile-friendly test does not show a matomo.php access error anymore but a matomo.js error instead, which indicates that it is loaded first.

@tsteur commented on May 10th 2021 Member

@tsteur I think the .js files are needed for the bots that execute JavaScript if you want them to also execute Matomo JavaScript and track them...

@mattab that could actually be a good thing, as then the accessible matomo.php maybe wouldn't be an issue either. It'd be just like someone with an ad blocker etc.

We can add it if really needed, but I reckon the current state could be a good compromise between making bot tracking work from the importer script and not having the crawlers access the JS. That said, we can add it for people who want to track bots without the importer.

@mattab commented on May 11th 2021 Member

Well, as a workaround, people could always go into their robots.txt and manually delete the matomo|piwik.js lines? So I suppose it's fine this way as well.

@MichaIng commented on May 11th 2021

tracking bots work from the importer script

You mean calling matomo.php directly from the HTML <script> snippet, like the <noscript> variant does? If so, what would that look like (the admin interface only suggests the variant pasted above)? I noticed that the matomo.php call made by matomo.js has a long query string which passes client info, while the suggested <noscript> call passes the site ID only. Does matomo.php then gather client info from what the webserver passes, or is the info incomplete?
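For context, the tracker endpoint can also be called directly via Matomo's Tracking HTTP API, which is what the importer script does. A hand-built request can pass more than the site ID (domain.com is a placeholder; idsite, rec, action_name, url and urlref are documented Tracking API parameters; wrapped here for readability, it is a single query string):

```txt
https://domain.com/matomo/matomo.php?idsite=1&rec=1
    &action_name=Example%20page
    &url=https%3A%2F%2Fdomain.com%2Fpage
    &urlref=https%3A%2F%2Fexample.org%2F
```

For parameters that are omitted, matomo.php falls back to what the HTTP request itself carries (e.g. the User-Agent header and client IP); purely client-side details such as screen resolution are only available when matomo.js assembles the request.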

Well, as a workaround, people could always go in their robots.txt and manually delete the matomo|piwik.js lines?

Definitely, but it isn't great when one needs to do this manually after every Matomo update. I apply beta versions as well, which adds up to regular SSH sessions to remove/alter that robots.txt so it does not give crawlers contradicting information. So if we find a way to prevent crawlers from indexing the Matomo login page without breaking either bot tracking or Google Ads submission, that would be awesome.

Btw, could someone give me a link regarding the Google Ads submission issue? I don't yet understand how blocking bot access can have an effect there. The linked issues I found are about preventing search engines from indexing the Matomo login page, which is totally reasonable but would only require blocking index.php (AFAIK, non-HTML resources are not indexed anyway). All subdirectories have a .htaccess to prevent access, so that is covered as well.
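If indexing of the login page were the only concern, a minimal robots.txt for that could be as small as this sketch:

```txt
User-agent: *
Disallow: /index.php
```

Though the login page is also reachable under the bare directory URL, so the existing noindex meta tag remains the more reliable safeguard.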

@tsteur commented on May 11th 2021 Member

@MichaIng we've added the tracker files to the allow list for now, so it should all just work.

@MichaIng commented on May 11th 2021

Many thanks! I'll give it a try.

Btw, the long user agent list is still required to allow e.g. CSV report downloads via wget, right? I wonder if it would be easier (and safer) to apply the rules to all clients via User-agent: * and then grant report access to wget specifically. Not sure if there are other download tools that respect robots.txt, but there are definitely many more crawler/bot user agents, so the list is difficult to maintain.
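The inverted approach suggested here could be sketched like this (an illustration only; wget announces itself with the user agent "Wget", and an empty Disallow means everything is allowed):

```txt
# Block everything for robots.txt-respecting clients,
# except the tracking endpoints.
User-agent: *
Disallow: /
Allow: /matomo.js
Allow: /matomo.php
Allow: /piwik.js
Allow: /piwik.php

# Grant wget full access, e.g. for scripted report downloads.
User-agent: Wget
Disallow:
```

Because a client follows only the most specific group matching its user agent, wget would ignore the wildcard rules entirely.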

@tsteur commented on May 11th 2021 Member

@MichaIng give it a go and let us know if it doesn't work. wget user agent should not be blocked.

@MichaIng commented on May 14th 2021

Confirmed it's working fine now.

wget user agent should not be blocked.

Yes, that is clear. The idea was to simplify the robots.txt by not including a long and hard-to-maintain list of known bot/crawler user agents, but instead excluding known download tool user agents, which is for now only a single one (wget). Otherwise I could add another long list of known crawlers which are currently missing. But it's really a different topic, and it's not a big issue as long as the most common search engines are covered, IMO. Evil crawlers do not respect robots.txt anyway 😉.

This Issue was closed on May 10th 2021