Re-allow tracking bots #17497
Hi @MichaIng, it actually did affect quite a few users, which is why it was added. Generally though, there's a wildcard configured for the tracking endpoints: https://github.com/matomo-org/matomo/blob/4.3.0-b3/robots.txt#L30-L32 and things should work (in theory). Is it maybe that you use a different endpoint for tracking? I see we could for example also add …
The wildcard + Disallow means that all clients who respect the robots.txt are blocked from the tracking endpoints, which is the opposite of the idea. The following would be required to allow access to the tracking endpoints while blocking access for search engine crawlers to everything else:
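(A minimal sketch of what such rules could look like, assuming Matomo is served from the web root; note that `Allow` is not part of the original robots.txt standard but is honoured by the major crawlers.)

```
User-agent: *
Allow: /matomo.php
Allow: /piwik.php
Disallow: /
```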
But that would still not be sufficient, as in the first place …
Thanks for the reply @MichaIng. I was reading the htaccess wrong and it is blocking all bots. I'll try to follow up on this internally next week or the week after and get back to you. We may need to have some kind of option for this and re-evaluate whether it actually helped for the Google ads submission or not.
Moving this for now into the 4.3 milestone, but there's nothing to be done just yet. We need to discuss what the options could be to make this work again.
We would for now remove the rules for the tracking endpoints.
Many thanks. matomo.js and piwik.js need to be allowed as well, AFAIK. Will test it.
@MichaIng there shouldn't be a need for the bots to access these JS files.
@tsteur I think the .js files are needed for the bots that execute JavaScript, if you want them to also execute the Matomo JavaScript and be tracked...
But the JavaScript is what is added to the document via the tracking code, and the JavaScript then performs the actual tracker call:
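(Sketch of the default Matomo tracking snippet for reference; `matomo.example.com` and site ID `1` are placeholders.)

```html
<script>
  var _paq = window._paq = window._paq || [];
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u = "//matomo.example.com/";
    _paq.push(['setTrackerUrl', u + 'matomo.php']);  // tracker endpoint, called by matomo.js
    _paq.push(['setSiteId', '1']);
    // inject matomo.js asynchronously into the document
    var d = document, g = d.createElement('script'), s = d.getElementsByTagName('script')[0];
    g.async = true; g.src = u + 'matomo.js';
    s.parentNode.insertBefore(g, s);
  })();
</script>
```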
Results in:
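(Roughly, the injected DOM node; host again a placeholder:)

```html
<script async src="//matomo.example.com/matomo.js"></script>
```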
Which performs the matomo.php call, or do I misunderstand the logic? Now I wonder why the short code snippet does not do the matomo.php call directly, like it is done for non-JS clients:
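(Sketch of the image-tracker fallback that ships with the default tracking code for non-JS clients; placeholder host and site ID as above.)

```html
<noscript>
  <p><img src="//matomo.example.com/matomo.php?idsite=1&amp;rec=1" style="border:0;" alt="" /></p>
</noscript>
```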
But at least the Google crawler loads scripts. If blocked by the robots.txt, the mobile-friendly test does not show a matomo.php access error anymore but a matomo.js error instead, which indicates that it is loaded first.
@mattab that could actually be a good thing, as then the accessible matomo.php maybe wouldn't be an issue either. It would be just like someone with an ad blocker etc. We can add it if really needed, but I reckon it could be a good compromise between making tracking bots work from the importer script and not having the crawlers access the JS. But we can add it for people that want to track bots without the importer.
Well, as a workaround, people could always go into their robots.txt and manually delete the matomo|piwik.js lines? So I suppose it's fine this way as well.
You mean calling …
Definitely, but it isn't great when one needs to do this manually after every Matomo update. I apply beta versions as well, which sums up to regular SSH access to remove/alter that robots.txt.

Btw, could someone give me a link regarding the Google ads submission issue? I didn't understand yet how blocking bot access can have an effect there. The linked issues I found are about preventing search engines from indexing the Matomo login page, which is totally reasonable but would require blocking …
@MichaIng we've added the tracker files for now to the allow list, so it should all just work.
Many thanks! I'll give it a try. Btw, the long user agent list is still required to allow e.g. CSV report downloads via wget.
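(A hedged example of what such a download could look like via Matomo's HTTP Reporting API; host, site ID, and token are placeholders.)

```sh
# Fetch yesterday's visits summary as CSV from the Matomo Reporting API.
wget -O report.csv "https://matomo.example.com/index.php?module=API&method=VisitsSummary.get&idSite=1&period=day&date=yesterday&format=csv&token_auth=YOUR_TOKEN"
```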
@MichaIng give it a go and let us know if it doesn't work. wget user agent should not be blocked. |
Confirmed it's working fine now.
Yes, that is clear. The idea was to simplify the …
Summary
With #16795 a robots.txt has been implemented which breaks tracking of bots. Tracking bots to derive the sites and frequency of search engine crawlers is valuable information, hence bots should by default be tracked like any other visitor. There is even a plugin available to track bots separately based on user agent, which totally lost its purpose with the robots.txt. I did not fully understand the reason for #16795, but it seems to be based on a single user having issues with Google ads submission. However, in the case of a single user, I'd say a custom solution makes more sense than breaking a generally useful feature for all users with a file that is installed and re-created on every Matomo update, and which hence cannot be avoided without regular manual interaction.

What makes sense is preventing the Matomo login page from being indexed, as this is likely never meant to be public. But that is the case already via meta tag, which is generally the better way to do it (it spares crawlers from having to read an additional file every time): https://github.com/matomo-org/matomo/blob/4.x-dev/plugins/Login/templates/loginLayout.twig#L3-L5
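(For reference, the linked template emits a robots meta tag along these lines; see the linked file for the exact markup.)

```html
<meta name="robots" content="noindex,nofollow">
```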
In the case of Google and Bing, crawlers are still not tracked after removing the robots.txt, failing on the tracker PHP request. But I didn't investigate this much further without first having a confirmation that tracking bots is generally a wanted feature, which in turn means removing the robots.txt that breaks it.
that breaks it.The text was updated successfully, but these errors were encountered: