@MichaIng opened this Issue on May 17th 2021

Now that #17497 has been done, there is another issue: most bots, especially the important Google and Bing bots, fail to call the tracker PHP endpoints matomo.php/piwik.php, even though robots.txt and headers allow it and some bots are successfully tracked.

Matomo, PHP and the webserver do not report any errors, so the request seems to fail right at the crawler, possibly due to an overly long query string or invalid characters? Example URL shown by the Google mobile-friendly test:

https://our.domain.com/matomo/piwik.php?action_name=<site_title>&idsite=1&rec=1&r=525912&h=7&m=3&s=1&url=<site_url>%2F&_id=d0bed9fe69afde68&_idn=1&_refts=0&send_image=0&cookie=1&res=412x732&pv_id=DAfyXy&pf_net=0&pf_srv=2&pf_tfr=0&pf_dm1=20

Further details:

Expected Behavior

All bots that crawl the website are tracked and visible in Matomo.

Current Behavior

Many bots fail to call the tracker PHP endpoints and hence do not appear in Matomo.

Possible Solution

None found so far.

Steps to Reproduce (for Bugs)

  1. Upgrade to latest Matomo v4.3 release candidate
  2. Install BotTracker app
  3. Watch AhrefsBot, Baiduspider, YandexBot and Wget being tracked
  4. No other bot is tracked, despite respective accesses.
  5. For the Google crawler, the mobile-friendly test reveals that the crawler is trying to perform the correct request (URL) but fails to do so with "Other error". Manually calling the same URL succeeds and shows a related Matomo visit/page view.

Context

The behaviour of search engine crawlers, which pages they crawl, how often etc, can be important to detect issues, optimise crawler settings/schedules, etc to balance server load and optimise SEO.

Your Environment

  • Matomo Version: 4.3.0-rc2
  • PHP Version: 8.0.5
  • Server Operating System: Debian Bullseye
  • Additionally installed plugins:
    API, Actions, Annotations, BotTracker 2.01, BulkTracking, Contents, CoreAdminHome, CoreConsole, CoreHome, CorePluginsAdmin, CoreUpdater, CoreVisualizations, CustomJsTracker, DBStats, DarkTheme 1.1.6, Dashboard, DevicePlugins, DevicesDetection, Diagnostics, Goals, Heartbeat, ImageGraph, Insights, Installation, Intl, LanguagesManager, Live, LogViewer 4.0.1, Login, Marketplace, Monolog, Morpheus, Overlay, PagePerformance, PrivacyManager, Proxy, Referrers, Resolution, SEO, SegmentEditor, SitesManager, Transitions, UserLanguage, UsersManager, VisitFrequency, VisitTime, VisitorInterest, VisitsSummary, WebsiteMeasurable

@MichaIng commented on May 17th 2021

@remz-otw
I don't use Drupal. So while it looks like it's related to Matomo v4, we are not 100% certain. As you have access logs, can you verify that the matomo.js request is done by the Google bot, but the expected following matomo.php request is not done at all (does not reach the server)?

@sgiehl commented on May 17th 2021 Member

@MichaIng Most bots are not able to load the JavaScript. Google might ignore it to avoid unwanted tracking requests.
If you want to fully track all bots, you should not use JavaScript tracking. It would be better to use log importing for this.

@remz-otw commented on May 17th 2021

@sgiehl I agree, but it is still an error, which is not a good thing from an SEO point of view.
I also added an img tag:
<noscript><p><img src="https://mymatomodomain.com/matomo.php?idsite=1&amp;rec=1&amp;bots=1" style="border:0;" alt="" /></p></noscript>
And still no bot is tracked.

@MichaIng commented on May 17th 2021

Google and Bing both load the JavaScript for sure, otherwise the matomo.php/piwik.php request with fully developed query string (see above) would not be done in the first place. So that is not the issue.

Skipping JavaScript would probably help regardless, but I use the Cloudflare app to enable tracking on all our pages, and there is currently no choice possible. Out of interest, what would the HTML code look like to have PHP tracking done without the JavaScript?


Okay, I see the possibility of tracking via image source, although wrapping it in <noscript> means that the important bots do not process it, as they do load JavaScript. Since the long query string that is built by the JavaScript is missing when using the image source method, is all client info still included with this method?

@sgiehl commented on May 17th 2021 Member

Does a request from the Google bot to the PHP endpoint reach your server? I just want to make sure the requests are actually performed by the bots. If they are listed in your access log, then it might be an issue with Matomo or maybe the BotTracker plugin. To me it sounds like Google does not perform the request for some reason. If that is the case, you could try to play around with some settings that change the request, like disabling sendBeacon or maybe forcing a POST request.

@remz-otw commented on May 17th 2021

@MichaIng I went through my logs and found an example of Googlebot requesting matomo.js but not matomo.php:
666.249.70.23 mymatomodomain.com - [16/May/2021:18:46:21 +0200] "GET /matomo.js HTTP/1.1" 200 20456 "https://mymatomodomain.com/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/90.0.4430.97 Safari/537.36"

EDIT: I didn't find any request for matomo.php from Googlebot
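
For illustration, the check described here (did a crawler fetch matomo.js but never matomo.php?) can be sketched as a small log scan. This is only a sketch: the function name is hypothetical, and it assumes access-log lines in the format shown above, with the request and the user agent in double quotes.

```javascript
// Hypothetical helper: count which tracker resources a given crawler requested.
// Assumes the request ("GET /path HTTP/1.1") is the first quoted field and
// the user agent is the last quoted field of each access-log line.
function summarizeTrackerRequests(logLines, agentPattern) {
  const summary = { js: 0, php: 0 };
  for (const line of logLines) {
    // Collect all double-quoted fields: request, referrer, user agent.
    const quoted = line.match(/"[^"]*"/g);
    if (!quoted || quoted.length < 2) continue;
    const request = quoted[0].slice(1, -1);
    const userAgent = quoted[quoted.length - 1].slice(1, -1);
    if (!agentPattern.test(userAgent)) continue;
    // Strip the query string so only the script name is compared.
    const path = (request.split(' ')[1] || '').split('?')[0];
    if (/\/(matomo|piwik)\.js$/.test(path)) summary.js += 1;
    if (/\/(matomo|piwik)\.php$/.test(path)) summary.php += 1;
  }
  return summary;
}

const sampleLog = [
  '666.249.70.23 mymatomodomain.com - [16/May/2021:18:46:21 +0200] "GET /matomo.js HTTP/1.1" 200 20456 "https://mymatomodomain.com/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/90.0.4430.97 Safari/537.36"',
];
console.log(summarizeTrackerRequests(sampleLog, /Googlebot/));
```

A result with a non-zero `js` count and a zero `php` count would match the observation above: the script is loaded, but the tracking request never reaches the server.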

@MichaIng commented on May 17th 2021

Okay, so we need to find out why Google and most other bots do not perform this final PHP request then. I wonder if it's done when using an image-based request without wrapping it into <noscript>, like:

<img src="https://domain.com/matomo.php?idsite=1&amp;rec=1" style="border:0;" alt="" />

I'll create a test page with this and run it through the mobile-friendly test.
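
A minimal sketch of how such an image tag could be generated for a test page. The helper names and the base URL are placeholders; idsite and rec are the parameters from the tag above, and further parameters would have to be taken from the tracking URL the JavaScript generates.

```javascript
// Sketch: build an image-tracker URL pointing at matomo.php.
// `baseUrl` and the parameter values are placeholders.
function buildImageTrackerUrl(baseUrl, params) {
  const url = new URL('matomo.php', baseUrl);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

function buildImageTag(trackerUrl) {
  // Inside an HTML attribute, "&" must be written as "&amp;".
  const escaped = trackerUrl.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
  return `<img src="${escaped}" style="border:0;" alt="" />`;
}

const trackerUrl = buildImageTrackerUrl('https://domain.com/', { idsite: 1, rec: 1 });
console.log(buildImageTag(trackerUrl));
```

Generating the attribute value this way avoids quoting mistakes when the query string grows.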

@remz-otw commented on May 17th 2021

OK, I added _paq.push(['setRequestMethod', 'GET']); and removed the noscript tag, and then in Search Console it loads both resources correctly.

But then bots are still not tracked.
EDIT: I was wrong, bots are tracked. Woohoo!

@remz-otw commented on May 17th 2021

Now when I use the "mobile-friendly test" tool or the "URL inspection tool" from Search Console, and even Lighthouse, the visit gets caught by Matomo.
For Lighthouse and the mobile-friendly test tool, BotTracker doesn't catch it; it is marked as "operating system: bot".

@MichaIng commented on May 17th 2021

Yep, using the raw image source tracker works here as well (mobile-friendly test, done three times, tracked three times).

So I suspect the ~long query string~ to be the issue, or possibly requests originating from JavaScript?
EDIT: I used the same URL as above, which was generated by the JavaScript and fails there, and added it as the image source: now the Google bot performs the request and gets tracked.

@remz-otw commented on May 17th 2021

I don't think it's caused by the long query string, because it now works with the long query string.
In my case, the only change in the request is &send_image=1; before, it was set to 0.
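
To compare a failing and a working tracking request like this, the two query strings can be diffed. The following is a rough sketch; the function name is hypothetical and the two URLs are shortened placeholders, not the full requests from this thread.

```javascript
// Sketch: list the query parameters that differ between two tracking URLs.
function diffQueryParams(urlA, urlB) {
  const a = new URL(urlA).searchParams;
  const b = new URL(urlB).searchParams;
  const keys = new Set([...a.keys(), ...b.keys()]);
  const changes = {};
  for (const key of keys) {
    // get() returns null when a parameter is missing on one side.
    if (a.get(key) !== b.get(key)) {
      changes[key] = { before: a.get(key), after: b.get(key) };
    }
  }
  return changes;
}

// Placeholder URLs with only a few of the parameters.
const failing = 'https://our.domain.com/matomo/piwik.php?idsite=1&rec=1&send_image=0&pv_id=DAfyXy';
const working = 'https://our.domain.com/matomo/piwik.php?idsite=1&rec=1&send_image=1&pv_id=DAfyXy';
console.log(diffQueryParams(failing, working));
```

With these placeholders, only send_image is reported as changed.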

@MichaIng commented on May 17th 2021

Jep, I just tested it as well and the exact same URL that fails from JavaScript succeeds from the image.

@sgiehl commented on May 17th 2021 Member

Matomo uses sendBeacon by default (if the browser supports it). Maybe such requests are blocked.

@remz-otw commented on May 17th 2021

Did you try changing the JavaScript code and adding
_paq.push(['setRequestMethod', 'GET']);
It works for me,
but I don't know what that involves.
Since then, the request is sent with send_image=1.

@remz-otw commented on May 17th 2021

@MichaIng Maybe try the JavaScript request with send_image=1 to test whether it is related.

@sgiehl commented on May 17th 2021 Member

_paq.push(['setRequestMethod', 'GET']);

Forcing GET as request method automatically disables the use of sendBeacon as it uses POST only.

@MichaIng commented on May 17th 2021

Btw, if the BotTracker app is disabled, are bots supposed to be tracked as regular visitors, or how are they dropped? I just disabled it, and the mobile-friendly test still reports the error and I cannot see a related visitor log. So I guess we can rule out the app as the culprit.

But I see you're already further with GET/sendBeacon 👍.

Since I use the Cloudflare app, I cannot change the script snippet and cannot add parameters to the request for production. But I'll test it as additional code in the test page.

@remz-otw commented on May 17th 2021

So now it works with sendBeacon disabled, but what is the drawback of such a config? If it is on by default in Matomo 4, there should be a good reason. Less accurate tracking or a worse user experience?

@MichaIng commented on May 17th 2021

I can verify that it works with _paq.push(['setRequestMethod', 'GET']);. Is there a way to keep using POST requests but disable sendBeacon only, to check whether it's probably POST requests in general, not sendBeacon in particular?

@sgiehl commented on May 17th 2021 Member

Matomo by default discards all visitors that are recognized as bots here:
https://github.com/matomo-org/matomo/blob/df68fbce2397570df2d5fee6c1b56db57241a10b/core/Tracker/VisitExcluded.php#L189-L197

So when the user agent is detected as a bot or the IP is within certain IP ranges, the visit is sorted out unless the parameter bots=1 is set.
You can achieve that in the tracking code by adding
_paq.push(['appendToTrackingUrl', 'bots=1']);

But to be clear here: Tracking bots within Matomo actually will ruin most of the reports. For most websites it's relevant to know how many visitors a website has and how they perform. Counting bots might explode those numbers and make them meaningless or unreliable. In most cases you may not even be able to distinguish between normal visits and bots within Matomo.

To only disable sendBeacon you could use:
_paq.push(['disableAlwaysUseSendBeacon']);
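
As a rough sketch, the variants discussed in this thread could be assembled like this. The `options` object and its field names are hypothetical; the pushed commands are the ones quoted above, plus the usual trackPageView call.

```javascript
// Sketch: assemble the _paq command queue for the variants discussed here.
// `options` and its field names are hypothetical.
function buildTrackerCommands(options) {
  const commands = [];
  if (options.forceGet) {
    // Forcing GET already rules out sendBeacon, which only uses POST.
    commands.push(['setRequestMethod', 'GET']);
  } else if (options.disableSendBeacon) {
    commands.push(['disableAlwaysUseSendBeacon']);
  }
  if (options.trackBotsAsVisits) {
    // bots=1 makes Matomo core record bots like regular visits.
    commands.push(['appendToTrackingUrl', 'bots=1']);
  }
  // Queue the configuration before the page view is tracked.
  commands.push(['trackPageView']);
  return commands;
}

const _paq = (globalThis._paq = globalThis._paq || []);
for (const command of buildTrackerCommands({ disableSendBeacon: true })) {
  _paq.push(command);
}
console.log(_paq);
```

Leaving `trackBotsAsVisits` off keeps bots out of the regular reports, in line with the warning above.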

@remz-otw commented on May 17th 2021

@MichaIng I've tested just deactivating sendBeacon (_paq.push(['disableAlwaysUseSendBeacon']);) and it works fine.

@sgiehl I agree with you, but since the bots are caught by BotTracker, it shouldn't be an issue, no?

@MichaIng commented on May 17th 2021

Great, yes, it totally makes sense to not count bots by default, and the BotTracker app then has a reasonable purpose in enabling it, but outside of regular metrics, as an extra feature.

_paq.push(['disableAlwaysUseSendBeacon']); alone works as well, so we found the culprit indeed 👍.

@sgiehl commented on May 17th 2021 Member

BotTracker uses its own detection for bots as far as I know. So it might happen that Matomo would have ignored a request but BotTracker doesn't detect it as a bot. https://github.com/Thomas--F/BotTracker/issues/60 would fix this

@MichaIng commented on May 17th 2021

At least it detects bots based on the user agent, so when Matomo drops a bot with a user agent that has not been added to the BotTracker list (it can be edited/extended), then it wouldn't be tracked at all. But all important ones are part of the default list.

Okay, so while I cannot alter the invocation snippet in the Cloudflare app, I can disable sendBeacon by manually editing the local Matomo code. But indeed the question is what the downsides are.

Reading: https://developer.mozilla.org/docs/Web/API/Navigator/sendBeacon
So in short, using sendBeacon, the request is sent asynchronously to reduce the chance that the user has already navigated to another page, unloading the current one, before the analytics have been sent. That is especially an issue when the script is loaded at the end of the page view, also with the defer flag.

In my case, with Cloudflare app, the script is in the page head without defer flag, so it shouldn't be an issue in this particular case. But usually you don't want such scripts to defer the visual/functional page load, where the issue is more apparent. Workarounds are listed in the docs above.
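
To make the trade-off concrete, here is a simplified sketch of how a tracker might choose between sendBeacon and an image request. It is not Matomo's actual implementation: the function name, the `env` object and the option names are assumptions for illustration, and `env` is injected only so the logic can run outside a browser.

```javascript
// Sketch of transport selection, not Matomo's actual code.
// In a page, `env` would be { navigator, Image }.
function sendTrackingRequest(env, endpoint, query, options) {
  const canUseBeacon =
    options.useSendBeacon &&
    options.method === 'POST' &&
    env.navigator &&
    typeof env.navigator.sendBeacon === 'function';

  if (canUseBeacon) {
    // sendBeacon queues a POST that the browser can deliver after unload.
    const queued = env.navigator.sendBeacon(endpoint, query);
    if (queued) return 'beacon';
  }
  // Fallback: a GET request issued by loading an image.
  const image = new env.Image(1, 1);
  image.src = `${endpoint}?${query}`;
  return 'image';
}

// Fake environment that records calls instead of sending anything.
const calls = [];
const fakeEnv = {
  navigator: {
    sendBeacon: (url, data) => {
      calls.push(['beacon', url, data]);
      return true;
    },
  },
  Image: class {
    set src(value) {
      calls.push(['image', value]);
    }
  },
};
console.log(sendTrackingRequest(fakeEnv, '/matomo.php', 'idsite=1&rec=1', { useSendBeacon: true, method: 'POST' }));
console.log(sendTrackingRequest(fakeEnv, '/matomo.php', 'idsite=1&rec=1', { useSendBeacon: true, method: 'GET' }));
```

In this sketch, forcing GET or switching sendBeacon off always ends in the image fallback, which is the request type that the crawlers in this thread were observed to perform.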

@MichaIng commented on May 17th 2021

My personal workaround for now: In piwik.js, replace if(da==="GET"){this.disableAlwaysUseSendBeacon()} with this.disableAlwaysUseSendBeacon().

It would actually be best if this was only done for bots. I had a look into the BotTracker source code to see if there is a way to set it there, but there is nothing that would affect those flags prior to the actual tracking request. The CustomJsTracker plugin should generally enable that, but I guess again this cannot be done based on user agent, as the user agent is derived AFTER the tracker request has been done? Chicken and egg it seems to me 😄.
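
One conceivable way around this is to check the user agent in the browser before the first tracking request is queued. This is an untested sketch: it assumes the crawler exposes its name via navigator.userAgent and that the tracking snippet can be edited. The function name is hypothetical, and the pattern is a small illustrative subset, not the BotTracker or device-detector list.

```javascript
// Hypothetical client-side check with an illustrative pattern only.
const CRAWLER_PATTERN = /Googlebot|bingbot/i;

function configureForCrawlers(paq, userAgent) {
  if (CRAWLER_PATTERN.test(userAgent)) {
    // Only crawlers lose sendBeacon; regular visitors keep the default.
    paq.push(['disableAlwaysUseSendBeacon']);
    return true;
  }
  return false;
}

// Demo with the Googlebot user agent from the access log above.
const demoQueue = [];
const googlebotAgent =
  'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/90.0.4430.97 Safari/537.36';
console.log(configureForCrawlers(demoQueue, googlebotAgent), demoQueue);
```

In a page, this would be called as configureForCrawlers(window._paq, navigator.userAgent) before trackPageView is pushed.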

@remz-otw commented on May 17th 2021

I tested removing bots=1 from the request, and Googlebot is still tracked by BotTracker.
This means that BotTracker catches the info before Matomo drops it, which is a good thing.

@MichaIng commented on May 17th 2021

This is how I understand it:

  • When using bots=1, Matomo core tracks bots, but like regular users, which is not what admins want in most cases.
  • BotTracker seems to catch exactly the cases which are (by default) excluded by Matomo, by registering to the Tracker.isExcludedVisit event: https://github.com/Thomas--F/BotTracker/blob/master/BotTracker.php#L93
    It compares the user agent of these cases against the list and stores the visit in its own database tables, if the agent matches one of the list.
  • If BotTracker worked in your cases with bots=1, then I guess the event is triggered regardless and bots were then tracked as regular user as well as in the BotTracker counter.

@remz-otw commented on May 17th 2021

You're right. I realized that when bots=1 was activated, Googlebot was still caught by BotTracker, but Lighthouse and the mobile-friendly test tool were seen as regular users with no information (probably caught by the <img>).
Now with bots=0 (default), only Googlebot is caught by BotTracker.
This means bots=1 makes no difference for BotTracker.

BotTracker is a bit buggy and not regularly maintained. Since Matomo maintains a list of robots, it should be easy to merge this module into the core, no?

@MichaIng commented on May 17th 2021

Where is this list, actually? I see the excludedUserAgent variable used but nowhere defined.

@remz-otw commented on May 17th 2021

I don't know; it was referred to in https://github.com/Thomas--F/BotTracker/issues/60

@MichaIng commented on May 17th 2021

Ah, it's a separate repository: https://github.com/matomo-org/device-detector/blob/master/regexes/bots.yml
Also I like the idea to make use of that device detector more natively, being able to get all info of the bot: https://github.com/Findus23/plugin-BotTracking
Being able to view them via the Visits per local time diagram etc. would definitely be great. But I'm happy for now that all bot visits are tracked in some way.
