Many bots fail to call the tracker PHP endpoints #17572

Closed
MichaIng opened this issue May 17, 2021 · 32 comments

@MichaIng
Contributor

Now that #17497 has been done, there is another issue that most bots, especially the important Google and Bing bots, fail to call the tracker PHP endpoints matomo.php/piwik.php, while generally robots.txt and headers do allow it and some bots are successfully tracked.

Matomo, PHP and webserver do not report any errors, so the request seems to fail right at the crawler, probably related to a too long query string or invalid characters? Example URL shown by Google mobile-friendly test:

https://our.domain.com/matomo/piwik.php?action_name=<site_title>&idsite=1&rec=1&r=525912&h=7&m=3&s=1&url=<site_url>%2F&_id=d0bed9fe69afde68&_idn=1&_refts=0&send_image=0&cookie=1&res=412x732&pv_id=DAfyXy&pf_net=0&pf_srv=2&pf_tfr=0&pf_dm1=20
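To check a tracking request like this for length or odd characters, the query string can be decoded with the standard URL API. This is only a minimal sketch: the URL below is a shortened stand-in with made-up values (example.com, "Home"), not the real request.

```javascript
// Decode the parameters of a tracking request for inspection.
// The URL is a shortened, made-up stand-in for the one shown above.
const trackingUrl =
  'https://example.com/matomo/piwik.php?action_name=Home&idsite=1&rec=1' +
  '&url=https%3A%2F%2Fexample.com%2F&send_image=0&res=412x732';

const parsed = new URL(trackingUrl);
const params = Object.fromEntries(parsed.searchParams);

console.log(parsed.pathname);     // endpoint that is called
console.log(params);              // decoded key/value pairs
console.log(trackingUrl.length);  // total URL length in characters
```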

Further details:

Expected Behavior

All bots that crawl the website are tracked and visible in Matomo.

Current Behavior

Many bots fail to call the tracker PHP endpoints and hence do not appear in Matomo.

Possible Solution

None found so far.

Steps to Reproduce (for Bugs)

  1. Upgrade to latest Matomo v4.3 release candidate
  2. Install BotTracker app
  3. Watch AhrefsBot, Baiduspider, YandexBot and Wget being tracked
  4. No other bot is tracked, despite respective accesses.
  5. For Google crawler, the mobile-friendly test reveals that the crawler is trying to perform the correct request (URL) but fails to do so with "Other error". Manually calling the same URL succeeds and shows a related Matomo visit/page view.

Context

The behaviour of search engine crawlers, which pages they crawl, how often etc, can be important to detect issues, optimise crawler settings/schedules, etc to balance server load and optimise SEO.

Your Environment

  • Matomo Version: 4.3.0-rc2
  • PHP Version: 8.0.5
  • Server Operating System: Debian Bullseye
  • Additionally installed plugins:
API, Actions, Annotations, BotTracker 2.01, BulkTracking, Contents, CoreAdminHome, CoreConsole, CoreHome, CorePluginsAdmin, CoreUpdater, CoreVisualizations, CustomJsTracker, DBStats, DarkTheme 1.1.6, Dashboard, DevicePlugins, DevicesDetection, Diagnostics, Goals, Heartbeat, ImageGraph, Insights, Installation, Intl, LanguagesManager, Live, LogViewer 4.0.1, Login, Marketplace, Monolog, Morpheus, Overlay, PagePerformance, PrivacyManager, Proxy, Referrers, Resolution, SEO, SegmentEditor, SitesManager, Transitions, UserLanguage, UsersManager, VisitFrequency, VisitTime, VisitorInterest, VisitsSummary, WebsiteMeasurable
@MichaIng added the Potential Bug label May 17, 2021
@MichaIng
Contributor Author

@remz-otw
I don't use Drupal. So while it looks like it's related to Matomo v4, we are not 100% certain. As you have access logs, can you verify that the matomo.js request is done by the Google bot, but the expected following matomo.php request is not done at all (does not reach the server)?

@sgiehl
Member

sgiehl commented May 17, 2021

@MichaIng Most bots are not able to load the JavaScript. Google might ignore it to avoid unwanted tracking requests.
If you want to fully track all bots, you should not use javascript tracking. It would be better to use log importing for this.

@remz-otw

@sgiehl I agree, but it is reported as an error, which is not a good thing from an SEO point of view.
I also added an img tag:
<noscript><p><img src="https://mymatomodomain.com/matomo.php?idsite=1&amp;rec=1&amp;bots=1" style="border:0;" alt="" /></p></noscript>
And still no bot is tracked.

@MichaIng
Contributor Author

MichaIng commented May 17, 2021

Google and Bing both load the JavaScript for sure, otherwise the matomo.php/piwik.php request with fully developed query string (see above) would not be done in the first place. So that is not the issue.

Probably skipping JavaScript would help regardless, but I use the Cloudflare app to enable tracking on all our pages and there is currently no way to choose. Out of interest, what would the HTML code look like to have PHP tracking done without the JavaScript?

Okay, I see the possibility of tracking via image source, although wrapping it into <noscript> means that the important bots do not process it, as they do load JavaScript. Since the long query string that is built by the JavaScript is missing when using the image source method, is all client info still included with this method?

@sgiehl
Member

sgiehl commented May 17, 2021

Does a request from Google bot to the PHP endpoint reach your server? Just want to make sure they are actually performed by the bots. If they are listed in your access log, then it might be an issue with Matomo or maybe the BotTracker plugin. For me it sounds like Google does not perform the request for any reason. If that is the case you could try to play around with some settings that change the request, like disabling sendBeacon or forcing a POST request maybe.

@remz-otw

remz-otw commented May 17, 2021

@MichaIng I went through my log and found an example of Googlebot requesting matomo.js but not matomo.php:
666.249.70.23 mymatomodomain.com - [16/May/2021:18:46:21 +0200] "GET /matomo.js HTTP/1.1" 200 20456 "https://mymatomodomain.com/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/90.0.4430.97 Safari/537.36"

EDIT: I didn't find any request for matomo.php from Googlebot
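A check like this can be scripted. The sketch below counts which Matomo files a crawler requested, assuming access log lines in a format similar to the one quoted above; `summarizeBotRequests` and the two sample lines (with documentation-range IPs) are hypothetical.

```javascript
// Count which Matomo resources a crawler requested, based on access log
// lines in a combined-style format. The sample lines are made up.
function summarizeBotRequests(logLines, botPattern) {
  const seen = { 'matomo.js': 0, 'matomo.php': 0, 'piwik.js': 0, 'piwik.php': 0 };
  for (const line of logLines) {
    if (!botPattern.test(line)) continue; // the user agent is part of the line
    const match = line.match(/"(?:GET|POST) ([^ ?"]+)/); // request path without query
    if (!match) continue;
    const file = match[1].split('/').pop();
    if (Object.prototype.hasOwnProperty.call(seen, file)) seen[file] += 1;
  }
  return seen;
}

const sampleLines = [
  '192.0.2.10 example.com - [16/May/2021:18:46:21 +0200] "GET /matomo.js HTTP/1.1" 200 20456 "https://example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
  '192.0.2.20 example.com - [16/May/2021:18:47:02 +0200] "GET /matomo.php?idsite=1&rec=1 HTTP/1.1" 204 0 "https://example.com/" "Mozilla/5.0 (X11; Linux x86_64) Firefox/88.0"',
];

const summary = summarizeBotRequests(sampleLines, /Googlebot/);
console.log(summary);
```

A matomo.js count above zero combined with a matomo.php count of zero would match the behaviour described here: the script is fetched, but the tracking request never reaches the server.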

@MichaIng
Contributor Author

MichaIng commented May 17, 2021

Okay, so we need to find out why Google and most other bots do not perform this final PHP request then. I wonder if it's done when using an image-based request without wrapping it into <noscript>, like:

<img src="https://domain.com/matomo.php?idsite=1&amp;rec=1" style="border:0;" alt="" />

I'll create a test page with this and run it through the mobile-friendly test.

@remz-otw

remz-otw commented May 17, 2021

OK, I added _paq.push(['setRequestMethod', 'GET']); and removed the noscript tag, and now Search Console loads both resources correctly.

But bots are still not tracked.
EDIT: I was wrong, bots are tracked. Woohoo!

@remz-otw

remz-otw commented May 17, 2021

Now when I use the "mobile-friendly test tool" or the "URL inspection tool" from Search Console, and even Lighthouse, the request gets caught by Matomo.
For Lighthouse and the mobile-friendly test tool, BotTracker doesn't catch it; the visit is marked "operating system: bot".

@MichaIng
Contributor Author

MichaIng commented May 17, 2021

Jep, using the raw image source tracker works here as well (mobile-friendly test, done three times, tracked three times):

GoogleBot Googlebot 3 2021-05-17 15:53:24

So I suspect the long query string to be the issue, or possibly that the request originates from JavaScript?
EDIT: Used the same URL above that was generated by the JavaScript and fails there, and added it as image source: Now the Google bot performs the request and gets tracked.
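For such a test, the image source URL can also be assembled by script. A minimal sketch, assuming the placeholder endpoint https://domain.com/matomo.php from above; `buildPixelUrl` is a hypothetical helper, and only parameters that appear in the request shown earlier are used.

```javascript
// Build a tracking URL that can be used as an image source.
// buildPixelUrl is a hypothetical helper; the endpoint is a placeholder.
function buildPixelUrl(endpoint, params) {
  const url = new URL(endpoint);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value)); // values are URL-encoded here
  }
  return url.toString();
}

const pixelUrl = buildPixelUrl('https://domain.com/matomo.php', {
  idsite: 1,
  rec: 1,
  url: 'https://domain.com/',
  action_name: 'Test page',
});
console.log(pixelUrl);

// In a browser, assigning the URL to an image triggers a plain GET request:
// new Image().src = pixelUrl;
```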

@remz-otw

I don't think it is caused by the long query string, because it works now with the long query string.
In my case the only change in the request is &send_image=1, whereas before it was set to 0.

@MichaIng
Contributor Author

Jep, I just tested it as well and the exact same URL that fails from JavaScript succeeds from the image.

@sgiehl
Member

sgiehl commented May 17, 2021

Matomo uses sendBeacon by default (if the browser supports it). Maybe such requests are blocked.

@remz-otw

remz-otw commented May 17, 2021

Did you try changing the JavaScript code and adding
_paq.push(['setRequestMethod', 'GET']);
It works for me.
But I don't know what that involves.
Since then, the request is sent with send_image=1.

@remz-otw

@MichaIng Maybe try the JavaScript request with send_image=1 to test whether it is related.

@sgiehl
Member

sgiehl commented May 17, 2021

_paq.push(['setRequestMethod', 'GET']);

Forcing GET as request method automatically disables the use of sendBeacon as it uses POST only.
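A sketch of where that call might sit in a page's tracking code. The tracker URL and site ID are placeholders, and `globalThis` is used instead of `window` only so the fragment also runs outside a browser.

```javascript
// Commands are queued in _paq and processed once the tracker script has loaded.
var _paq = (globalThis._paq = globalThis._paq || []);

// Force GET; as noted above, this also turns off sendBeacon (POST only).
_paq.push(['setRequestMethod', 'GET']);

_paq.push(['setTrackerUrl', 'https://domain.com/matomo.php']); // placeholder URL
_paq.push(['setSiteId', '1']);                                   // placeholder site ID
_paq.push(['trackPageView']);
```

The configuration call is queued before `trackPageView` so that it already applies to the first tracking request.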

@MichaIng
Contributor Author

Btw, if the BotTracker app is disabled, are bots supposed to be tracked as regular visitors, or how are they dropped? I just disabled it and the mobile-friendly test still reports the error and I cannot see a related visitor log. So I guess we can rule out the app as the culprit.

But I see you're already further with GET/sendBeacon 👍.

Since I use the Cloudflare app, I cannot change the script snippet and cannot add parameters to the request for production. But I'll test it as additional code in the test page.

@remz-otw

remz-otw commented May 17, 2021

So now it works with sendBeacon disabled, but what is the drawback of such a config? If it is on by default in Matomo 4, there should be a good reason for it. Less accurate tracking, or a worse user experience?

@MichaIng
Contributor Author

I can verify that it works with _paq.push(['setRequestMethod', 'GET']);. Is there a way to keep using POST requests but disable sendBeacon only, to check whether the problem is POST requests in general rather than sendBeacon in particular?

@sgiehl
Member

sgiehl commented May 17, 2021

Matomo by default discards all visitors that are recognized as bots here:

protected function isNonHumanBot()
{
    $allowBots = $this->request->getParam('bots');
    $deviceDetector = StaticContainer::get(DeviceDetectorFactory::class)->makeInstance($this->userAgent);
    return !$allowBots
        && ($deviceDetector->isBot() || $this->isIpInRange());
}

So when the user agent is detected as a bot or the IP is within certain ranges, the request is sorted out unless the parameter bots=1 is set.
You can achieve that in the tracking code by adding
_paq.push(['appendToTrackingUrl', 'bots=1']);

But to be clear here: Tracking bots within Matomo actually will ruin most of the reports. For most websites it's relevant to know how many visitors a website has and how they perform. Counting bots might explode those numbers and make them meaningless or unreliable. In most cases you may not even be able to distinguish between normal visits and bots within Matomo.

To only disable sendBeacon you could use:
_paq.push(['disableAlwaysUseSendBeacon']);
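The exclusion rule quoted above can be restated as a small JavaScript model. This is a simplified sketch, not Matomo's code: the detection results are passed in as plain booleans, `allowsBots` is a hypothetical helper, and treating only the literal value `1` as enabled is an assumption.

```javascript
// Simplified model of the check quoted above: a request is dropped as a
// non-human bot unless bots=1 is passed. Detection results are inputs here.
function allowsBots(requestUrl) {
  // Assumption: only the literal value "1" enables bot tracking.
  return new URL(requestUrl).searchParams.get('bots') === '1';
}

function isNonHumanBot({ allowBots, detectedAsBot, ipInBotRange }) {
  return !allowBots && (detectedAsBot || ipInBotRange);
}

const plain = 'https://domain.com/matomo.php?idsite=1&rec=1'; // placeholder URL
const withBots = plain + '&bots=1';

const droppedByDefault = isNonHumanBot({
  allowBots: allowsBots(plain), detectedAsBot: true, ipInBotRange: false,
});
const droppedWithBotsParam = isNonHumanBot({
  allowBots: allowsBots(withBots), detectedAsBot: true, ipInBotRange: false,
});
console.log(droppedByDefault);     // true: the visit is discarded
console.log(droppedWithBotsParam); // false: the visit is kept
```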

@remz-otw

remz-otw commented May 17, 2021

@MichaIng I've tested just deactivating sendBeacon (_paq.push(['disableAlwaysUseSendBeacon']);) and it works fine.

@sgiehl I agree with you, but since it is caught by BotTracker it shouldn't be an issue, should it?

@MichaIng
Contributor Author

Great, yes, it totally makes sense to not count bots by default, and the BotTracker app then has a reasonable purpose: to enable it, but outside of regular metrics, as an extra feature.

_paq.push(['disableAlwaysUseSendBeacon']); alone works as well, so we found the culprit indeed 👍.

@sgiehl
Member

sgiehl commented May 17, 2021

BotTracker uses its own detection for bots as far as I know. So it might happen that Matomo would have ignored a request but BotTracker doesn't detect it as a bot. https://github.com/Thomas--F/BotTracker/issues/60 would fix this

@MichaIng
Contributor Author

At least it detects bots based on the user agent, so when Matomo drops a bot with a user agent that has not been added to the BotTracker list (it can be edited/extended), then it wouldn't be tracked at all. But all important ones are part of the default list.

Okay, so while I cannot alter the invocation snippet in the Cloudflare app, I can disable sendBeacon by manually editing the local Matomo code. But indeed the question is what the downsides are.

Reading: https://developer.mozilla.org/docs/Web/API/Navigator/sendBeacon
So in short, using sendBeacon, the request is sent asynchronously to reduce the chance that the user has already navigated to another page, unloading the current one, before the analytics have been sent. That is especially an issue when the script is loaded at the end of the page view, also with the defer flag.

In my case, with Cloudflare app, the script is in the page head without defer flag, so it shouldn't be an issue in this particular case. But usually you don't want such scripts to defer the visual/functional page load, where the issue is more apparent. Workarounds are listed in the docs above.
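To illustrate the general pattern (this is not Matomo's implementation), here is a sketch that prefers `navigator.sendBeacon` and falls back to `fetch` with `keepalive` when the beacon is unavailable or refused. `sendAnalytics` is a hypothetical helper; `nav` and `fetchFn` are injected so the sketch can also run outside a browser.

```javascript
// Send analytics data with sendBeacon when available, else fall back to fetch.
// nav and fetchFn are injectable so the function can be exercised with mocks.
function sendAnalytics(url, payload, nav = globalThis.navigator, fetchFn = globalThis.fetch) {
  if (nav && typeof nav.sendBeacon === 'function') {
    // The browser queues the request; false means it could not be queued.
    if (nav.sendBeacon(url, payload)) return 'beacon';
  }
  // keepalive asks the browser to complete the request even if the page unloads.
  Promise.resolve(fetchFn(url, { method: 'POST', body: payload, keepalive: true }))
    .catch(() => {}); // fire and forget: analytics failures are ignored
  return 'fetch';
}
```

In a page this would be called as `sendAnalytics('https://domain.com/matomo.php', 'idsite=1&rec=1')`, with the URL again being a placeholder.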

@MichaIng
Contributor Author

My personal workaround for now: In piwik.js, replace if(da==="GET"){this.disableAlwaysUseSendBeacon()} with this.disableAlwaysUseSendBeacon().

Best would actually be if this was only done for bots. I had a look into the BotTracker source code to see if there is a way to set it there, but there is nothing that would affect those flags prior to the actual tracking request. The CustomJsTracker plugin should generally enable that, but I guess again this cannot be done based on user agent, as the user agent is derived AFTER the tracker request has been done? Chicken and egg it seems to me 😄.
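One possible way around this, if the tracking code can be edited and the crawler identifies itself in `navigator.userAgent`, would be to decide client-side before the first request. This is an untested idea, not something confirmed in this thread: `configureForUserAgent` is a hypothetical helper and the pattern is a rough illustration, not a complete bot list.

```javascript
// Hypothetical helper: disable sendBeacon only for user agents that look
// like crawlers. The pattern is a rough illustration, not a complete list.
const BOT_PATTERN = /bot|crawl|spider/i;

function configureForUserAgent(paq, userAgent) {
  if (BOT_PATTERN.test(userAgent)) {
    paq.push(['disableAlwaysUseSendBeacon']);
  }
  return paq;
}

// In a page, before trackPageView is queued:
// configureForUserAgent(_paq, navigator.userAgent);
```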

@remz-otw

remz-otw commented May 17, 2021

I've tested removing bots=1 from the request and Googlebot is still tracked by BotTracker.
This means that BotTracker catches the info before Matomo drops it, which is a good thing.

@MichaIng
Contributor Author

This is how I understand it:

  • When using bots=1, Matomo core tracks bots, but like regular users, which is not what admins want in most cases.
  • BotTracker seems to catch exactly the cases which are (by default) excluded by Matomo, by registering to the Tracker.isExcludedVisit event: https://github.com/Thomas--F/BotTracker/blob/master/BotTracker.php#L93
    It compares the user agent of these cases against the list and stores the visit in its own database tables, if the agent matches one of the list.
  • If BotTracker worked in your cases with bots=1, then I guess the event is triggered regardless and bots were then tracked as regular user as well as in the BotTracker counter.
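The understanding in the bullets above can be written down as a small model. It is based only on that description, not on BotTracker's source: `processRequest` and `botList` are hypothetical, the list is reduced to two entries, and running the list match regardless of bots=1 follows the guess in the last bullet.

```javascript
// Simplified model of the flow described above; not BotTracker's actual code.
const botList = [
  { name: 'Googlebot', keyword: 'Googlebot' },
  { name: 'bingbot', keyword: 'bingbot' },
];

function processRequest({ userAgent, allowBots, coreDetectsBot }) {
  // Core drops detected bots unless bots=1 was sent.
  const excludedByCore = !allowBots && coreDetectsBot;
  // Assumption from the last bullet: the list match runs in either case.
  const match = botList.find((bot) => userAgent.includes(bot.keyword));
  return {
    trackedAsRegularVisit: !excludedByCore,
    countedByBotTracker: match ? match.name : null,
  };
}

const googlebotUa = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const byDefault = processRequest({ userAgent: googlebotUa, allowBots: false, coreDetectsBot: true });
const withBotsParam = processRequest({ userAgent: googlebotUa, allowBots: true, coreDetectsBot: true });
console.log(byDefault);
console.log(withBotsParam);
```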

@remz-otw

You're right. I realize that when bots=1 was activated, Googlebot was still caught by BotTracker, but Lighthouse and the mobile-friendly test tool were seen as regular users with no information (probably caught by the <img>).
Now with bots=0 (default) only Googlebot is caught by BotTracker.
This means bots=1 makes no difference for BotTracker.

BotTracker is a bit buggy and not regularly maintained. Since Matomo maintains a list of robots, it should be easy to merge this module into the core, shouldn't it?

@MichaIng
Contributor Author

Where is this list, actually? I see the excludedUserAgent variable used but nowhere defined.

@remz-otw

I don't know; it was referred to in https://github.com/Thomas--F/BotTracker/issues/60

@MichaIng
Contributor Author

Ah, it's a separate repository: https://github.com/matomo-org/device-detector/blob/master/regexes/bots.yml
Also I like the idea to make use of that device detector more natively, being able to get all info of the bot: https://github.com/Findus23/plugin-BotTracking
Being able to view them via Visits per local time diagram etc would be definitely great. But I'm happy for now that all bot visits are tracked in a way.

@justinvelluppillai
Contributor

@MichaIng I will close this issue now as it seems there's nothing left to do here. Please let me know if you are still otherwise waiting on anything.

@justinvelluppillai closed this as not planned Nov 14, 2022
@justinvelluppillai added the answered label and removed the Potential Bug label Nov 14, 2022