@MichaIng opened this Issue on September 8th 2021

Expected Behavior

Bots are filtered for the Matomo statistics and reports.

Current Behavior

We recently noticed a massive increase (several thousand percent) in visits to two pages in particular, which is clearly related to bot activity. I'm not 100% sure which bot it is (this could be found out), but the access logs show several user agents with significant request counts sending requests to the tracker (implying they run the JavaScript tracking code).

Possible Solution

Google and Bing bots are already excluded, according to an old commit, so it should be trivial to extend the list of user agents that are not tracked. If someone could give me a hint where in the code this is done, or which third-party library is used for it, I'd be glad to compare the list with our access logs and extend it accordingly.

Steps to Reproduce (for Bugs)

Hard to say, since bot activity depends heavily on the website, the search console/webmaster tools used, backlinks, the software used, etc. Whether it is actually visible depends on whether the bot behaves very differently from a common user.

Context

At the very least, the tracked bots distort the statistics of two pages, and they have probably done so, or still do, in other cases where the effect is less significant. For trustworthy statistics, it would be good to reliably ignore bots or handle them separately.

Your Environment

  • Matomo Version: 4.5.0 beta (issue started with 4.4.x already)
  • PHP Version: 8.0.10
  • Server Operating System: Debian Bookworm
  • Additionally installed plugins:
    #### Plugins Activated:
    API, Actions, Annotations, BulkTracking, CoreAdminHome, CoreConsole, CoreHome, CorePluginsAdmin, CoreUpdater, CoreVisualizations, DBStats, DarkTheme 1.1.6, Dashboard, DevicePlugins, DevicesDetection, Diagnostics, Goals, ImageGraph, Insights, Installation, Intl, LanguagesManager, Live, LogViewer 4.0.1, Login, Marketplace, Monolog, Morpheus, PagePerformance, PrivacyManager, Proxy, Referrers, Resolution, SEO, SegmentEditor, SitesManager, Transitions, UserLanguage, UsersManager, VisitFrequency, VisitTime, VisitorInterest, VisitsSummary, WebsiteMeasurable
@Findus23 commented on September 8th 2021 Member
@MichaIng commented on September 8th 2021

Nice, worth giving it a shot. I see there is this "Referrer spam" protection as part of Matomo, but I thought there was a user agent based filter as well?

EDIT: While it looks reasonable, the plugin does not really help against bots and search engine crawlers as long as they don't originate from one of the cloud provider IP ranges. The "headless browser" detection is only a small list of user agents which I have never seen before: https://github.com/matomo-org/plugin-TrackingSpamPrevention/blob/4.x-dev/BrowserDetection.php

EDIT2: Here is what I was actually looking for: https://github.com/matomo-org/device-detector/blob/master/regexes/bots.yml

@Findus23 commented on September 8th 2021 Member

EDIT2: Here is what I was actually looking for: https://github.com/matomo-org/device-detector/blob/master/regexes/bots.yml

Correct, this is what I also wanted to mention now that I have more time: by default, Matomo also ignores all data from user agents that device-detector considers bots (see e.g. https://devicedetector.lw1.at/ for an interactive version).
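
For comparing access logs against such a list, a rough sketch along the following lines might help. It assumes the web server writes the common "combined" log format and that the tracker is reachable as matomo.php or piwik.php. The function names and the three patterns are hypothetical placeholders for illustration only, not the actual contents of bots.yml and not how Matomo or device-detector does the matching internally.

```python
import re
from collections import Counter

# Assumption: "combined" access log format, i.e.
# ... "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical stand-ins for bot regexes; replace with the patterns you want to test.
BOT_PATTERNS = [
    re.compile(p, re.IGNORECASE) for p in (r"Googlebot", r"bingbot", r"AhrefsBot")
]


def tracker_agents(lines, tracker_paths=("/matomo.php", "/piwik.php")):
    """Count requests to the tracker endpoint per user agent string."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        path = match.group("path").split("?", 1)[0]  # drop the query string
        if path.endswith(tracker_paths):
            counts[match.group("agent")] += 1
    return counts


def unmatched_agents(counts):
    """Return (agent, hits) pairs, most frequent first, that no bot pattern matches."""
    return [
        (agent, hits)
        for agent, hits in counts.most_common()
        if not any(p.search(agent) for p in BOT_PATTERNS)
    ]
```

Feeding the lines of an access log through `tracker_agents` and then `unmatched_agents` gives the user agents that reach the tracker most often without being matched, which are the candidates worth checking by hand.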

@MichaIng commented on September 8th 2021

I couldn't find a missing bot in the list so far 🤔. However, I'll keep looking for the faulty agent, and probably the spam protection plugin helps as well (many thanks for mentioning it).

I'll close the issue here and, if needed, open a PR at the device-detector repository.

This Issue was closed on September 8th 2021