Cache parsed device detection info to file #14751
Conversation
```php
// We use hash subdirs to prevent adding too many files to the same dir
$hashDir = PIWIK_DOCUMENT_ROOT . self::CACHE_DIR . substr($hashedUserAgent, 0, 2);
if (!file_exists($hashDir)) {
    mkdir($hashDir);
```
Be good to create the directory only when needed, meaning when writing a cache entry. And in general we have a method for this btw: `\Piwik\Filesystem::mkdir()`.

Otherwise, if we create directories for so many user agents, they would pop up in the file integrity check, and each tracking request would check whether the directory exists even when it is not needed. The `file_exists` in `isCached` will be enough 👍
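A rough sketch of what that could look like (the method name `writeToCache` and the serialization format are assumptions for illustration; the constant and variable names follow the snippets in this PR):

```php
// Hypothetical write path: create the hash subdirectory lazily, only
// when a cache entry is actually written. \Piwik\Filesystem::mkdir()
// creates directories recursively and is safe to call when the
// directory already exists, so no file_exists() guard is needed here.
public static function writeToCache($hashedUserAgent, array $parsedInfo)
{
    $hashDir = PIWIK_DOCUMENT_ROOT . self::CACHE_DIR . substr($hashedUserAgent, 0, 2);
    \Piwik\Filesystem::mkdir($hashDir);

    file_put_contents(
        $hashDir . '/' . $hashedUserAgent . '.php',
        '<?php return ' . var_export($parsedInfo, true) . ';',
        LOCK_EX
    );
}
```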
Left a comment. Seems to work otherwise though 👍
```php
$deviceDetector->setCache(new DeviceDetectorCache(86400));
$deviceDetector->parse();

if ($useFileCache && DeviceDetectorCacheEntry::isCached($userAgent)) {
    $deviceDetector = new DeviceDetectorCacheEntry($userAgent);
```
Not quite understanding what `discardBotInformation()` does, but do we need to call it as well?
```php
protected function configure()
{
    $this->setName('devicedetector:warmcache');
```
When I run the command, I get this error:

```
[UnexpectedValueException]
RecursiveDirectoryIterator::__construct(/misc/useragents): failed to open dir: No such file or directory
```

We could create the directory already in git and put a `.gitkeep` in there?
```php
protected function execute(InputInterface $input, OutputInterface $output)
{
    $this->clearCacheDirectory();
```
Can you maybe add an option for this? E.g. when core ships, say, 50 cached entries, we might want to add more entries but not remove the core ones.
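For illustration, such an option might look like this in the command (the option name `--clear-existing` is an assumption, not code from this PR; it also needs `use Symfony\Component\Console\Input\InputOption;`):

```php
// Hypothetical: only clear the cache directory when explicitly
// requested, so entries shipped with core survive a default warm-up run.
protected function configure()
{
    $this->setName('devicedetector:warmcache');
    $this->addOption(
        'clear-existing',
        null,
        InputOption::VALUE_NONE,
        'Remove existing cache entries (e.g. those shipped with core) before warming'
    );
}

protected function execute(InputInterface $input, OutputInterface $output)
{
    if ($input->getOption('clear-existing')) {
        $this->clearCacheDirectory();
    }
    // ... continue warming the cache, keeping any entries not overwritten
}
```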
@sgiehl could you have a quick look at the PR maybe as well?

@mattab what are your thoughts on delivering cached user agents with the Matomo release to make device detection faster? It would be an ongoing task though to keep the most recently used user agents in the release. And if we wanted to take big advantage of it, and have like 3-6K user agents in there, it would take quite a while to generate them on demand. This would basically need to happen when creating the release, since we wouldn't want to run a test that takes 2-3 minutes just to check that all cache entries are still up to date (2-3 minutes is for say 3K user agents, resulting in 12MB disk size, and we wouldn't want to make the release that much bigger I would say). Could cache like the top 50 entries with core? But for it to be useful we would always need to update it regularly. @sgiehl maybe you have some thoughts as well on this? On prod we would potentially cache up to 5-10K user agents.
I think I found a typo (see comment).
And maybe I am missing something, but why are we creating new files instead of using the existing caching system (which could use Redis, etc.)? Wouldn't this make things slower on setups where the disk is incredibly slow (network disk, etc.)?

In addition, as it is in misc/ and not in tmp/, this creates another directory one has to make sure is secured and not public, since it is in the webroot (but I guess in that case it doesn't really matter).
```php
{
    $matches = array();
    foreach (self::$userAgentsPatternsToIgnore as $pattern) {
        preg_match('/Amazon-Route53-Health-Check-Service[.]*/', $userAgent, $matches);
```
Should it be `$pattern` here instead of the hardcoded value?
```php
        if ($matches) {
            return false;
        }
    }
```
This block of code is kind of broken. If you extended the list in `self::$userAgentsPatternsToIgnore`, it would still try to match the Amazon pattern for each entry, as `$pattern` is not used.

But besides that, I wouldn't handle a user agent exclude in the warmup script at all (see other comment).
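The fix being described could look roughly like this (the method name `shouldCache` is made up for the sketch; the pattern list entry is the one from the PR):

```php
private static $userAgentsPatternsToIgnore = array(
    '/Amazon-Route53-Health-Check-Service[.]*/',
    // extending this list now works, because $pattern is actually used below
);

public static function shouldCache($userAgent)
{
    foreach (self::$userAgentsPatternsToIgnore as $pattern) {
        // match against $pattern instead of a hardcoded regex
        if (preg_match($pattern, $userAgent)) {
            return false;
        }
    }
    return true;
}
```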
```php
    'os' => $deviceDetector->getOs()
);

$outputPath = DeviceDetectorCacheEntry::getCachePath($userAgentStr);
```
I would suggest fully excluding detected bots from the caching, as they will be excluded from tracking in most cases anyway.

Note: bot detection is always very fast, as all other detections are skipped if a bot is detected, so there shouldn't be any reason to cache it.
@sgiehl wouldn't it still be much faster to cache it? In this case the point wasn't about excluding bots as such, but about never caching agents which may contain personal data, to prevent them being shipped with Matomo.
It would be a bit faster. But imho there isn't much sense in speeding up tracking requests that won't be tracked at all. And there are other bots that might include "personal" data, like the WordPress crawler, which includes the URL.
> And there are other bots that might include "personal" data, like the WordPress crawler, which includes the URL

Awesome, if you know of some, that'd be really helpful so we can prevent ever shipping them with Matomo. Likely won't happen anyway, but still good to document 👍

@Findus23 Re misc vs tmp: it's not in tmp because it doesn't need to be writable; it'll be shipped with Matomo. We don't want to write any cache entries at runtime: e.g. think of having 100K users, each caching 10K user agents, making it like 100K x 40MB cache (per 10K user agents) = 4TB. Also, you could send heaps of random user agents and it would create heaps of cache files. Instead we only want to cache the top X most used user agents.
I'm not sure it is possible to find a list of user agents that are common for most Matomo users, as depending on the website's target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table, and afterwards a command could calculate the most popular user agents from the user's own database and create a cache for them.
We would use like the top 30 user agents that we track. I reckon we track quite a broad mix of sites, and in general the top 30 user agents are likely somewhat used everywhere (e.g. Chrome with around 60% market share). This cache is more of a nice-to-have for regular Matomo users and only really useful for very high-traffic Matomos, where saving a few ms can make a huge difference, like when you have say 100M requests a day.
This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.
So maybe we can just add the mode where all tracked user agents are logged to a file (unless it exists already)? That way one could simply create one's own statistics (and contribute to DeviceDetector more easily). A simple config.ini.php parameter that logs to one file should be enough.
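As a sketch of that idea (the `[Tracker] log_user_agents` setting and the log path are invented for illustration; no such config option exists):

```php
// Hypothetical: when the config flag is set, append each tracked user
// agent to a plain log file for later analysis / cache warming.
$trackerConfig = \Piwik\Config::getInstance()->Tracker;

if (!empty($trackerConfig['log_user_agents'])) {
    file_put_contents(
        PIWIK_DOCUMENT_ROOT . '/tmp/logs/useragents.log',
        $userAgent . "\n",
        FILE_APPEND | LOCK_EX
    );
}
```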
Only shipping the warmed cache with Matomo won't be very effective, as it would need to be changed too frequently. E.g. whenever Google releases a minor update for Chrome, all devices automatically pulling the update will have a new user agent and the cache wouldn't help anymore. It might be more useful to provide that list/cache in a separate repo, like we do for search engines/socials or referrer spam. The list would then be shipped in vendor, but be pulled and updated weekly or so.
The issue with an updater is that then we can't store the list as a PHP file (as fetching code from the internet opens up tons of new issues). But I guess fetching a txt or json file and then writing to a php file during updating should be fine.
Ideally we would not make things too complicated, since for say 99% or more of all Matomos the device detector is not really an issue, and if it is, there's always the possibility of using queued tracking etc. Detecting an agent takes us like 5ms per request without this cache, so it is doing alright. Over time we could also think about how to make things faster in general using this cache; that would then maybe be something for the device detector library itself, so everyone who uses it could benefit from the cache if wanted. Not a priority right now though.
@katebutler small change of plans. We will put the logic into a new plugin. A few things need to be done:
```php
class DeviceDetectorFactory {
    public function makeInstance($userAgent) { // note: not a static method
        return self::getInstance($userAgent);
    }
    // the old code with getInstance and the $deviceDetectorInstances needs to stay for BC
}
```

Maybe double check the code... Then everywhere where we need it we can do ... For backwards compatibility we probably need to leave the old static method in there...
and we have a class like

```php
class DeviceDetectorCacheFactory extends DeviceDetectorFactory
{
    private $instances = array();

    public function makeInstance($userAgent)
    {
        if (isset($this->instances[$userAgent])) {
            return $this->instances[$userAgent];
        }
        if (DeviceDetectorCache::isCached($userAgent)) {
            $instance = new DeviceDetectorCache();
        } else {
            $instance = parent::makeInstance($userAgent);
        }
        $this->instances[$userAgent] = $instance;
        return $instance;
    }
}
```

It's just roughly the code to give the idea of how we can use DI here. It's fine to have the instances possibly cached in both factory classes...
```php
public static function getInstance($userAgent, $useFileCache = true)
public static function getInstance($userAgent)
{
    return (new DeviceDetectorFactory())->makeInstance($userAgent);
```
Maybe better to write `$factory = new DeviceDetectorFactory();` and then on a new line `$factory->makeInstance()`.
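I.e. (same behaviour, just split over two lines for readability):

```php
$factory = new DeviceDetectorFactory();
return $factory->makeInstance($userAgent);
```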
```php
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class WarmDeviceDetectorCache extends ConsoleCommand
```
This command you want to remove from core :)
Fixes #14585