@katebutler opened this Pull Request on August 11th 2019 Member

Fixes #14585

@tsteur commented on August 12th 2019 Member

@sgiehl could you have a quick look at the PR maybe as well?

@mattab what are your thoughts on shipping cached user agents with the Matomo release to make device detection faster? Keeping the most-used recent user agents in the release would be an ongoing task, though. And if we wanted to take full advantage of it and include something like 3-6K user agents, generating them on demand would take quite a while. The generation would basically need to happen when creating the release, since we wouldn't want to run a test that takes 2-3 minutes just to check that all cache entries are still up to date (2-3 minutes is for, say, 3K user agents, which results in about 12MB on disk; I would say we wouldn't want to make the release that much bigger). We could cache something like the top 50 entries with core? But for it to be useful we would need to update it regularly.

@sgiehl maybe you have some thoughts as well on this?

On prod we would potentially cache up to 5-10K user agents.

@tsteur commented on August 12th 2019 Member

And maybe I am missing something, but why are we creating new files instead of using the existing caching system (which could use Redis, etc.)?
Wouldn't this make things slower on setups where the disk is incredibly slow (network disk, etc.)?
In addition, as it is in misc/ and not in tmp/, this creates another directory that has to be secured and kept non-public, since it is in the webroot (but I guess in that case it doesn't really matter).

@Findus23
It's not supposed to be a traditional cache that we create on demand; instead the cache is prewarmed and shipped with Matomo. This file is still way faster to load than any of the previous device detector caches, where we're loading big caches of YAML files. Also, I would not recommend using a Redis cache with Matomo unless you have a Redis cache per server. If your disk is incredibly slow, this solution will be quite a bit faster than before.

Re misc vs tmp: it is not in tmp because it doesn't need to be writable. It'll be shipped with Matomo. We don't want to write any cache entries at runtime: e.g. think of 100K users each caching 10K user agents, which would be 100K x 40MB cache (10K user agents) = 4TB. Also, someone could send heaps of random user agents and it would create heaps of cache files. Instead we only want to cache the X most used user agents.

@Findus23 commented on August 12th 2019 Member

> It's not supposed to be a traditional cache that we create on demand but instead the cache is prewarmed and shipped with Matomo.

I'm not sure if it is possible to find a list of user agents that are common for most of Matomo users as depending on the website target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

@tsteur commented on August 12th 2019 Member

> I'm not sure if it is possible to find a list of user agents that are common for most of Matomo users as depending on the website target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

We would use something like the top 30 user agents that we track. I reckon we track quite a broad mix of sites, and in general the top 30 user agents are likely used a bit everywhere (e.g. Chrome has something like 60% market share). For regular Matomo users this cache is more of a nice-to-have; it is only really useful for very high-traffic Matomo instances where saving a few ms can make a huge difference, like when you have, say, 100M requests a day.

@tsteur commented on August 12th 2019 Member

> So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.

@Findus23 commented on August 12th 2019 Member

> This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.

So maybe we can just add a mode to Matomo where all tracked user agents are logged to a file (unless it exists already)? That way one could simply create one's own statistics (and more easily contribute to DeviceDetector). A simple config.ini.php parameter that logs to one file should be enough.
At the moment it is hard for me to say which are the most popular user agents, as I can only check the access.log, which has far more bots than would be logged by Matomo.
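
A minimal sketch of how such a log file could be turned into a "most used user agents" list. It assumes a log with exactly one user agent per line; the function name `topUserAgents` and the file name in the usage note are hypothetical and not part of Matomo.

```php
<?php
// Hypothetical helper (not a Matomo API): count user agents from a log that
// holds one user agent per line and return the $limit most frequent ones.
function topUserAgents(iterable $lines, int $limit): array
{
    $counts = [];
    foreach ($lines as $line) {
        $userAgent = trim($line);
        if ($userAgent === '') {
            continue; // skip empty lines
        }
        $counts[$userAgent] = ($counts[$userAgent] ?? 0) + 1;
    }

    // Sort by count, highest first, keeping the user agent strings as keys.
    arsort($counts);

    return array_slice($counts, 0, $limit, true);
}
```

With a real file this could be called as `topUserAgents(new SplFileObject('useragents.log'), 30)`, since `SplFileObject` iterates over the file line by line.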

@sgiehl commented on August 12th 2019 Member

Only shipping the warmed cache with Matomo won't be very effective. It would need to be changed too frequently. E.g. whenever Google releases a minor update for Chrome, all devices automatically pulling the update will have a new user agent and the cache wouldn't help anymore. It might be more useful to provide that list/cache in a separate repo, like we do for search engines/socials or referrer spam. So the list would be shipped in vendor, but be pulled and updated weekly or so.

@Findus23 commented on August 12th 2019 Member

The issue with an updater is that then we can't store the list as a PHP file (as fetching code from the internet opens up tons of new issues). But I guess fetching a txt or json file and then writing to a php file during updating should be fine.
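
A rough sketch of that idea, assuming the downloaded JSON is an object mapping each user agent to its parsed result. The function names and the data shape are hypothetical. The fetched content is only decoded as data and re-serialized with `var_export`, so no remote code is ever executed.

```php
<?php
// Hypothetical helper (not a Matomo API): convert a downloaded JSON list
// into a local PHP file that returns the entries as an array.
function writeCacheFileFromJson(string $json, string $targetFile): int
{
    $entries = json_decode($json, true);
    if (!is_array($entries)) {
        throw new InvalidArgumentException('Expected JSON mapping user agent => parsed result');
    }

    // var_export() emits a valid PHP literal with all strings escaped.
    $php = "<?php\nreturn " . var_export($entries, true) . ";\n";
    if (file_put_contents($targetFile, $php, LOCK_EX) === false) {
        throw new RuntimeException('Could not write ' . $targetFile);
    }

    return count($entries);
}

// Loading is a plain include of the generated file.
function loadCacheFile(string $file): array
{
    return include $file;
}
```

The generated file is ordinary PHP, so it should benefit from opcache where that is enabled, while the update path only ever handles JSON.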

@tsteur commented on August 12th 2019 Member

Ideally we would keep things fairly simple, since for say 99% or more of all Matomo instances the device detector is not really an issue. And if it is, there's always the possibility of using queued tracking etc. Detecting an agent takes us about 5ms per request without this cache, so it is doing alright. Over time we could for sure also think about how to make things faster in general using this cache; that would then maybe be something for the device detector library, so everyone who uses it would benefit from the cache if wanted? Not a priority right now, though.

@tsteur commented on August 12th 2019 Member

@katebutler small change of plans. We will put the logic into a new plugin, eg DeviceDetectorCache.

Few things need to be done:

  • Make sure we use dependency injection... It would be better to have the device detector factory like this:

    class DeviceDetectorFactory {

        public function makeInstance($userAgent) { // note: not a static method
            return self::getInstance($userAgent);
        }

        // the old code with getInstance and the $deviceDetectorInstances needs to stay for BC
    }

    Maybe double check the code... Then everywhere where we need it we can do StaticContainer::get(DeviceDetectorFactory::class)->makeInstance($userAgent).

For backwards compatibility we probably need to leave the old static method in there...

  • Generate a new plugin using the generate:plugin command (or similar)
  • I can create a new repo once we have decided on the name, e.g. matomo-org/plugin-DeviceDetectorCache
  • We'll have a config/config.php where we overwrite the factory like this return array(DeviceDetectorFactory::class => DI\object(DeviceDetectorCacheFactory::class));

and we have a class like

    class DeviceDetectorCacheFactory extends DeviceDetectorFactory {
        private $instances = array();

        public function makeInstance($userAgent) {
            if (isset($this->instances[$userAgent])) {
                return $this->instances[$userAgent];
            }
            if (DeviceDetectorCache::isCached($userAgent)) {
                $instance = new DeviceDetectorCache();
            } else {
                $instance = parent::makeInstance($userAgent);
            }
            $this->instances[$userAgent] = $instance;
            return $instance;
        }
    }

This is just rough code to give an idea of how we can use DI here. It's fine to have the instances possibly cached in both factory classes...

  • We will store the cached entries in the plugin eg plugins/DeviceDetectorCache/useragents

  • Not sure yet if we publish it on the marketplace but likely we will

  • I will regularly update this plugin with say 10K user agents. So the plugin will later be quite big.