Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache parsed device detection info to file #14751

Merged
merged 14 commits into from Aug 25, 2019
Merged

Cache parsed device detection info to file #14751

merged 14 commits into from Aug 25, 2019

Conversation

katebutler
Copy link

Fixes #14585

@katebutler katebutler added the Needs Review PRs that need a code review label Aug 11, 2019
@katebutler katebutler added this to the 3.12.0 milestone Aug 11, 2019
// We use hash subdirs to prevent adding too many files to the same dir
$hashDir = PIWIK_DOCUMENT_ROOT . self::CACHE_DIR . substr($hashedUserAgent, 0, 2);
if (!file_exists($hashDir)) {
mkdir($hashDir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be good to create the directory only when needed, meaning when writing a cache entry. And in general we have a method for this btw: \Piwik\Filesystem::mkdir().

Otherwise if we create directories for so many user agents it would pop up in the file integrity check and on each tracking request would check if the directory exists even when it is not needed. The file_exists in isCached will be enough 👍

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment. Seems to work otherwise though 👍

$deviceDetector->setCache(new DeviceDetectorCache(86400));
$deviceDetector->parse();
if ($useFileCache && DeviceDetectorCacheEntry::isCached($userAgent)) {
$deviceDetector = new DeviceDetectorCacheEntry($userAgent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite understanding what discardBotInformation() does but do we need to call it as well?


protected function configure()
{
$this->setName('devicedetector:warmcache');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I run the command, I get this error:

 [UnexpectedValueException]                                                                                                                                    
  RecursiveDirectoryIterator::__construct(/misc/useragents): failed to open dir: No such file or directory  

We could create the directory already in git and put a .gitkeep in there?


protected function execute(InputInterface $input, OutputInterface $output)
{
$this->clearCacheDirectory();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an option for this maybe? Eg when core ships say 50 cached entries, then we might want to add more entries but not remove the core entries

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

@sgiehl could you have a quick look at the PR maybe as well?

@mattab what are your thoughts on delivering cached user agents with the Matomo release to make device detection faster? It would be an ongoing task though to have the recent most used user agents in the release. And if we wanted to take big advantage of it, and have like 3-6K user agents in there, then it would take quite a while to generate them on demand. This would basically need to happen when creating the release then since we wouldn't want to run a test that takes like 2-3 minutes just to check that all cache entries are still up to date (2-3 minutes is for say 3K user agents resulting in 12MB disk size => we wouldnt want to make the release size that much bigger I would say). Could cache like the top 50 entries with core? But for it to be useful we would always need to update it regularly.

@sgiehl maybe you have some thoughts as well on this?

On prod we would potentially cache up to 5-10K user agents.

Copy link
Member

@Findus23 Findus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a typo (see comment).

And maybe I am missing something, but why are we creating new files instead of using the existing caching system (that could use Redis, etc.).
Wouldn't this make things slower on setups where the disk is incredibly slow (network disk, etc.)?
In addition, as it is in misc/ and not in tmp/ this creates another directory one has to make sure is secured and not public as it is in the webroot (but I guess in that case it doesn't really matter)

{
$matches = array();
foreach (self::$userAgentsPatternsToIgnore as $pattern) {
preg_match('/Amazon-Route53-Health-Check-Service[.]*/', $userAgent, $matches);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be $pattern here instead of the hardcoded value?

if ($matches) {
return false;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code is kind of broken. If you would extend the list in self::$userAgentsPatternsToIgnore it would still try to match the amazon one for each entry as $pattern is not used.

But besides of that I wouldn't handle a useragent exclude in the warmup script at all (see other comment)

'os' => $deviceDetector->getOs()
);

$outputPath = DeviceDetectorCacheEntry::getCachePath($userAgentStr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to fully exclude detected bots from the caching as they will be excluded from tracking in most cases anyway.

Note: Bot detection is always very fast, as all other detections are skipped if a bot is detected, so there shouldn't be any reason for caching it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiehl wouldn't it be still much faster to cache it? In this case it wasn't meant about excluding bots or so but actually to prevent ever caching agents which may contain personal data to prevent them being shipped with Matomo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a bit faster. But ihmo there isn't much sense in speeding up tracking requests that won't be tracked at all. And there are other bots that might include "personal" data, like wordpress crawler, which includes the URL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there are other bots that might include "personal" data, like wordpress crawler, which includes the URL

Awesome, if you know of some that be really helpful so we can prevent shipping them with Matomo ever. Likely won't happen anyway but still good to document 👍

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

And maybe I am missing something, but why are we creating new files instead of using the existing caching system (that could use Redis, etc.).
Wouldn't this make things slower on setups where the disk is incredibly slow (network disk, etc.)?
In addition, as it is in misc/ and not in tmp/ this creates another directory one has to make sure is secured and not public as it is in the webroot (but I guess in that case it doesn't really matter)

@Findus23
It's not supposed to be a traditional cache that we create on demand but instead the cache is prewarmed and shipped with Matomo. This file is still way faster to load than any of the previous device detector caches where we're loading big caches of yaml files. Also I would not recommend using Redis cache with Matomo unless you have a redis cache per server. If your disk is incredibly slow, this solution will be quite a bit faster than before.

Re misc vs tmp: because it's not needed to be writable it is not in tmp. It'll be shipped with Matomo. We don't want to write any cache entries eg think of having 100K users with each caching 10K user agents making it like 100K x 40MB cache (10K user agents) = 4TB. Also you could send heaps of random user agents and it would create heaps of cache files. Instead we only want to cache the most X used user agents.

@Findus23
Copy link
Member

It's not supposed to be a traditional cache that we create on demand but instead the cache is prewarmed and shipped with Matomo.

I'm not sure if it is possible to find a list of user agents that are common for most of Matomo users as depending on the website target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

I'm not sure if it is possible to find a list of user agents that are common for most of Matomo users as depending on the website target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

We would use like the top 30 user agents that we track. I reckon we track quite a broad mix of sites and in general the top 30 user agents are likely a bit used everywhere (eg Chrome like 60% marketshare). This cache for regular Matomo users is more like a nice to have and only really useful for high high traffic Matomo's where saving a few ms can make a huge difference. Like when you have say 100M requests a day.

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

So maybe we should add a mode where Matomo temporarily logs all user agents into a database table and afterwards a command could calculate the most popular user agents from the own database and create a cache for them.

This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.

@Findus23
Copy link
Member

This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.

So maybe can we just add the mode where all user agents tracked are logged to a file to Matomo (unless it exists already)? That way one could simply create ones own statistics (and easier contribute to DeviceDetector). A simple config.ini.php parameter that logs to one file should be enough.
As at the moment it is hard for me to say which are the most popular user agents as I can only check the access.log which has far more bots that would be logged by Matomo.

@sgiehl
Copy link
Member

sgiehl commented Aug 12, 2019

Only shipping the warmed cache with matomo won't be very effective. It would need to be changed to frequently. E.g. whenever Google releases a minor update for Chrome, all devices automatically pulling the update will have a new user agent and the cache wouldn't help anymore. It might be more useful to provide that list/cache in a separate repo, like we do for search engines/socials or referrerspam. So the list would be shipped in vendor, but be pulled and updated weekly or so

@Findus23
Copy link
Member

The issue with an updater is that then we can't store the list as a PHP file (as fetching code from the internet opens up tons of new issues). But I guess fetching a txt or json file and then writing to a php file during updating should be fine.

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

Ideally we would make things really not too complicated since for say 99% of all Matomo's or more the device detector is not really an issue. And if so, there's always the possibility of using queued tracking etc. Detecting an agent takes us like 5ms per request without this cache so it is doing alright. Over time for sure we could also think about how to make things faster in general using this cache, that would be then maybe something for the device detector library so everyone who uses it would benefit from that cache if wanted? Not a priority though right now.

@tsteur
Copy link
Member

tsteur commented Aug 12, 2019

@katebutler small change of plans. We will put the logic into a new plugin, eg DeviceDetectorCache.

Few things need to be done:

  • Make sure we use dependency injection... It be better to have the Device detector factory like this:
class DeviceDetectorFactory {
   
    public function makeInstance($userAgent) { // note: not static method 
       return self::getInstance($userAgent);
    }

   // the old code with getInstance and the $deviceDetectorInstances needs to stay for BC
}

Maybe double check the code... Then everywhere where we need it we can do StaticContainer::get(DeviceDetectorFactory::class)->makeInstance($userAgent).

For backwards compatibility we probably need to leave the old static method in there...

  • generate new plugin using generate:plugin command (or similar)
  • I can create a new repo once we decided the name eg matomo-org/plugin-DeviceDetectorCache
  • We'll have a config/config.php where we overwrite the factory like this return array(DeviceDetectorFactory::class => DI\object(DeviceDetectorCacheFactory::class));

and we have a class like

class DeviceDetectorCacheFactory extends DeviceDetectorFactory{
 private $instances = array();
 public function makeInstance($userAgent){
    if (isset($this->instances[$userAgent])) return $this->instances[$userAgent];
    if (DeviceDetectorCache::isCached($userAgent)) {
        $instance = new DeviceDetectorCache();  
    } else {
      $instance = parent::makeInstance($userAgent);
   }
    $this->instances[$userAgent] = $instance;
return $instance;
 }
}

It's just roughly the code to give the idea how we can use DI here. It's fine to have the instances possibly cached in both Factory classes...

  • We will store the cached entries in the plugin eg plugins/DeviceDetectorCache/useragents

  • Not sure yet if we publish it on the marketplace but likely we will

  • I will regularly update this plugin with say 10K user agents. So the plugin be later quite big.

public static function getInstance($userAgent, $useFileCache = true)
public static function getInstance($userAgent)
{
return (new DeviceDetectorFactory())->makeInstance($userAgent);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to write $factory = new DeviceDetectorFactory(); and then on newline $factory->makeInstance()

use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class WarmDeviceDetectorCache extends ConsoleCommand
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command you want to remove from core :)

@tsteur tsteur merged commit 444e841 into 3.x-dev Aug 25, 2019
@tsteur tsteur deleted the 14585 branch August 25, 2019 22:03
@mattab mattab added the not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. label Oct 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Device detector cache
5 participants