Cache parsed device detection info to file #14751
Conversation
```php
// We use hash subdirs to prevent adding too many files to the same dir
$hashDir = PIWIK_DOCUMENT_ROOT . self::CACHE_DIR . substr($hashedUserAgent, 0, 2);
if (!file_exists($hashDir)) {
    mkdir($hashDir);
```
Be good to create the directory only when needed, meaning when writing a cache entry. And in general we have a method for this btw: `\Piwik\Filesystem::mkdir()`.

Otherwise, if we create directories for so many user agents, they would pop up in the file integrity check, and each tracking request would check whether the directory exists even when it is not needed. The `file_exists` in `isCached` will be enough 👍
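A rough sketch of what that could look like (the method name `writeToCache` and the serialization format are assumptions for illustration; the constant and variable names follow the snippets in this PR):

```php
// Hypothetical write path: create the hash subdirectory lazily, only
// when a cache entry is actually written. \Piwik\Filesystem::mkdir()
// creates directories recursively and is safe to call when the
// directory already exists, so no file_exists() guard is needed here.
public static function writeToCache($hashedUserAgent, array $parsedInfo)
{
    $hashDir = PIWIK_DOCUMENT_ROOT . self::CACHE_DIR . substr($hashedUserAgent, 0, 2);
    \Piwik\Filesystem::mkdir($hashDir);

    file_put_contents(
        $hashDir . '/' . $hashedUserAgent . '.php',
        '<?php return ' . var_export($parsedInfo, true) . ';',
        LOCK_EX
    );
}
```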
Left a comment. Seems to work otherwise though 👍
```php
$deviceDetector->setCache(new DeviceDetectorCache(86400));
$deviceDetector->parse();

if ($useFileCache && DeviceDetectorCacheEntry::isCached($userAgent)) {
    $deviceDetector = new DeviceDetectorCacheEntry($userAgent);
```
Not quite understanding what `discardBotInformation()` does, but do we need to call it as well?
```php
protected function configure()
{
    $this->setName('devicedetector:warmcache');
```
When I run the command, I get this error:

```
[UnexpectedValueException]
RecursiveDirectoryIterator::__construct(/misc/useragents): failed to open dir: No such file or directory
```

We could create the directory already in git and put a `.gitkeep` in there?
```php
protected function execute(InputInterface $input, OutputInterface $output)
{
    $this->clearCacheDirectory();
```
Can you maybe add an option for this? E.g. when core ships, say, 50 cached entries, we might want to add more entries but not remove the core ones.
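For illustration, such an option might look like this in the command (the option name `--clear-existing` is an assumption, not code from this PR; it also needs `use Symfony\Component\Console\Input\InputOption;`):

```php
// Hypothetical: only clear the cache directory when explicitly
// requested, so entries shipped with core survive a default warm-up run.
protected function configure()
{
    $this->setName('devicedetector:warmcache');
    $this->addOption(
        'clear-existing',
        null,
        InputOption::VALUE_NONE,
        'Remove existing cache entries (e.g. those shipped with core) before warming'
    );
}

protected function execute(InputInterface $input, OutputInterface $output)
{
    if ($input->getOption('clear-existing')) {
        $this->clearCacheDirectory();
    }
    // ... continue warming the cache, keeping any entries not overwritten
}
```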
@sgiehl could you have a quick look at the PR maybe as well?

@mattab what are your thoughts on delivering cached user agents with the Matomo release to make device detection faster? It would be an ongoing task though to keep the most recently used user agents in the release. And if we wanted to take big advantage of it, and have like 3-6K user agents in there, it would take quite a while to generate them on demand. This would basically need to happen when creating the release, since we wouldn't want to run a test that takes 2-3 minutes just to check that all cache entries are still up to date (2-3 minutes is for say 3K user agents, resulting in 12MB disk size, and we wouldn't want to make the release that much bigger I would say). Could cache like the top 50 entries with core? But for it to be useful we would always need to update it regularly. @sgiehl maybe you have some thoughts as well on this? On prod we would potentially cache up to 5-10K user agents.
I think I found a typo (see comment).
And maybe I am missing something, but why are we creating new files instead of using the existing caching system (which could use Redis, etc.)? Wouldn't this make things slower on setups where the disk is incredibly slow (network disk, etc.)?

In addition, as it is in misc/ and not in tmp/, this creates another directory one has to make sure is secured and not public, since it is in the webroot (but I guess in that case it doesn't really matter).
```php
{
    $matches = array();
    foreach (self::$userAgentsPatternsToIgnore as $pattern) {
        preg_match('/Amazon-Route53-Health-Check-Service[.]*/', $userAgent, $matches);
```
Should it be `$pattern` here instead of the hardcoded value?
```php
        if ($matches) {
            return false;
        }
    }
```
This block of code is kind of broken. If you extended the list in `self::$userAgentsPatternsToIgnore`, it would still try to match the Amazon pattern for each entry, as `$pattern` is not used.

But besides that, I wouldn't handle a user agent exclude in the warmup script at all (see other comment).
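The fix being described could look roughly like this (the method name `shouldCache` is made up for the sketch; the pattern list entry is the one from the PR):

```php
private static $userAgentsPatternsToIgnore = array(
    '/Amazon-Route53-Health-Check-Service[.]*/',
    // extending this list now works, because $pattern is actually used below
);

public static function shouldCache($userAgent)
{
    foreach (self::$userAgentsPatternsToIgnore as $pattern) {
        // match against $pattern instead of a hardcoded regex
        if (preg_match($pattern, $userAgent)) {
            return false;
        }
    }
    return true;
}
```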
```php
    'os' => $deviceDetector->getOs()
);

$outputPath = DeviceDetectorCacheEntry::getCachePath($userAgentStr);
```
I would suggest fully excluding detected bots from the caching, as they will be excluded from tracking in most cases anyway.

Note: bot detection is always very fast, as all other detections are skipped if a bot is detected, so there shouldn't be any reason to cache it.
@sgiehl wouldn't it still be much faster to cache it? In this case the point wasn't about excluding bots as such, but about never caching agents which may contain personal data, to prevent them being shipped with Matomo.
It would be a bit faster. But imho there isn't much sense in speeding up tracking requests that won't be tracked at all. And there are other bots that might include "personal" data, like the WordPress crawler, which includes the URL.
> And there are other bots that might include "personal" data, like the WordPress crawler, which includes the URL

Awesome, if you know of some, that'd be really helpful so we can prevent ever shipping them with Matomo. Likely won't happen anyway, but still good to document 👍

@Findus23 Re misc vs tmp: it's not in tmp because it doesn't need to be writable; it'll be shipped with Matomo. We don't want to write any cache entries at runtime: e.g. think of having 100K users, each caching 10K user agents, making it like 100K x 40MB cache (per 10K user agents) = 4TB. Also, you could send heaps of random user agents and it would create heaps of cache files. Instead we only want to cache the top X most used user agents.
I'm not sure it is possible to find a list of user agents that are common for most Matomo users, as depending on the website's target group and country they might differ quite a bit. So maybe we should add a mode where Matomo temporarily logs all user agents into a database table, and afterwards a command could calculate the most popular user agents from the user's own database and create a cache for them.
We would use like the top 30 user agents that we track. I reckon we track quite a broad mix of sites, and in general the top 30 user agents are likely somewhat used everywhere (e.g. Chrome with around 60% market share). This cache is more of a nice-to-have for regular Matomo users and only really useful for very high-traffic Matomos, where saving a few ms can make a huge difference, like when you have say 100M requests a day.
This mode is for us basically a log/csv file and the warm cache command... we wouldn't want to put anything in the DB since it would make things slower for us in the end.
So maybe we can just add the mode where all tracked user agents are logged to a file (unless it exists already)? That way one could simply create one's own statistics (and contribute to DeviceDetector more easily). A simple config.ini.php parameter that logs to one file should be enough.
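As a sketch of that idea (the `[Tracker] log_user_agents` setting and the log path are invented for illustration; no such config option exists):

```php
// Hypothetical: when the config flag is set, append each tracked user
// agent to a plain log file for later analysis / cache warming.
$trackerConfig = \Piwik\Config::getInstance()->Tracker;

if (!empty($trackerConfig['log_user_agents'])) {
    file_put_contents(
        PIWIK_DOCUMENT_ROOT . '/tmp/logs/useragents.log',
        $userAgent . "\n",
        FILE_APPEND | LOCK_EX
    );
}
```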
Only shipping the warmed cache with Matomo won't be very effective, as it would need to be changed too frequently. E.g. whenever Google releases a minor update for Chrome, all devices automatically pulling the update will have a new user agent and the cache wouldn't help anymore. It might be more useful to provide that list/cache in a separate repo, like we do for search engines/socials or referrer spam. The list would then be shipped in vendor, but be pulled and updated weekly or so.
The issue with an updater is that then we can't store the list as a PHP file (as fetching code from the internet opens up tons of new issues). But I guess fetching a txt or json file and then writing to a php file during updating should be fine.
Ideally we would not make things too complicated, since for say 99% or more of all Matomos the device detector is not really an issue, and if it is, there's always the possibility of using queued tracking etc. Detecting an agent takes us like 5ms per request without this cache, so it is doing alright. Over time we could also think about how to make things faster in general using this cache; that would then maybe be something for the device detector library itself, so everyone who uses it could benefit from the cache if wanted. Not a priority right now though.
@katebutler small change of plans. We will put the logic into a new plugin. A few things need to be done:
```php
class DeviceDetectorFactory {
    public function makeInstance($userAgent) { // note: not a static method
        return self::getInstance($userAgent);
    }
    // the old code with getInstance and the $deviceDetectorInstances needs to stay for BC
}
```

Maybe double check the code... Then everywhere where we need it we can do ... For backwards compatibility we probably need to leave the old static method in there...
and we have a class like

```php
class DeviceDetectorCacheFactory extends DeviceDetectorFactory
{
    private $instances = array();

    public function makeInstance($userAgent)
    {
        if (isset($this->instances[$userAgent])) {
            return $this->instances[$userAgent];
        }
        if (DeviceDetectorCache::isCached($userAgent)) {
            $instance = new DeviceDetectorCache();
        } else {
            $instance = parent::makeInstance($userAgent);
        }
        $this->instances[$userAgent] = $instance;
        return $instance;
    }
}
```

It's just roughly the code to give the idea of how we can use DI here. It's fine to have the instances possibly cached in both factory classes...
```php
public static function getInstance($userAgent, $useFileCache = true)
public static function getInstance($userAgent)
{
    return (new DeviceDetectorFactory())->makeInstance($userAgent);
```
Maybe better to write `$factory = new DeviceDetectorFactory();` and then on a new line `$factory->makeInstance()`.
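I.e. (same behaviour, just split over two lines for readability):

```php
$factory = new DeviceDetectorFactory();
return $factory->makeInstance($userAgent);
```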
```php
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class WarmDeviceDetectorCache extends ConsoleCommand
```
This command you want to remove from core :)
Fixes #14585