New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only use provider to detect country if IP is not anonymized and lookup is not explicitly disabled (?dp=1) #17033
Only use provider to detect country if IP is not anonymized and lookup is not explicitly disabled (?dp=1) #17033
Conversation
* Lookup provider information only if IP is not anonymized and lookup is allowed Looking up provider information requires a DNS reverse lookup which can severely slow down bulk tracking. Only try if IP is *not* anonymized and provider lookups are not explicitly disabled for the request (?dp=1).
// which will slow down tracking. | ||
// Only attempt if IP is not anonymized and provider lookups are not explicitly | ||
// disabled for this request (?dp=1). | ||
$anonymizedIP = substr($userInfo['ip'], -2, 2) == '.0'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi this might only work for IPv4 addresses I suppose. I haven't tested but it might work to do
$isAnonymizedIP = $userInfo['ip'] != $request->getIpString();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that is probably true. This particular check is copied from the Provider
plugin. That should probably adjusted then as well. I’ll have a look.
Checking against the IP from the query is probably not enough, since it could already be sent anonymized, in which case I guess the IP’s would still match and not be properly recognized as anonymized.
I will look into the code that takes care of anonymozation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I wonder if this piece of code shouldn't be removed here at all. Imho it is unexpected behaviour that the country is guessed based on the provider when eg a GeoIP2 database is used, but doesn't return a country. Maybe this should be part of the default location provider, that guesses the country based on the language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure it could be moved to the default location provider 👍 would we try to detect first by language or by provider? (I suppose by provider if installed?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the responsibility of the different plugins I would think that UserCountry
is the correct plugin to determine the country by various methods of course, but I would see the UserCountry
plugin ask the default location provider (got mixed up with provider plugin at first) which in turn would ask the Provider
plugin (if available) for the provider info. I don't know if that is possible at the moment, since I don't know anything about the inner workings of Matomo at all, but it would make sure that the provider (hostname) is only queried once and cached in memory. Does that make sense? If you can tell me if it there's a way to save the provider info so that other plugins can access that if already available or ask the provider plugin if not, I would gladly make the adjustments necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukele It's actually not that easy. The user country plugin actually already should use the configured location provider to fetch the location information. So it would be fine to move the provider check to the default location provider.
Using the information, the provider plugin might hold, won't be possible without some changes to it. Problem is, that the provider plugin itself also uses the configured location provider to fetch the provider (e.g. for geoip2). So doing this could end up in an endless loop.
Had a look at the code myself to check how it would be solvable. Quickly pushed a poc to a new branch. Didn't think too much about that yet or did some testing. If you have some time feel free to adjust your changes accordingly and maybe do some testing 🙈
https://github.com/matomo-org/matomo/compare/moveprovidercountryguess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgiehl That's pretty much the changes that I had locally, but was stuck on checking if an IPv6 is anonymized (not so easy to determine, it seems). In regards to the endless loop, I was thinking about the following solution.
The Provider plugin (Provider Column Class?) exposes a method getHostnameForIP() which caches its result via Visitor->setVisitorColumn (only way I found by looking at the code to cache values). This method can be used by the Provider Column class itself in onNewVisit. So if the default location queries the hostname from the Provider Column, Piwik\Plugins\Provider\Columns\Provider->onNewVisit would find the provider already by querying $this->getLocationDetail($userInfo, LocationProvider::ISP_KEY). That way no endless loop should occur. Am I missing something? Is there a better way for temporarily caching the hostname besides using Visitor->setVisitorColumn?
What I noticed about your code is that you check if $privacyConfig->useAnonymizedIpForVisitEnrichment
in order to determine if the hostname should be fetched or not. The problem with this is, that it would also ignore imported access_log files which might still include non-anonymized IPs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That way no endless loop should occur. Am I missing something? Is there a better way for temporarily caching the hostname besides using Visitor->setVisitorColumn?
You should avoid calling setVisitorColumn
for caching purpose. If you want to shortly cache something, you can use the transient cache eg \Piwik\Cache::getTransientCache()
. Those information will be lost after the current process.
What I noticed about your code is that you check if $privacyConfig->useAnonymizedIpForVisitEnrichment in order to determine if the hostname should be fetched or not. The problem with this is, that it would also ignore imported access_log files which might still include non-anonymized IPs.
Imported logs are going through the normal tracking request as well. That means if IP anonymization is enabled and the log contains a full IP, Matomo will anonymize that as well. Problem would be if it would be vise versa. E.g. the log contains only anonymized IPs, but Matomo wouldn't anonymize them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukele are you planning to finish this PR? Would be awesome if we could include it in the next release.
If you need any further help don't hesitate to ask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgiehl Yes, I hope to finish it so it can be reviewed by Monday.
@lukele just checking in if you are still working on this issue? |
Closing in favor of #19085 |
Looking up provider information requires a DNS reverse lookup which can severely
slow down bulk tracking.
Only try if IP is not anonymized and provider lookups are not explicitly disabled for the request (?dp=1).
Description:
The import_logs.py server log analytics import script allows to disable reverse DNS lookups to speed up bulk imports using the query parameter
dp=1
. While theProvider
plugin does properly respect the parameter and skips any lookups if the parameter is set, theUserCountry
plugin ignores the parameter.In my case this adds about 0.4s for each imported hit.
The only workaround would be to disable the
UserCountry
plugin completely which is not great, since it should still be used for standard website tracking.In addition, in my case the IP addresses were already sent anonymized and again, while the
Provider
properly skips any reverse lookups – that would fail – if the IP is anonymized, theUserCountry
plugin takes any IP to try to detect the country based on provider information.Review