Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only use provider to detect country if IP is not anonymized and lookup is not explicitly disabled (?dp=1) #17033

Closed

Conversation

lukele
Copy link

@lukele lukele commented Dec 31, 2020

Looking up provider information requires a DNS reverse lookup which can severely
slow down bulk tracking.
Only try if IP is not anonymized and provider lookups are not explicitly disabled for the request (?dp=1).

Description:

The import_logs.py server log analytics import script allows to disable reverse DNS lookups to speed up bulk imports using the query parameter dp=1. While the Provider plugin does properly respect the parameter and skips any lookups if the parameter is set, the UserCountry plugin ignores the parameter.
In my case this adds about 0.4s for each imported hit.

The only workaround would be to disable the UserCountry plugin completely which is not great, since it should still be used for standard website tracking.

In addition, in my case the IP addresses were already sent anonymized and again, while the Provider properly skips any reverse lookups – that would fail – if the IP is anonymized, the UserCountry plugin takes any IP to try to detect the country based on provider information.

Review

  • Functional review done
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

* Lookup provider information only if IP is not anonymized and lookup is allowed

Looking up provider information requires a DNS reverse lookup which can severely
slow down bulk tracking.
Only try if IP is *not* anonymized and provider lookups are not explicitly disabled for the request (?dp=1).
// which will slow down tracking.
// Only attempt if IP is not anonymized and provider lookups are not explicitly
// disabled for this request (?dp=1).
$anonymizedIP = substr($userInfo['ip'], -2, 2) == '.0';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi this might only work for IPv4 addresses I suppose. I haven't tested but it might work to do

$isAnonymizedIP = $userInfo['ip'] != $request->getIpString();

Copy link
Author

@lukele lukele Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that is probably true. This particular check is copied from the Provider plugin. That should probably adjusted then as well. I’ll have a look.

Checking against the IP from the query is probably not enough, since it could already be sent anonymized, in which case I guess the IP’s would still match and not be properly recognized as anonymized.
I will look into the code that takes care of anonymozation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I wonder if this piece of code shouldn't be removed here at all. Imho it is unexpected behaviour that the country is guessed based on the provider when eg a GeoIP2 database is used, but doesn't return a country. Maybe this should be part of the default location provider, that guesses the country based on the language.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure it could be moved to the default location provider 👍 would we try to detect first by language or by provider? (I suppose by provider if installed?)

Copy link
Author

@lukele lukele Jan 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the responsibility of the different plugins I would think that UserCountry is the correct plugin to determine the country by various methods of course, but I would see the UserCountry plugin ask the default location provider (got mixed up with provider plugin at first) which in turn would ask the Provider plugin (if available) for the provider info. I don't know if that is possible at the moment, since I don't know anything about the inner workings of Matomo at all, but it would make sure that the provider (hostname) is only queried once and cached in memory. Does that make sense? If you can tell me if it there's a way to save the provider info so that other plugins can access that if already available or ask the provider plugin if not, I would gladly make the adjustments necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukele It's actually not that easy. The user country plugin actually already should use the configured location provider to fetch the location information. So it would be fine to move the provider check to the default location provider.
Using the information, the provider plugin might hold, won't be possible without some changes to it. Problem is, that the provider plugin itself also uses the configured location provider to fetch the provider (e.g. for geoip2). So doing this could end up in an endless loop.
Had a look at the code myself to check how it would be solvable. Quickly pushed a poc to a new branch. Didn't think too much about that yet or did some testing. If you have some time feel free to adjust your changes accordingly and maybe do some testing 🙈

https://github.com/matomo-org/matomo/compare/moveprovidercountryguess

Copy link
Author

@lukele lukele Jan 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiehl That's pretty much the changes that I had locally, but was stuck on checking if an IPv6 is anonymized (not so easy to determine, it seems). In regards to the endless loop, I was thinking about the following solution.

The Provider plugin (Provider Column Class?) exposes a method getHostnameForIP() which caches its result via Visitor->setVisitorColumn (only way I found by looking at the code to cache values). This method can be used by the Provider Column class itself in onNewVisit. So if the default location queries the hostname from the Provider Column, Piwik\Plugins\Provider\Columns\Provider->onNewVisit would find the provider already by querying $this->getLocationDetail($userInfo, LocationProvider::ISP_KEY). That way no endless loop should occur. Am I missing something? Is there a better way for temporarily caching the hostname besides using Visitor->setVisitorColumn?

What I noticed about your code is that you check if $privacyConfig->useAnonymizedIpForVisitEnrichment in order to determine if the hostname should be fetched or not. The problem with this is, that it would also ignore imported access_log files which might still include non-anonymized IPs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That way no endless loop should occur. Am I missing something? Is there a better way for temporarily caching the hostname besides using Visitor->setVisitorColumn?

You should avoid calling setVisitorColumn for caching purpose. If you want to shortly cache something, you can use the transient cache eg \Piwik\Cache::getTransientCache(). Those information will be lost after the current process.

What I noticed about your code is that you check if $privacyConfig->useAnonymizedIpForVisitEnrichment in order to determine if the hostname should be fetched or not. The problem with this is, that it would also ignore imported access_log files which might still include non-anonymized IPs.

Imported logs are going through the normal tracking request as well. That means if IP anonymization is enabled and the log contains a full IP, Matomo will anonymize that as well. Problem would be if it would be vise versa. E.g. the log contains only anonymized IPs, but Matomo wouldn't anonymize them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukele are you planning to finish this PR? Would be awesome if we could include it in the next release.
If you need any further help don't hesitate to ask.

Copy link
Author

@lukele lukele Jan 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgiehl Yes, I hope to finish it so it can be reviewed by Monday.

@tsteur tsteur added this to the 4.2.0 milestone Jan 1, 2021
@sgiehl sgiehl added the Waiting for user feedback Indicates the Matomo team is waiting for feedback from the author or other users. label Jan 14, 2021
@mattab mattab modified the milestones: 4.2.0, 4.3.0 Feb 22, 2021
@sgiehl sgiehl removed the Needs Review PRs that need a code review label Mar 22, 2021
@sgiehl sgiehl marked this pull request as draft March 22, 2021 09:37
@mattab mattab modified the milestones: 4.3.0, 4.4.0 May 26, 2021
@mattab mattab modified the milestones: 4.4.0, 4.5.0 Jul 28, 2021
@justinvelluppillai justinvelluppillai modified the milestones: 4.5.0, 4.6.0 Oct 7, 2021
@tsteur
Copy link
Member

tsteur commented Oct 14, 2021

@lukele just checking in if you are still working on this issue?

@sgiehl sgiehl modified the milestones: 4.6.0, 4.7.0 Nov 18, 2021
@sgiehl sgiehl modified the milestones: 4.7.0, 4.8.0 Jan 18, 2022
@justinvelluppillai justinvelluppillai modified the milestones: 4.8.0, 4.9.0 Mar 1, 2022
@justinvelluppillai justinvelluppillai modified the milestones: 4.9.0, 4.10.0 Apr 12, 2022
@sgiehl
Copy link
Member

sgiehl commented Apr 12, 2022

Closing in favor of #19085

@sgiehl sgiehl closed this Apr 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for user feedback Indicates the Matomo team is waiting for feedback from the author or other users.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants