@mattab opened this Issue on May 22nd 2018 Member

We got a report which says: "it seems that it takes 0.08 seconds per request for it to lookup the GeoLite2-City.mmdb file, which is about 1.589 days of lookups based on the queue size in my earlier message."

This is for a customer using QueuedTracking who has 1.7 million requests in the queue and the queue doesn't process, likely because the Geoip2 lookup is too slow.
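
As a rough sanity check of the reported numbers, the time to drain the queue can be estimated by multiplying the per-request lookup time by the queue size. The sketch below uses the rounded figures quoted above (0.08 s per lookup, 1.7 million queued requests) and assumes the lookup is the only cost; the function name is made up for illustration and is not part of Matomo or QueuedTracking.

```php
<?php

// Hypothetical helper: estimate how many days it takes to drain a queue
// if every queued request pays a fixed lookup cost.
function estimateBacklogDays(int $queuedRequests, float $secondsPerLookup): float
{
    $totalSeconds = $queuedRequests * $secondsPerLookup;
    return $totalSeconds / 86400; // 86400 seconds per day
}

// Rounded figures from the report: 1.7 million requests at 0.08 s each.
printf("%.2f days\n", estimateBacklogDays(1700000, 0.08));
```

With these rounded inputs the estimate comes out slightly below the reported 1.589 days, which presumably was computed from the exact queue size.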

-> we need to investigate/profile the speed of Geoip2 lookup and figure out how long it takes and whether it regressed from Geoip1 in terms of performance.

Can we improve the performance in some ways? cc @sgiehl @diosmosis

@sgiehl commented on May 22nd 2018 Member

Maxmind provides a PHP extension that should improve performance: https://github.com/maxmind/MaxMind-DB-Reader-php/#optional-php-c-extension

@sgiehl commented on May 22nd 2018 Member

The reason why the lookup is "slow" is quite simple to explain. Each tracking request is handled in a separate process, which opens the mmdb file for lookup. Opening the mmdb and reading the metadata takes a while. The only way to speed up the lookup would be to handle everything in one process, so the mmdb only needs to be opened up once.
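
To illustrate the "open once per process" idea, here is a minimal sketch of a per-process registry that builds a reader on first use and hands back the same instance afterwards. The class name and the factory parameter are hypothetical and not part of Matomo or the MaxMind library; a stand-in factory is used so the example runs without the database file.

```php
<?php

// Hypothetical helper: keeps one reader per database path for the lifetime of the process.
final class SharedReaderRegistry
{
    /** @var array<string, object> */
    private static array $readers = [];

    /**
     * @param string   $path    database file path, used as the cache key
     * @param callable $factory builds the reader on first use
     */
    public static function get(string $path, callable $factory): object
    {
        if (!isset(self::$readers[$path])) {
            // The expensive open happens only once per path and process.
            self::$readers[$path] = $factory($path);
        }
        return self::$readers[$path];
    }
}

// Stand-in factory that counts how often the "database" is opened.
$opens = 0;
$factory = function (string $path) use (&$opens): object {
    $opens++;
    return new stdClass();
};

for ($i = 0; $i < 1000; $i++) {
    SharedReaderRegistry::get('misc/GeoLite2-City.mmdb', $factory);
}
echo "opened {$opens} time(s)\n";
```

This only helps inside a process that handles many lookups, such as a queue worker. It does nothing across separate PHP processes, each of which still has to open the file itself.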

@diosmosis commented on May 22nd 2018 Member

Does the PHP extension also create a new process?

@sgiehl commented on May 22nd 2018 Member

Good question. Didn't have a closer look at the C code, but I don't think it would make sense otherwise: https://github.com/maxmind/MaxMind-DB-Reader-php/blob/master/ext/maxminddb.c

@diosmosis commented on May 22nd 2018 Member

Hard to tell from the C code, guess a benchmark + more information from the user's setup are the next steps.

@fdellwing commented on May 22nd 2018 Contributor

I'm not a C pro, but judging by the README, it should be possible to create one global reader and query that reader every time?

global

use MaxMind\Db\Reader;

$reader = new Reader('GeoIP2-City.mmdb');

query

$foo = $reader->get($ipAddress);

Or are you not able to do this because you cannot access a pre-instantiated reader in the piwik.php calls?

@mattab commented on May 23rd 2018 Member

"The only way to speed up the lookup would be to handle everything in one process, so the mmdb only needs to be opened up once."

@sgiehl @diosmosis How do-able is it to make this change and improve performance when QueuedTracking is used?

@diosmosis commented on May 23rd 2018 Member

@mattab Do you know if they're using the PHP reader now? Do you know what they were using w/ GeoIP v1?

@mattab commented on May 23rd 2018 Member

They're using the default code we provide (no PHP extension) @diosmosis

@sgiehl commented on May 23rd 2018 Member

@mattab one reason why GeoIp2 might be slower: GeoIP Legacy did not include any IPv6 data. GeoIp2 includes all IPv6 data. That's a lot more data...

@mattab commented on May 23rd 2018 Member

Could we maybe profile the code to get to the bottom of the issue, understand exactly where the slowness comes from, and see whether it can be improved?

@sgiehl commented on May 23rd 2018 Member

I just ran a simple benchmark by running 40.000 dynamic IP lookups within my virtual machine.

With default PHP library without the extensions we have:
Requests per second: 67.4305675507 (0,01483 s / req)

With the extension installed:
Requests per second: 17707.655926848 (0,000056 s / req)

So everyone who wants to run fast lookups should install the extension. Nothing we could change on our side would come close to the speedup the extension provides.
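
For reference, the per-request times and the relative speedup follow directly from the two throughput figures above. The short sketch below only redoes that arithmetic with the numbers copied from the benchmark output; the variable names are illustrative.

```php
<?php

// Figures copied from the benchmark output above.
$withoutExtension = 67.4305675507;   // requests per second, pure PHP reader
$withExtension    = 17707.655926848; // requests per second, C extension

// Seconds per request is the reciprocal of requests per second.
printf("without extension: %.5f s / req\n", 1 / $withoutExtension);
printf("with extension:    %.6f s / req\n", 1 / $withExtension);

// Relative speedup of the extension over the pure PHP reader.
$speedup = $withExtension / $withoutExtension;
printf("speedup: about %.0fx\n", $speedup);
```

On these numbers the extension is roughly 260 times faster per lookup, measured in one virtual machine only.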

@mattab commented on May 23rd 2018 Member

Thanks @sgiehl - very useful...
Since the customer is waiting for our instructions, could you please propose documentation for our geolocation FAQs so they match the Geoip2 tool? Maybe this FAQ should be updated to include these instructions? https://matomo.org/faq/how-to/faq_164/

@sgiehl commented on May 23rd 2018 Member

@mattab I have updated the FAQ to describe how to install the extension. Maybe you could have a look and maybe check if it's easy enough to understand...

@tsteur commented on May 23rd 2018 Member

Could the FAQ entry also cover how to get notified when an update of the extension is available, and that it needs to be recompiled when the PHP version changes? If possible, that would be great, as it is important to keep extensions up to date for bug fixes and to avoid random issues.

@tsteur commented on May 23rd 2018 Member

Also, I suggest writing a blog post about this, and maybe we mention it in the next newsletter in the "did you know" section, as the performance drop can cause major problems for many Matomo installations. The release changelog should also mention it.

@tsteur commented on May 27th 2018 Member

Should we create a separate issue for the blog post and the newsletter entry?

@mattab commented on May 28th 2018 Member

"I just ran a simple benchmark by running 40.000 dynamic IP lookups within my virtual machine."

@sgiehl Could you please paste the benchmark script you ran? We'd like to run it in production on the powerful box and see how it behaves there.

Also, can you compare Geoip1 lookup VS Geoip2 (no extension) on your virtual machine?

@sgiehl commented on May 28th 2018 Member
<?php

require_once './vendor/autoload.php';

use GeoIp2\Database\Reader;
use GeoIp2\Exception\AddressNotFoundException;

$reader = new Reader('misc/GeoLite2-City.mmdb');
$count = 40000;
$startTime = microtime(true);
for ($i = 0; $i < $count; $i++) {
    $ip = long2ip(rand(0, pow(2, 32) - 1));
    try {
        $t = $reader->city($ip);
    } catch (AddressNotFoundException $e) {
        // IP is not in the database; ignore it and continue with the next lookup
    }
    if ($i % 1000 === 0) {
        echo $i . ' ' . $ip . "\n";
    }
}
$endTime = microtime(true);

$duration = $endTime - $startTime;
echo 'Requests per second: ' . $count / $duration . "\n";

This should be runnable directly in the Matomo home directory.

@tsteur commented on May 28th 2018 Member

Will this do the "heavy" work each time, or is anything cached?

@sgiehl commented on May 28th 2018 Member

It only opens the GeoIP database once, so it only measures the time needed for each lookup. During tracking, the database might get opened for each request (if not using QueuedTracking), so that comes on top. But maybe that's not the case when using the extension.

I guess that can easily be tested by moving the $reader = ... line into the for loop.
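
A minimal sketch of that comparison, opening once versus opening inside the loop, could look like the following. To keep it runnable without the GeoIP2 library or a database file, it uses a stand-in "reader" whose open cost is simulated with usleep(). The helper name, the stand-in and the 200 µs cost are all invented for illustration, and the range passed to mt_rand() assumes 64-bit PHP, like the original script.

```php
<?php

/**
 * Hypothetical helper: time $count calls of $lookup with random IPv4
 * addresses and return the achieved requests per second.
 */
function requestsPerSecond(callable $lookup, int $count): float
{
    $start = microtime(true);
    for ($i = 0; $i < $count; $i++) {
        $lookup(long2ip(mt_rand(0, 0xFFFFFFFF)));
    }
    $duration = microtime(true) - $start;
    return $count / max($duration, 1e-9); // guard against a zero duration
}

// Stand-in for opening a reader. The sleep simulates the open cost
// (200 microseconds is an invented number, not a measurement).
$openReader = function (int $openCostMicroseconds = 200): callable {
    usleep($openCostMicroseconds);
    return fn(string $ip): string => 'record for ' . $ip;
};

// Variant A: open once, reuse for every lookup (what the script above measures).
$reader = $openReader();
$openOnce = requestsPerSecond(fn(string $ip) => $reader($ip), 2000);

// Variant B: open inside the loop, as when each request opens the database.
$openEachTime = requestsPerSecond(fn(string $ip) => $openReader()($ip), 2000);

printf("open once:      %.0f req/s\n", $openOnce);
printf("open each time: %.0f req/s\n", $openEachTime);
```

To measure the real library, the stand-in would be replaced by new Reader('misc/GeoLite2-City.mmdb') and $reader->city($ip), with the AddressNotFoundException caught as in the script above.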

@tsteur commented on May 28th 2018 Member

We are basically trying to figure out how GeoIP1 compares to GeoIP2, to know whether we need the extension or not. Ideally we wouldn't need the extension.

@sgiehl commented on May 30th 2018 Member

Here's the same script to benchmark geoip legacy with city database:

<?php

require_once './libs/MaxMindGeoIP/geoipcity.inc';

$geoip = geoip_open('./misc/GeoIPCity.dat', GEOIP_STANDARD);

$count = 40000;
$startTime = microtime(true);
for ($i = 0; $i < $count; $i++) {
    $ip = long2ip(rand(0, pow(2, 32) - 1));
    $t = geoip_record_by_addr($geoip, $ip);
    if ($i % 1000 === 0) {
        echo $i . ' ' . $ip . "\n";
    }
}
$endTime = microtime(true);

$duration = $endTime - $startTime;
echo 'Requests per second: ' . $count / $duration . "\n";

For my local instance that results in:

Requests per second: 1710.4010527379 (0,000584658 s / req)

So compared to GeoIP Legacy, GeoIP2 is much faster with the extension, but a lot slower without it.

@tsteur commented on May 31st 2018 Member

GeoIp2 without extension https://github.com/matomo-org/matomo/issues/12955#issuecomment-392596769 took

Requests per second: 1991 (0.5ms)

GeoIp1 from https://github.com/matomo-org/matomo/issues/12955#issuecomment-393179263:

Requests per second: 36,000 - 42,000 (0.025ms/ req)

I have also tested the script slightly tweaked to open the DB each time in the for loop...

  • GeoIP1: 27500 requests per second (about 0.035ms/req)
  • GeoIP2: 950 requests per second only (about 1ms/req)

So it seems GeoIP2 is pretty much 1ms slower per request, which quickly adds 2-3% to each tracking request (give or take).

@tsteur commented on May 31st 2018 Member

And fyi, I tried to install the extension through git as described on https://matomo.org/faq/how-to/faq_164/, but there is a cd libmaxminddb missing and ./configure doesn't work.

@diosmosis commented on May 31st 2018 Member

You have to run ./bootstrap (it's in the git repo README). EDIT: ./bootstrap then ./configure .

@sgiehl commented on May 31st 2018 Member

I've improved the FAQ and mentioned the additional commands needed when cloning from git.

@tsteur commented on May 31st 2018 Member

What about the blog post announcement?

@tsteur commented on May 31st 2018 Member

I will probably draft something otherwise...

@sgiehl commented on June 18th 2018 Member

A blog post has been published. Should we maybe add a more visible note in the description of the provider that it is "slow" without the extension?
Otherwise I guess we can close this issue, as imho there is not much more we could do about it.

@tsteur commented on June 18th 2018 Member

I think a notice is not actually needed.
