Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Importing logs very slow - 2 days for 1 Gb #10248

Closed
Synchro opened this issue Jun 22, 2016 · 5 comments
Closed

Importing logs very slow - 2 days for 1 Gb #10248

Synchro opened this issue Jun 22, 2016 · 5 comments
Labels
answered For when a question was asked and we referred to forum or answered it. c: Performance For when we could improve the performance / speed of Matomo.
Milestone

Comments

@Synchro
Copy link

Synchro commented Jun 22, 2016

I've just started using Piwik to analyse some web server logs (vanilla nginx access logs) for 2 sites on the same server, one with around 9 million log lines, the other with 24million, combined compressed log size around 1Gb, covering 1 year of traffic. I'm only interested in file-based analysis, so I'm not using the javascript tracker at all.

The import worked, but it was very, very slow, taking over 2 days for the import. The server has 5 CPUs and 8G of RAM and is running PHP 5.4 and MySQL 5.6, and is generally lightly loaded. Here are the results of the second site's import using 3 importer processes:

Logs import summary
-------------------

    24628478 requests imported successfully
    153417 requests were downloads
    1 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        1 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    24628478 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 94279 seconds
    Requests imported per second: 261.23 requests per second

I don't have the stats from the first import, but it was also very long.

The resulting MySQL database is 6.4Gb on disk. I have started the Piwik archiving process to process these logs, but there is no indication of how long it might take - is there any way to find out? So far I've got feedback like this:

INFO [2016-06-22 22:09:02] Starting Piwik reports archiving...
INFO [2016-06-22 22:09:03] Will pre-process for website id = 1, period = day, date = last1000
INFO [2016-06-22 22:09:03] - pre-processing all visits
INFO [2016-06-22 22:58:54] Archived website id = 1, period = day, 0 segments, 0 visits in last 1000 days, 0 visits today, Time elapsed: 2991.650s
INFO [2016-06-22 22:58:54] Will pre-process for website id = 1, period = week, date = last1000
INFO [2016-06-22 22:58:54] - pre-processing all visits

So to pre-process all daily stats for 1 site took about 50 mins. I'm concerned that the amount of CPU and database activity it's using may affect the site.

Is this kind of performance normal for Piwik? I've previously used AWStats to process similar-sized logs and had it take only a few minutes.

@Synchro
Copy link
Author

Synchro commented Jun 23, 2016

The archiving phase took about 2 hours.

@tsteur
Copy link
Member

tsteur commented Jun 23, 2016

In case the "Provider" plugin is enabled, try to disable it. This can make it much faster

@mattab
Copy link
Member

mattab commented Jul 8, 2016

In case the "Provider" plugin is enabled, try to disable it. This can make it much faster

fyi @tsteur the Provider plugin is automatically disabled by the log importer, unless --enable-reverse-dns is specified

Requests imported per second: 261.23 requests per second

It's decent number, but to improve performance of the log import process, one can try to increase the imported requests per second. We have before got 600 or even 800 requests per second using log analytics. It will help performance if you use the QueuedTracking plugin: http://plugins.piwik.org/QueuedTracking

@mattab mattab added the c: Performance For when we could improve the performance / speed of Matomo. label Jul 8, 2016
@Synchro
Copy link
Author

Synchro commented Jul 12, 2016

Since this import I set up Piwik to do an hourly import/analyse run as per the docs. What I'm seeing is spectacularly bad performance that's interfering with normal server operation. It's clear from my activity monitoring that Piwik restarts its analysis from the begining of the log file every time it's run, and it hits the database for every single record, every time - so as the current log file gets bigger, Piwik's processing time gets longer and longer, until the log file is rotated and it drops back down again.

You can see this happening in this graph: the little peaks get bigger through the day:

iostat-day

Over a week it looks like this:

iostat-week

This doest look like much, but it's actually saturating disk access. Here's the database POV:

mysql_commands-week

This is all due to Piwik activity because the app on this server only uses MyISAM, and this graph only shows InnoDB, which Piwik uses. You can see here the InnoDB activity since installing Piwik: the initial import (that I originally opened this ticket for) followed by 2 weeks of hourly imports:

mysql_commands-month

What I gather from this is that Piwik does not know how to skip records that have been seen before - so rather than looking up the latest timestamp in the DB and then scanning the log file and ignoring all lines prior to that, or storing a log file inode and byte ofset so that it can jump straight to where it left off, it starts from the beginning every time and absolutely hammers the database and iowait.

I'm not using reverse DNS lookups, I don't need them for this case.

Is this normal?

@mattab
Copy link
Member

mattab commented Jul 12, 2016

@Synchro Thanks for the feedback. it is "normal" that Piwik works in this way, but we are aware this is not good enough and we hope to do better in the future 👍 I've created a new issue to discuss this particular problem: Log Analytics could detect log lines that were already imported and skip them automatically #144 I hope the description helps. for now you may have to workaround this issue by importing each log file only once, eg. once a day.

@mattab mattab added this to the 2.16.x (LTS) milestone Jul 15, 2016
@mattab mattab closed this as completed Jul 15, 2016
@mattab mattab added the answered For when a question was asked and we referred to forum or answered it. label Jul 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
answered For when a question was asked and we referred to forum or answered it. c: Performance For when we could improve the performance / speed of Matomo.
Projects
None yet
Development

No branches or pull requests

3 participants