Piwik an alternative to AWStats and Urchin, build server log import script #703
Comments
First release of the script committed in [6046] |
(In [6051]) Refs #703
|
(In [6053]) refs #703 - propset eol-style |
Performance-wise: I've set up piwik in its own jail now, turned off unnecessary PHP extensions, tweaked apache, and enabled APC. If I use --recorders=48 I get good import speeds (at least at first) without the load average going too high. However, something odd happens, and some way through importing a log file the recorders drop off (I can see fewer and fewer apache processes too, so clearly it's just not being hit as much):
I don't think I have any weird throttling going on - any ideas what might be up? There's nothing else being output during the processing even with debugging on. The drop-off seems to start roughly half way through any given logfile. |
oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't need more recorders than the number of cores in your system, and even slightly fewer is reasonable (as the import script and MySQL will run at the same time). As for why your performance decreases over time, I don't know. What does 'top' say? You'd have to find the bottleneck: it may be Apache, PHP, or MySQL. On my system, I have a sustained 300 req/s for more than 3 hours. Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :) |
(In [6070]) Refs #703 Removing images from "downloads", and improving TIP message in output debug |
(In [6071]) Refs #703 Improving help message as per Cyril feedback |
(In [6074]) Refs #703 Display response output when tracking request failed (this happens for example when debug is enabled in piwik.php) |
Replying to Cyril:
I did quite a few experiments, and eventually found that 40 is about right. This is a VM running on a high powered Dell R710, so although the OS only thinks it has 4 CPUs I don't know how things actually pan out. All I know is that the number of records/sec increases pretty much linearly with --recorders up until 40. E.g. if I run at 32, I get more like 200r/sec rather than 250+r/sec. A single recorder manages around 6-7r/sec. After 40 the benefits tail off.

I also tried a few experiments to see where the bottleneck might lie. For instance, I stuck in a mod_rewrite to send the importer to a basic PHP file that just returned the .gif without doing any processing, but weirdly the performance was about the same. However, running with --dry-run (or just removing the line which actually calls the script) means the python script runs at around 4000r/sec, so I can only conclude the limit is in apache/php (putting in APC definitely helped). I also tried hacking the script to run a PHP wrapper script that called piwik.php directly on the command line, but it went horribly slowly, presumably because of the lag in loading up PHP. Anyway, I'm happy with 250-300r/sec. I may set up a separate VM with a tweaked kernel and optimised apache to deal with log imports anyway, so I'm sure I can improve on that figure.

Regarding the steady tailing-off, what I'm wondering is: when you specify lots of recorders, do they each grab an equal number of log lines at the start then work through them? That would explain why some finish earlier than others (if e.g. one gets a lot of non-loggable lines it'd finish sooner). I notice the number of apache processes starts tailing off around half to 2/3 of the way through the log, and then just steadily declines until only 1 recorder is left.
That'd be brilliant, thank you. Thanks to you all for being so responsive in general too. Oliver. |
FYI the new 1.7.2-rc4 was released which includes the most up to date code: Download from: http://builds.piwik.org/?C=N;O=D |
oliverhumpage, thanks for your comments, it's very interesting! I haven't run that for a long time and never under high load such as 300 req/s, so it would be very interesting. If you install it, I would love to see the reports generated! The last time we ran XHProf on Piwik we found 2-3 quick fixes that made things a lot better. I'm sure we can make the tracker faster in many ways. It would also be good to know the % of consumption of Apache/PHP vs. MySQL (not sure of the best way to do this, however). |
oliverhumpage: regarding the recorders, each request will be dispatched to a specific recorder based on its IP address. It means that if the IP address distribution of your log files isn't "even", some recorders will have more work to do than others. Which could explain the performance issues you're having, especially near the end of the import process. This dispatching was required to make sure requests are imported in the correct order. |
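The dispatching described above can be illustrated with a minimal Python sketch. This is not the code from import_logs.py: the function names `recorder_index` and `dispatch`, and the use of a CRC32 hash, are assumptions made for illustration only. It shows why a given IP always lands on the same recorder (so its hits stay in order) and why an uneven IP distribution leaves some recorders with longer queues.

```python
import zlib
from collections import defaultdict


def recorder_index(ip, n_recorders):
    """Map an IP address to a recorder slot (hypothetical helper).

    The same IP always hashes to the same slot, so all hits from one
    visitor are handled by one recorder, in their original order.
    """
    return zlib.crc32(ip.encode("utf-8")) % n_recorders


def dispatch(hits, n_recorders):
    """Distribute (ip, url) hits into per-recorder queues, preserving order."""
    queues = defaultdict(list)
    for ip, url in hits:
        queues[recorder_index(ip, n_recorders)].append((ip, url))
    return queues


hits = [
    ("10.0.0.1", "/a"),
    ("10.0.0.2", "/b"),
    ("10.0.0.1", "/c"),
    ("10.0.0.1", "/d"),
]
queues = dispatch(hits, n_recorders=4)
for slot, queue in sorted(queues.items()):
    print(slot, len(queue))
```

If one IP (a proxy or a crawler, say) accounts for most of the log, its recorder keeps working long after the others have drained their queues, which would match the gradual drop-off reported earlier in the thread.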
Actually, I do have one small request for piwik itself. Would it be possible to choose on the fly between multiple database options: you see, I'm using one physical install of piwik at 2 different URLs - one for JS-based sites, and one for log-based, and therefore also 2 different sets of db tables so that --add-sites-new-hosts on the log-based system doesn't interfere with the JS websites (they'd have the same URLs). What I've done atm is set an environment var in apache and patch core/Config.php to set $config->database to either $config->database_weblog or $config->database_js depending on that env var. However, being able to define a constant like DATABASE_CONFIG_SECTION_NAME in bootstrap.php, which Config.php then used to work out which section of the config file to use, would be much easier and more robust. I could of course just have 2 different installs of piwik, but then I have to update it twice with each release. Probably not worth enlarging the codebase just for my weird setup, but thought I'd ask - I can easily submit a patch if you're interested. |
(In [6092]) Refs #703 import-logs.py renamed to import_logs.py and added a mini test suite which tests the format autodetection. |
(In [6093]) Refs #703 Many improvements:
|
(In [6094]) Refs #703 Added option --output to redirect output to a file. |
(In [6100]) Refs #703
|
(In [6102]) Refs #703 Add license notice, Shuffle help messages order, remove short notation for clarity, improve help messages, adding Java/ + bot- + bot/ + robot as a bot |
(In [6108]) Refs #703 I'm learning Python (NOT!) |
(In [6128]) Refs #703 Now works with Python 2.5. |
(In [6129]) Refs #703 Show the summary when CTRL+C is pressed. |
(In [6130]) Refs #703 Fixed bug with --log-format-regex (thanks oliverhumpage). |
(In [6131]) Refs #703 Disable buffering when using --output. |
(In [6132]) Refs #703 Added --query-string-delimiter |
(In [6133]) Refs #703 Added --enable-http-errors and --enable-http-redirects |
(In [6134]) Refs #703 Pretty print archives dates. |
oliverhumpage: thanks for the bug report and the suggestions, I should have committed everything you asked for :) Regarding the persistent connections, I haven't patched anything. It's a built-in feature of PHP/mysqli, see: http://www.php.net/manual/en/mysqli.construct.php "Prepending host by p: opens a persistent connection." |
(In [6135]) Refs #703
|
(In [6137]) Refs #703 README update + fixing --enable-reverse-dns now works + adding common bot names |
(In [6140]) Refs #703 Catch URL exceptions during configuration |
Good catch, the query string delimiter was always appended since the latest refactoring. That should be fixed now. It's indeed possible to exclude some paths: check out --exclude-path and --exclude-path-from. |
After 17.3M good lines, got another encode error, script version 6283 Fatal error: 'ascii' codec can't encode character u'\u2013' in position 20: ordinal not in range(128) One line log: http://mike.org.uk/test5_log.txt |
(In [6295]) Refs #703 Fixed encoding bug with a non-ASCII user-agent. |
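For readers unfamiliar with this class of error, here is a small sketch of what goes wrong when text containing U+2013 (an en dash) is encoded as ASCII, and how encoding as UTF-8 avoids it. The actual change made in [6295] is not shown in this thread, so the helper names and the sample user-agent string below are purely illustrative.

```python
# Sample user-agent containing an en dash (U+2013); the value is made up.
user_agent = u"ExampleBrowser/1.0 \u2013 beta"


def encode_ascii_strict(text):
    """Encode as ASCII; raises UnicodeEncodeError on non-ASCII characters."""
    return text.encode("ascii")


def encode_utf8(text):
    """Encode as UTF-8, which can represent any Unicode character."""
    return text.encode("utf-8")


try:
    encode_ascii_strict(user_agent)
    ascii_failed = False
except UnicodeEncodeError:
    # Same family of error as the one reported above.
    ascii_failed = True

utf8_bytes = encode_utf8(user_agent)
print(ascii_failed, utf8_bytes)
```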
tiouk: thanks again for the bug report, it's fixed. |
Doesn't work on python 3. 2to3 fixes most of the stuff, there's still some little snags like the base64 decoder expecting bytes and getting a string. |
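The base64 snag mentioned here is a common one when porting: on Python 3, `base64.b64encode` requires bytes and raises `TypeError` when given a `str`. The exact call that failed in the script is not shown in this thread, so the following is only a sketch of one way to wrap base64 so that text goes in and text comes out; the helper names are hypothetical.

```python
import base64


def b64encode_text(text, encoding="utf-8"):
    """Base64-encode a text string and return the result as text."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


def b64decode_text(data, encoding="utf-8"):
    """Base64-decode text or bytes and return the decoded text."""
    if isinstance(data, str):
        # Base64 output is pure ASCII, so this conversion is lossless.
        data = data.encode("ascii")
    return base64.b64decode(data).decode(encoding)


encoded = b64encode_text("piwik")
print(encoded)
print(b64decode_text(encoded))
```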
It's not expected to work with Python 3. It may be supported later, but that's definitely not a requirement for a very first version. Still, you're welcome to send patches to fix issues with 2to3, as long as they're rather simple (we wouldn't want to add much complexity just to support Python 3, at least not for now). |
AFAIK everything that runs on 3 should run on 2.7+, so I'd think also developing against 3 would avoid a later bigger update. Regarding patches: I'm no python dev and I don't have the resources to take care of possible problems with an interpreter not supported by upstream. I'll be happy to help when python 3 is supported, until then I'll use a python 2 slot for this. Thanks for the effort though, not being able to import log files has been a major blocker for piwik here :-) |
There are only 2-3 days left before release / freeze - is everyone happy with the script for a V1? |
I've just updated to latest and realised the addition of the file.seek(0) functions stop you being able to pipe in logs through stdin: is it possible to disable that with a flag? It's really, really useful being able to pipe logs straight from apache. |
oliverhumpage: seek is actually only used when the log format is autodetected, and I really can't see a way to avoid this (due to IIS). So you have to explicitly specify the format (with --log-format-name) when reading from stdin. Let me know if that doesn't work (it should). |
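To illustrate why autodetection and piped input do not mix, here is a minimal Python 3 sketch. It is not the script's real logic: `choose_format`, `guess_format`, `FormatRequired`, `PipeLike` and the labels "iis-like" / "apache-like" are all hypothetical. The idea is that detection has to peek at the first line and then rewind with `seek(0)`, which a pipe cannot do, so an explicit format skips the peek entirely.

```python
import io


class FormatRequired(Exception):
    """Raised when auto-detection is impossible and no format was given."""


def guess_format(first_line):
    # Toy detector (assumption): W3C/IIS-style logs start with '#' headers.
    return "iis-like" if first_line.startswith("#") else "apache-like"


def choose_format(stream, format_name=None):
    """Return the format to use, peeking at the stream only if needed."""
    if format_name is not None:
        return format_name  # explicit format: no peek, no seek
    if not stream.seekable():
        raise FormatRequired("cannot auto-detect on a non-seekable stream")
    first_line = stream.readline()
    stream.seek(0)  # rewind so parsing starts again from the first line
    return guess_format(first_line)


class PipeLike(io.StringIO):
    """Stand-in for stdin fed by a pipe: readable but not seekable."""

    def seekable(self):
        return False


log_file = io.StringIO("#Software: example\n")
print(choose_format(log_file))

piped = PipeLike('1.2.3.4 - - [01/Jan/2012] "GET / HTTP/1.1" 200 5\n')
print(choose_format(piped, format_name="apache-like"))
```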
Ah, you're right - if I specify a regex or name then it stops complaining, which is fine. However, there is still a problem that no lines are being read from stdin. If I import from a file with a couple of lines in, I get results (it says lines have been parsed). However, if I specify the file as "-" and copy/paste those same lines, nothing gets logged, not even an error. I switched on --debug and did this for you:
As you can see, pasting the lines in did nothing. Importing the exact same lines but from a file gave:
Also, --show-progress appears to have got itself switched on even though it's not specified in the command-line (this happens with or without --debug). Have I got something weird in my installation, or can you reproduce this? |
(In [6403]) Fixes #3139 Adding new 'bots' parameter to the Tracking API. When set to 1, Piwik will record the request even if it is made by a bot (currently only Googlebot and some Bing bots are detected). Refs #703 - Cyril, when --enable-bots is set, can you please make sure the parameter &bots=1 is also added to the piwik.php request? Only in this case though. Thanks! |
EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK! |
Replying to matt:
Awesome, RC3 seems to be importing everything nicely :) Thanks |
We have now set up a demo of log analytics in Piwik. The demo at http://demo-log-analytics.piwik.org/ has only 1 day of data for now. It has 2 websites to show the default import mode and the full mode (with bots, files, errors, etc.). I will now close this ticket as it is getting quite long, but I have opened another one to keep track of the next features: #3163. Please post all new bug reports and feature suggestions in that ticket: #3163 |
(In [6433]) Refs #703 Set bots=1 accordingly. |
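As a rough sketch of what "set bots=1 accordingly" means for the request sent to piwik.php: the parameter is appended only when bot tracking is enabled. The function name `build_tracking_query` and the `enable_bots` argument are hypothetical and not taken from the script; `idsite`, `rec` and `url` are standard Tracking API parameters, and `bots` is the one introduced in [6403].

```python
from urllib.parse import urlencode


def build_tracking_query(site_id, page_url, enable_bots=False):
    """Build the query string for a tracking request (illustrative only)."""
    params = [("idsite", site_id), ("rec", 1), ("url", page_url)]
    if enable_bots:
        # Ask Piwik to record the hit even if the user-agent is a known bot.
        params.append(("bots", 1))
    return urlencode(params)


print(build_tracking_query(1, "http://example.org/"))
print(build_tracking_query(1, "http://example.org/", enable_bots=True))
```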
Urchin Alternative: Import your server logs in Piwik, the Free web analytics platform!
See blog post Piwik alternative to Urchin for more information.
Piwik is an alternative to Urchin, and also to Webalizer and AWStats: with a Python script, you can now import web server logs (Apache, IIS, and more) into Piwik, instead of using the JavaScript tracking.
Description
A Python script available in piwik/misc/log-analytics/ will parse server logs efficiently and automatically call the Piwik Tracking API to inject the visits/pageviews/downloads in Piwik.
How to install / how to use
SEE FOLLOW UP TICKET #3163
How can you help?
Tasks to do before final release
Feature requests for V2 or later
SEE FOLLOW UP TICKET #3163