Urchin Alternative: Import your server logs in Piwik, the Free web analytics platform!
See blog post Piwik alternative to Urchin for more information.
Piwik is an alternative to Urchin, and also to Webalizer and AWStats: with a Python script, you can now import web server logs (Apache, IIS, and more) into Piwik instead of using the JavaScript tracking.
Description
A Python script available in piwik/misc/log-analytics/ will parse server logs efficiently and automatically call the Piwik Tracking API to inject the visits/pageviews/downloads in Piwik.
How to install / how to use
SEE FOLLOW UP TICKET #3163
How you can help?
Tasks to do before final release
Feature requests for V2 or later
SEE FOLLOW UP TICKET #3163
(In [6051]) Refs #703
(In [6053]) refs #703 - propset eol-style
Performance-wise: I've set up piwik in its own jail now, turned off unnecessary PHP extensions, tweaked apache, and enabled APC. If I use --recorders=48 I get good import speeds (at least at first) without the load average going too high. However, something odd happens, and some way through importing a log file the recorders drop off (I can see fewer and fewer apache processes too, so clearly it's just not being hit as much):
2846 lines parsed, 233 lines recorded, 233 records/sec
4372 lines parsed, 506 lines recorded, 273 records/sec
[...]
8300 lines parsed, 7570 lines recorded, 9 records/sec
8300 lines parsed, 7579 lines recorded, 9 records/sec
8300 lines parsed, 7588 lines recorded, 9 records/sec
8300 lines parsed, 7598 lines recorded, 10 records/sec
I don't think I have any weird throttling going on - any ideas what might be up? There's nothing else being output during the processing even with debugging on. The drop-off seems to start roughly half way through any given logfile.
oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't have to exceed the number of cores in your system; you may even want to go a bit lower (as the import script and MySQL will run at the same time).
As for why your performance decreases over time, I don't know. What does a 'top' say? You'd have to find the bottleneck. It may be Apache, PHP, MySQL. On my system, I have a sustained 300 req/s for more than 3 hours.
Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)
(In [6070]) Refs #703 Removing images from "downloads", and improving TIP message in output debug
(In [6071]) Refs #703 Improving help message as per Cyril feedback
(In [6074]) Refs #703 Display response output when tracking request failed (this happens for example when debug is enabled in piwik.php)
Replying to Cyril:
oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't have to exceed the number of cores in your system; you may even want to go a bit lower (as the import script and MySQL will run at the same time).
I did quite a few experiments, and eventually found that 40 is about right. This is a VM running on a high powered Dell R710, so although the OS only thinks it has 4 CPUs I don't know how things actually pan out. All I know is that the number of records/sec increases pretty much linearly with --recorders up until 40. E.g. if I run at 32, I get more like 200r/sec rather than 250+r/sec. A single recorder manages around 6-7r/sec. After 40 the benefits tail off.
I also tried a few experiments to see where the bottleneck might lie, for instance I stuck in a mod_rewrite to send the importer to a basic PHP file that just returned the .gif without doing any processing, but weirdly the performance was about the same. However, running with --dry-run (or just removing the line which actually calls the script) means the python script runs at around 4000r/sec, so I can only conclude the limit is in apache/php (putting in APC definitely helped). I also tried hacking the script to run a PHP wrapper script that called piwik.php directly on the command line, but it went horribly slowly, presumably because of the lag in loading up PHP.
Anyway, I'm happy with 250-300r/sec. I may set up a separate VM with a tweaked kernel and optimised apache to deal with log imports anyway, so I'm sure I can improve on that figure.
Regarding the steady tailing-off, what I'm wondering is: when you specify lots of recorders, do they each grab an equal number of log lines at the start and then work through them? That would explain why some finish earlier than others (e.g. if one gets a lot of non-loggable lines it would finish sooner). I notice the number of apache processes starts tailing off around half to 2/3 of the way through the log, and then just steadily declines until only 1 recorder is left.
Regarding the static files excluded, we'll add an option to include those (disabled by default). I'm sure the whole importing process will get better over time, it's only the beginning :)
That'd be brilliant, thank you. Thanks to you all for being so responsive in general too.
Oliver.
FYI the new 1.7.2-rc4 was released which includes the most up to date code: Download from: http://builds.piwik.org/?C=N;O=D
oliverhumpage, thanks for your comments, they're very interesting!
Since you seem keen, maybe you can consider running XHProf, the Facebook PHP profiler: http://pecl.php.net/package/xhprof
I haven't run that for a long time, and never under a high load such as 300 req/s, so it would be very interesting. If you install it, I would love to see the reports generated! The last time we ran XHProf on Piwik we found 2-3 quick fixes that made things a lot better. I'm sure we can make the tracker faster in many ways.
It would also be good to know the % of consumption of Apache/PHP vs. MySQL (not sure of the best way to do this, however).
oliverhumpage: regarding the recorders, each request will be dispatched to a specific recorder based on its IP address. It means that if the IP address distribution of your log files isn't "even", some recorders will have more work to do than others. Which could explain the performance issues you're having, especially near the end of the import process.
This dispatching was required to make sure requests are imported in the correct order.
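To illustrate why an uneven IP distribution leaves some recorders idle, here is a minimal sketch of IP-based dispatching. The hashing scheme (CRC32 modulo the recorder count) and the function names are assumptions for illustration; the thread only states that each request goes to a recorder chosen from its IP address.

```python
import zlib
from collections import Counter


def recorder_for(ip, recorder_count):
    """Pick a recorder index from the client IP.

    The same IP always maps to the same recorder, which keeps that
    visitor's requests in their original order.
    (CRC32 is an assumption here, not necessarily what the script uses.)
    """
    return zlib.crc32(ip.encode("ascii")) % recorder_count


def load_per_recorder(ips, recorder_count):
    """Count how many hits each recorder would receive."""
    counts = Counter(recorder_for(ip, recorder_count) for ip in ips)
    return [counts.get(index, 0) for index in range(recorder_count)]


# Hypothetical traffic: one very busy IP and three quiet ones.
hits = ["10.0.0.1"] * 8 + ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
load = load_per_recorder(hits, 4)
print(load)
```

Whatever the exact hash, the recorder that owns the busy IP keeps working long after the others have drained their queues, which matches the tail-off described above.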
Actually, I do have one small request for piwik itself.
Would it be possible to choose on the fly between multiple database options: you see, I'm using one physical install of piwik at 2 different URLs - one for JS-based sites, and one for log-based, and therefore also 2 different sets of db tables so that --add-sites-new-hosts on the log-based system doesn't interfere with the JS websites (they'd have the same URLs). What I've done atm is set an environment var in apache and patch core/Config.php to set $config->database to either $config->database_weblog or $config->database_js depending on that env var.
However, being able to define a constant like DATABASE_CONFIG_SECTION_NAME in bootstrap.php, which Config.php then used to work out which section of the config file to use, would be much easier and more robust. I could of course just have 2 different installs of piwik, but then I have to update it twice with each release. Probably not worth enlarging the codebase just for my weird setup, but thought I'd ask - I can easily submit a patch if you're interested.
(In [6092]) Refs #703 import-logs.py renamed to import_logs.py and added a mini test suite which tests the format autodetection.
(In [6093]) Refs #703 Many improvements:
(In [6094]) Refs #703 Added option --output to redirect output to a file.
(In [6100]) Refs #703
(In [6102]) Refs #703 Add license notice, Shuffle help messages order, remove short notation for clarity, improve help messages, adding Java/ + bot- + bot/ + robot as a bot
(In [6108]) Refs #703 I'm learning Python (NOT!)
(In [6128]) Refs #703 Now works with Python 2.5.
(In [6129]) Refs #703 Show the summary when CTRL+C is pressed.
(In [6130]) Refs #703 Fixed bug with --log-format-regex (thanks oliverhumpage).
(In [6131]) Refs #703 Disable buffering when using --output.
(In [6132]) Refs #703 Added --query-string-delimiter
(In [6133]) Refs #703 Added --enable-http-errors and --enable-http-redirects
(In [6134]) Refs #703 Pretty print archives dates.
oliverhumpage: thanks for the bug report and the suggestions, I should have committed everything you asked for :)
Regarding the persistent connections, I haven't patched anything. It's a builtin feature of PHP/mysqli, see:
http://www.php.net/manual/en/mysqli.construct.php
"Prepending host by p: opens a persistent connection."
(In [6135]) Refs #703
(In [6137]) Refs #703 README update + fixing --enable-reverse-dns now works + adding common bot names
(In [6140]) Refs #703 Catch URL exceptions during configuration
(In [6155]) Adding advanced use case in the README. Thanks Oliver for your help and submission!! Refs #703
Cyril:
Have tested using - instead of /dev/stdin, seems to work fine.
Re the regex, I think that's explained in the comments: because I want it to pick up hostnames that are subsites and so have slashes (e.g. I want the hostname 'domain.com/subsite' to be picked up and created with that name in piwik), I needed to amend the normal vhost regex to allow "/" in the host character class. It's also a very good example of what and how to escape shell special characters in apache log pipes :)
(I spent a fun morning with a test apache installation and a perl script testing each special character in turn until I got it working... then a fun afternoon wondering why it wasn't working with import_logs.py, until I realised there wasn't a .compile for the custom regex!)
I did originally put things like "domain.com.subsite" in the hostname so the standard regex would work, but it looks ugly and non-user-friendly in piwik.
(In [6157]) refs #703 updating README as per feedback. please comment if the code does not work I haven't tested myself
Updated ticket with suggestions to improve script performance (i.e. we should bulk send 50 requests at once in a single POST, to make 50 times fewer HTTP requests)!
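As a rough sketch of the batching idea only: the generator below groups hits into lists of up to 50, so that each list could be sent in one HTTP request. The function name and the sample paths are hypothetical, and the payload format of a bulk tracking request is not specified in this ticket, so it is left out.

```python
def batches(hits, size=50):
    """Yield successive lists of at most `size` hits."""
    batch = []
    for hit in hits:
        batch.append(hit)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


# Hypothetical hits; in the real script these would be parsed log lines.
hits = ["/page/%d" % number for number in range(120)]
groups = list(batches(hits, 50))
print(len(hits), "hits ->", len(groups), "requests")
```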
Just regarding the persistent database connections: using "p:localhost" only works for mysqli after PHP 5.3. It didn't work for me since we're still on 5.2 (going to upgrade soon...).
Matt: that should do it I guess. I'll try to make the changes ASAP.
Sending bulk requests would be great, I'm sure that would improve the performance a lot!
The script doesn't parse IIS6 or IIS7 log files (not tried IIS8). I tried the following regex that matches the log lines in kiki but no luck with the script. Any pointers?
Also minor change to line 1068
' the --format option'
needs updating to
' either the --log-format-name or --log-format-regex option'
_IIS6_FORMAT = (
    r'(?P<date>^\d+[-\d+]+ [\d+:]+) '
    r'\S+ \S+ [\d*.]+ \S+ '
    r'(?P<path>\S+) '
    r'\S+ \d+ \S+ '
    r'(?P<ip>[\d*.]*) '
    r'\S+ '
    r'(?P<user_agent>\S+) '
    r'\S+ '
    r'(?P<referrer>\S+) '
    r'\S+ '
    r'(?P<status>\d+) '
    r'\S+ \S+ '
    r'(?P<length>\S+)'
)
+1
The only way I got it to work with IIS7 (just to test out the script) was to convert it to ncsa extended.
(In [6165]) Refs #703 Fixed bug: stats.piwik_sites should not have None items.
(In [6166]) Refs #703 Only show tips in summary if necessary.
(In [6167]) Refs #703 Added --exclude-path and --exclude-path-from.
(In [6168]) Refs #703 Replaced tabs with spaces.
Replying to matt:
Can you please post example log format that does not work ?
About 20 lines from two logs.
(In [6169]) Refs #703 Added custom variable Not-Bot.
(In [6170]) Refs #703 Updated error string.
Trunk version works well also without --idsite-fallback.
For some years I have been looking for an alternative to AWStats, and with your import script I think I've found it. Great work so far.
But I have trouble with the log files. We use a Lotus Notes cluster, and for each server in the cluster we have a separate log file per day.
The import is working, but the result isn't right, and I think it's because of the log file format.
It looks like this:
192.168.1.1 bene.com - +0200 "GET /mobiliario-de-oficina/news-filo-design-preis-2009.html HTTP/1.1" 200 20719 "" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)" 1453 "" "D:/Notes/Data/benecom/cont_es.nsf"
In AWStats I can describe the log file format like this:
LogFormat=%host %virtualname %lognamequot %time1 %methodurl %code %bytesd %refererquot %uaquot %other %other %other
HTTP errors and redirects are also not detected:
17758 requests imported successfully
542 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
See more log at: http://pastebin.com/zSMXqEpu
Traceback (most recent call last):
File "C:\Python26\lib\threading.py", line 522, in __bootstrap_inner
self.run()
File "C:\Python26\lib\threading.py", line 477, in run
self.__target(*self.__args, **self.__kwargs)
File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 756, in _run
self._record_hit(hit)
File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 794, in _record_hit
'url': main_url + hit.path[:1024],
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 148: ordinal not in range(128)
I just faced a situation where the first line in an nginx log is an invalid request:
110.164.252.2 - - [14/Apr/2012:06:26:37 +0200] "-" 400 0 "-" "-"
I know which log format I'm using and I'll just use --log-format-name ncsa_extended
but maybe the script could try several lines before giving up?
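A minimal sketch of what "try several lines before giving up" could look like. The two regexes, the `detect_format` helper and the `max_lines` limit are illustrative assumptions, not the patterns or names used by import_logs.py; the second sample line is invented.

```python
import re

# Simplified patterns written for this example only.
_COMMON = (r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
           r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<length>\S+)')
FORMATS = {
    "common": re.compile(_COMMON + r'$'),
    "ncsa_extended": re.compile(
        _COMMON + r' "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$'),
}


def detect_format(lines, max_lines=10):
    """Return the first format name matching any of the first lines."""
    for line in lines[:max_lines]:
        # Try the most specific format first.
        for name in ("ncsa_extended", "common"):
            if FORMATS[name].match(line.rstrip("\n")):
                return name
    return None


sample = [
    # The invalid request reported above.
    '110.164.252.2 - - [14/Apr/2012:06:26:37 +0200] "-" 400 0 "-" "-"',
    # An invented, well-formed line that follows it.
    '110.164.252.3 - - [14/Apr/2012:06:26:40 +0200] '
    '"GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(detect_format(sample))
```

With a single-line check, the malformed first request would make detection fail; scanning a handful of lines lets the next valid request decide.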
matt: that's because there are UTF8 characters in the logs, and the script expects the logs to be plain ASCII. I suggest we add a new option --encoding that allows to specify if the log files are in a specific encoding rather than ASCII, what do you think?
Replying to Cyril:
matt: that's because there are UTF8 characters in the logs, and the script expects the logs to be plain ASCII. I suggest we add a new option --encoding that allows to specify if the log files are in a specific encoding rather than ASCII, what do you think?
That sounds nice, but would it be possible to test both ASCII and UTF-8 automatically when such decoding error occurs?
I suppose most logs these days are in UTF8 so it would be nice to work by default :)
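One possible way to try several encodings automatically, sketched under the assumption that each log line is available as raw bytes. The `decode_line` helper and its fallback policy are hypothetical; the code is written for Python 3.

```python
def decode_line(raw, encodings=("ascii", "utf-8")):
    """Decode a raw log line, trying each encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the line and mark the undecodable bytes.
    return raw.decode("utf-8", "replace")


# Valid UTF-8 bytes, as in the 0xc3 error reported above.
print(decode_line(b"/caf\xc3\xa9.html"))
# Bytes that are not valid UTF-8 still produce a usable string.
print(decode_line(b"/\xed\xe5.html"))
```

Since ASCII is a subset of UTF-8, trying UTF-8 alone would give the same result for the first two steps; the explicit list just makes the order visible.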
I'm having problems with the script which enters an infinite loop
6846 lines parsed, 1919 lines recorded, 0 records/sec
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 860, in _run
self._record_hit(hit)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 924, in _record_hit
headers={'User-Agent' : hit.user_agent},
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 694, in call
return self._call_wrapper(self._call, expected_content, path, args, headers)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 676, in _call_wrapper
response = func(*args, **kwargs)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 625, in _call
response = urllib2.urlopen(request)
File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.6/urllib2.py", line 391, in open
response = self._open(req, data)
File "/usr/lib/python2.6/urllib2.py", line 409, in _open
'_open', req)
File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib/python2.6/urllib2.py", line 1178, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.6/urllib2.py", line 1143, in do_open
r = h.getresponse()
File "/usr/lib/python2.6/httplib.py", line 990, in getresponse
response.begin()
File "/usr/lib/python2.6/httplib.py", line 391, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.6/httplib.py", line 355, in _read_status
raise BadStatusLine(line)
BadStatusLine
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
6846 lines parsed, 1919 lines recorded, 0 records/sec
...
and it keeps repeating
6846 lines parsed, 1919 lines recorded, 0 records/sec
The offending line 1920 in my nginx log is:
62.83.238.31 - - [17/Apr/2012:12:53:35 +0200] "-" 400 0 "-" "-"
matt: defaulting to UTF8 is indeed better, since ASCII is UTF8 compatible anyway.
(In [6212]) Refs #703 Catch httplib exceptions raised by urllib2.
guardian: I don't know why you're getting this BadStatusLine exception, but the latest commit at least detects it correctly. Can you try again?
(In [6213]) Refs #703 Added --encoding, defaults to UTF8.
Hi all,
I have big problems with the import: I'm only importing 15-20 lines per second, and I need to import more than 20 million lines. Do I really need more than 10 days to import these logs?
I tried the command with --dry-run, but that doesn't insert the lines into the database; it only checks that the command works.
How can I improve the import speed? (I also tried --recorders=16, but it doesn't work well. I only have 2 CPUs, 4 cores.)
Is there a guide I can follow to improve the import speed?
Thanks in advance for your feedback.
I'm using this command:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://localhost/piwik access_log.0 --idsite=2 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-reverse-dns --enable-bots
6846 lines parsed, 215 lines recorded, 2 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
6846 lines parsed, 215 lines recorded, 0 records/sec
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 865, in _run
self._record_hit(hit)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 929, in _record_hit
headers={'User-Agent' : hit.user_agent},
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 699, in call
return self._call_wrapper(self._call, expected_content, path, args, headers)
File "/usr/local/share/www/piwik/misc/log-analytics/import_logs.py", line 691, in _call_wrapper
message = e.reason
AttributeError: 'HTTPError' object has no attribute 'reason'
It's the very same log file, which shows that it's not an invalid line causing the problem. So far I don't know why it returns an empty reply; the nginx and php-fpm logs remain empty about that.
(In [6215]) Refs #703 Circumvent a Python bug with urllib2.HTTPErrors.
(In [6216]) Refs #703 Call close() manually instead of relying on the garbage collector. It seems to help reducing the concurrent connections count.
Thank you Cyril,
Closing the connection indeed mitigates the problem. Still, I'm facing SIGSEGV when using the import script. So far Piwik is the only PHP software causing SIGSEGV on my servers. I'm investigating. As per Piwik's wiki I disabled APC, but that didn't help.
I tried the importer, but it is too slow for real web sites with huge log files. Just an idea... how about rewriting it in PHP to avoid slow HTTP requests, using Piwik's internal classes instead?
If I want to try to do it, do you have any suggestions?
Thank you
asterixcapri: what's your import speed? What's the limiting factor? Python or PHP? How large are your log files?
The Python import script can easily max out 8 Piwik PHP processes on my machines, so I doubt the HTTP requests represent a large overhead. Besides, an easy way to reduce that overhead would be to aggregate hits in a single request, which is already planned.
julianito, 15 req per second really is too slow. What kind of server do you use? Is it busy doing other things or is it mostly idle? Piwik is pretty IO intensive.
What req/s do you get in "dry-run"? it should be very high since Piwik does not do any http request then (this would help making sure the problem is with the http requests)
Just wanted to add that I get really bad speeds unless I ramp up the recorders way beyond the number of cores: as I may have said above, I get to about 40 (on a 4 core virtual machine) before noticing no further improvements. There is definitely a lag somewhere in the HTTP requests or MySQL connections on some setups. There is also the issue of the "tail end", noted above, where the number of recorders slowly drops off as IPs run out (I did try altering the script to give new IPs to the recorder with the lowest workload, but it didn't make much odds).
Having a PHP script that talked to the piwik system directly, instead of via http requests, would likely speed things up hugely for all users.
Could I check something? There's an --enable-static option, which I think you put in at my request (thanks!). I've noticed it seems to put static files (.jpg, .css, .js etc) into both Downloads and Pageviews. I seem to remember without --enable-static they went into neither.
Is it possible to have static files only put into Downloads? I'm not running the very very latest version, so apologies if this is already fixed, but I didn't see it in a changelog.
I also find Piwik is very slow per se :/
But matt saying it's really IO intensive made me think about my PHP-FPM configuration again. The pool used by Piwik has:
And I guess open_basedir is part of the explanation for why Piwik is so slow on my setup. I'm now using database session storage, and using Piwik feels faster already.
The script works fine as an import, with one exception. We use Varnish/Pound proxies in front of the websites and have to pass the incoming IP via the X-Forwarded-For header.
Is there a way to have that picked up in place of the %v?
Example log format: "\"%{X-Forwarded-For}i\" %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
tgrondin: use mod_rpaf http://stderr.net/apache/rpaf/
From email report:
I'm trying to use import_logs.py on my Apache logs. I have a problem because I use HAProxy, and the script does not seem to accept my regex for the following log format:
LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
X.com 90.28.198.22 - - [11/Apr/2012:20:52:12 +0200] "GET /index.php HTTP/1.1" 200 - "http://www.X.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0) Gecko/20100101 Firefox/10.0"
I launch the script:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://127.0.0.1/piwik /var/log/apache2/access_webmail.log --log-format-regex "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" --add-sites-new-hosts --recorders=1
0 lines parsed, 0 lines recorded, 0 records/sec
Parsing log /var/log/apache2/access_webmail.log...
Is it the same problem as the IIS log, or maybe a different problem? or is the --log-format-regex wrong maybe?
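One likely cause is that the command above passes an Apache LogFormat string, while the custom format discussed earlier in this thread is a Python regular expression with named groups. The sketch below is an untested guess at such a regex for the sample line above: the group names date, path, ip, user_agent, referrer, status and length come from the IIS regex posted earlier; `host` is an assumption, and whether the script accepts exactly these names should be checked against import_logs.py.

```python
import re

# Hypothetical regex for "%v %{X-Forwarded-For}i %l %u %t ..." lines.
VHOST_XFF = re.compile(
    r'(?P<host>\S+) '            # %v, the virtual host
    r'(?P<ip>\S+) '              # %{X-Forwarded-For}i, the client IP
    r'\S+ \S+ '                  # %l %u
    r'\[(?P<date>[^\]]+)\] '     # %t
    r'"\S+ (?P<path>\S+) [^"]*" '  # %r: method, path, protocol
    r'(?P<status>\d+) '          # %>s
    r'(?P<length>\S+) '          # %b ("-" when no body was sent)
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)

line = ('X.com 90.28.198.22 - - [11/Apr/2012:20:52:12 +0200] '
        '"GET /index.php HTTP/1.1" 200 - "http://www.X.com/" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0) '
        'Gecko/20100101 Firefox/10.0"')

match = VHOST_XFF.match(line)
print(match.group("host"), match.group("ip"),
      match.group("path"), match.group("status"))
```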
Replying to Cyril:
asterixcapri: what's your import speed? What's the limiting factor? Python or PHP? How large are your log files?
The Python import script can easily max out 8 Piwik PHP processes on my machines, so I doubt the HTTP requests represent a large overhead. Besides, an easy way to reduce that overhead would be to aggregate hits in a single request, which is already planned.
I created a ticket for this specific performance improvement: #3134
I access piwik at www.mydomain.com/piwik (so I can use mydomain.com SSL certificate) but
the import script calls www.mydomain.com/piwik/piwik.php, which generates log entries, which are then imported, which generates more log entries, and so on. How can I avoid being stuck in a loop like that?
ma2thieu, good point, we should probably deal with this issue in the script itself
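A sketch of how the loop could be avoided by dropping requests for Piwik's own URLs before they are recorded. A commit earlier in this ticket added --exclude-path; how that option matches paths is not described here, so the shell-style matching, the helper name and the pattern below are assumptions.

```python
from fnmatch import fnmatchcase


def is_excluded(path, patterns):
    """Return True if the request path matches any excluded pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Hypothetical pattern covering the Piwik install at /piwik.
patterns = ["/piwik/*"]
paths = [
    "/index.html",
    "/piwik/piwik.php?idsite=1&rec=1",
    "/piwik/index.php",
]
kept = [path for path in paths if not is_excluded(path, patterns)]
print(kept)
```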
The last known important bug is the IIS log parsing.
Apart from that, is everyone here happy with the script as it is?
Apart from performance, which can be slow for some of you, is the script ready for prime time?
Thanks for your feedback
Hi, I want to import Lotus Domino logs, but I get this error when I launch import_logs.py:
# python26 /var/www/html/piwikbeta/misc/log-analytics/import_logs.py --url=http://www.dominux.fr/dominux/blog.nsf /usr/local/domino/domzi/weblogs/access.log --idsite=1234 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots
Fatal error: Piwik returned an invalid response: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
:
Any idea ?
PS: sorry I can't post all the error because my post are rejected : too many external urls ...
Hi, I tried to import apache logs into an empty Piwik database. The log-file has a lot of different hosts, that should be added automatically.
./import_logs.py --login=user --password=pwd --add-sites-new-hosts --url=http://site/piwik ./apache.log
However, the scripts shows the following error and won't exit.
Purging Piwik archives for dates: 2012-05-10
Traceback (most recent call last):
File "./import_logs.py", line 1221, in <module>
main()
File "./import_logs.py", line 1197, in main
Recorder.invalidate_reports()
File "./import_logs.py", line 949, in invalidate_reports
idSites=','.join(stats.piwik_sites),
TypeError: sequence item 2: expected string or Unicode, int found
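The traceback above is what `str.join` raises when the sequence contains a non-string item. A small reproduction, with an invented list of site IDs, and one possible way around it (the actual fix committed to the script may differ):

```python
# Hypothetical mix of string and integer site IDs.
sites = ["1", "2", 3]

try:
    ",".join(sites)
except TypeError as error:
    print("join failed:", error)

# Converting every item to a string first avoids the error.
id_sites = ",".join(str(site) for site in sites)
print(id_sites)
```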
@Dominux: the --url should point to your Piwik base URL; it seems to be pointing somewhere else.
@oliver and all other users who have problems parsing specific logs. It would be great if you could look at the import_logs.py file and try to patch it to add the format of your logs. Then please submit the patch here once your logs are parsed, we will add it.
The code is simple to understand at the start of: https://github.com/piwik/piwik/blob/master/misc/log-analytics/import_logs.py
Patch to enable IIS6/7/7.5 log imports, probably needs some work as I don't do python!
IIS can have varied data in the logs, this should work with default log settings for 6/7/7.5 and when selecting all options from IIS6. Log format in default is W3C. No timezone offset in my logs, so TZ = 0.
Seems to work. I am in the process of importing logs: 1.5M lines parsed at present, 1.47M recorded.
@tiouk, great thanks for the patch!
Can you please confirm you tested your patch on the 3 IIS log formats?
I will test & commit after your confirmation, your patch is very appreciated!
Replying to matt:
Can you please confirm you tested your patch on the 3 IIS log formats?
I have tested the iis6_w3c_all on real logs. I think there may be an issue with IIS7/7.5: although they work, the logs I tested against appear to have extra options selected, and I can't find a definitive default format on the MS site apart from the one for IIS6. I have some tweaked format regexes and will post them when tested.
Can anyone using IIS please post the first 5 lines from a few logs and state whether you have added any non-default options to the logging?
IIS logs separate the query string from the base URL; I haven't addressed that.
(In [6260]) Refs #703 Fixed bug when invalidating reports with --add-sites-new-hosts.
oliver: thanks for the bug report, can you try again? That should be fixed.
tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.
Can you provide some log lines for each format? I'll put them in the tests/logs directory which provides automatic testing for each log format.
(In [6261]) Refs #703 Typo (renamed file logs/ncsa_extended.log)
Replying to Cyril:
tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.
Can you provide some log lines for each format? I'll put them in the tests/logs directory which provides automatic testing for each log format.
Will see if I can do tomorrow. There's an updated file on the same URL as the old patch, it has a bit of work to skip lines in an IIS log with --check-iis-logs-format and displays the log options line in --debug along with updated regexs. Cheers Mike
Typo above, the IIS log detection should be --check-iis-log-option
Short IIS6 log with all options:
http://mike.org.uk/iis6_all_options.txt
Replying to Cyril:
tiouk: thanks. Unless matt insists on doing so, I'd like to commit your patch myself as I'd like to refactor it a bit.
Please commit after checking all is working well I'm glad you're back :)
Above diff 137 has been updated, small regex changes as status only recorded 2 digits in some log types.
IIS7.5 Default short log (Has extra header lines due to IIS restart and 3 lines IPV6 as invalid log lines both could appear in live logs)
IIS logs the cs-uri-stem & cs-uri-query separately, do they need concatenating?
Replace \S+ after the <path> match with '(?P<querystr>\S+) ' in the regex's
line 1199
if config.options.check_iis_log_format:
    hit = Hit(
        filename=filename,
        lineno=lineno,
        status=match.group('status'),
        full_path=match.group('path') + config.options.query_string_delimiter + match.group('querystr'),
        is_download=False,
        is_robot=False,
        is_error=False,
        is_redirect=False,
    )
else:
    hit = Hit(
        filename=filename,
        lineno=lineno,
        status=match.group('status'),
        full_path=match.group('path'),
        is_download=False,
        is_robot=False,
        is_error=False,
        is_redirect=False,
    )
Hi, I got following error during importing log files (apache):
419375 lines parsed, 6906 lines recorded, 57 records/sec
419703 lines parsed, 6960 lines recorded, 54 records/sec
419896 lines parsed, 6976 lines recorded, 16 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
Fatal error: didn't receive the expected response. Response was <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>Piwik › Error</title>
<meta http-equiv="Conte..
You can restart the import of "/opt/piwiktests/logfile.gz" from the point it failed by specifying --skip=296312 on the command line.
with debug:
2012-05-13 12:17:59,837: [DEBUG] Error when connecting to Piwik: <urlopen error didn't receive the expected response. Response was <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"SAME URL as above">
<html>
<head>
<title>Piwik › Error</title>
<meta http-equiv="Conte.. >
The end of the log file shows:
5190432 lines parsed, 759119 lines recorded, 28 records/sec
5190432 lines parsed, 759147 lines recorded, 28 records/sec
Purging Piwik archives for dates: 2012-05-12
2012-05-13 13:49:58,052: [DEBUG] Error when connecting to Piwik: <urlopen error Piwik returned an invalid response:
<div style='word-wrap: break-word; border: 3px solid red; padding:4px; width:70%; background-color:#FFFF96;'>
<strong>There is an error. Please report the message and full backtrace in the <a href='?module=Proxy&action=redirect&url=http://forum.piwik.org' target='_blank'>Piwik forums</a> (plea>
I'm running the script on a 12 Core (24 with HT) server with this command:
python import_logs.py --url=http://piwiktest.aspectra.com/piwiklog/ --idsite=6 --recorders=12 --output=logtest_20120513.log --skip=358639 -d -d /opt/piwiktests/logfile.gz &
Sometimes the script recovers from the errors and continues recording; sometimes it stops working and shows the --skip= option. As far as I can see, the script stops working if the error occurs 4 times in a row. Is this some kind of timeout, and can it be configured?
Best regards,
Andr
Diff for auto-detecting IIS logs based on log header line 4; this should be able to decode any IIS log file, whatever options are selected. One proviso: it does need both the date and time options, which can be deselected. However, without these options it is not possible to generate stats, unless you don't mind all site visits occurring at the same time!
http://mike.org.uk/import_logs_py_diff_2.txt
Ready patched file based on version 6170 for people to test on different IIS formats, let me know how you get on. http://mike.org.uk/import_logs_py.txt (rename to .py)
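For readers who want to see the idea behind header-based detection, here is a minimal sketch. It is not tiouk's patch: `parse_w3c_lines` is a hypothetical helper that reads the `#Fields:` header of a W3C/IIS log and maps each following line onto those field names. It assumes fields are separated by single spaces and simply skips lines whose column count does not match the header.

```python
def parse_w3c_lines(lines):
    """Yield one dict per log line, keyed by the names in the #Fields: header.

    Hypothetical helper for illustration only; the real script may use a
    different (regex-based) approach.
    """
    fields = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#Fields:"):
            # The header lists the columns that IIS was configured to log.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or fields is None:
            # Skip blank lines, other directives, and data before any header.
            continue
        values = line.split(" ")
        if len(values) == len(fields):
            yield dict(zip(fields, values))
```

A caller would still need to check that both `date` and `time` are present in the header before recording hits, as noted in the proviso above.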
Replying to tiouk:
Ready patched file based on version 6170 for people to test on different IIS formats, let me know how you get on. http://mike.org.uk/import_logs_py.txt (rename to .py)
USE ON TEST SYSTEM ONLY, NOT LIVE. If you must, use --debug to check first!
Patch 6213 applied.
Error generated by +++++; in iis log.
File "/usr/lib64/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xed in position 154: invalid continuation byte
2012-05-03 04:29:30 W3SVC230092276 SERVER1 172.16.65.25 GET /Discussions/tabid/56/forumid/1/postid/36/scope/posts/language/en-GB/Default.aspx+Result:+++++; - 80 - 94.153.71.194 HTTP/1.0 Mozilla/0.6+Beta+(Windows) - http://xxx-xxxxxx.net/Discussions/tabid/56/forumid/1/postid/36/scope/posts/language/en-GB/Default.aspx+Result:+%ED%E5+%ED%E0%F8%EB%EE%F1%FC+%F4%EE%F0%EC%FB+%E4%EB%FF+%EE%F2%EF%F0%E0%E2%EA%E8; xxx-xxxxxx.net 404 0 2 1814 715 62
Above patches tweaked for the add-sites-new-hosts option, so please download again.
Issue with option --add-sites-new-hosts.
The following are created as different hosts, they should all be considered as the same, this is with IIS log files with auto IIS log patch.
domain.dom
www.domain.dom
www.domain.dom:80
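One way the three hostnames above could be collapsed into a single site is sketched below. This is only an illustration: `normalize_host` is a hypothetical function, and both stripping a leading "www." and dropping the default ports 80 and 443 are assumptions about what "the same host" should mean, not behaviour confirmed for the script. IPv6 literals are not handled.

```python
def normalize_host(host, default_ports=("80", "443")):
    """Return a canonical form of a hostname taken from a log line.

    Hypothetical helper: lowercases the name, drops a default port and
    strips a leading "www." so that variants map to one site.
    """
    host = host.strip().lower()
    name, sep, port = host.rpartition(":")
    if sep and port in default_ports:
        host = name
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

With this, `domain.dom`, `www.domain.dom` and `www.domain.dom:80` all normalize to `domain.dom`, while a non-default port such as `:8080` is kept.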
Thanks for your work we'll commit the patch soon, please keep posting if you improve it!
To all users with IIS logs please test @tiouk patch above if you can, thanks
Diff for IIS patch based on latest version 6260 from trunk.
http://mike.org.uk/import_logs_py_diff_3.txt
Full IIS-patched version 6260: this is the latest version from trunk with Cyril's latest committed fixes. Please use it to test IIS log import, with the usual live-system warning, although I have used it on mine!
http://mike.org.uk/import_logs_py.txt Just rename to .py
How do you want to handle decode errors? I have a couple of logs that bomb due to decode errors on a single line; I was thinking of just invalidating the line.
I replaced line 1127:
    line.decode(config.options.encoding)
with
    try:
        line = line.decode(config.options.encoding)
    except UnicodeError as err:
        # Unicode decode error: the line is badly formatted.
        logging.debug('Unicode decode error ' + line)
        logging.debug(err)
        invalid_line(line)
        continue
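The same idea as a self-contained sketch, for anyone who wants to try it outside the script. `decode_lines` is a hypothetical helper (not part of import_logs.py) written for Python 3, whereas the script itself targets Python 2. It decodes raw byte lines and counts the ones that fail instead of aborting.

```python
import logging


def decode_lines(raw_lines, encoding="utf-8"):
    """Decode byte lines, skipping those that are not valid in `encoding`.

    Returns (decoded_lines, invalid_count). Hypothetical helper for
    illustration; the script calls invalid_line() instead of counting.
    """
    decoded = []
    invalid = 0
    for raw in raw_lines:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeError as err:
            # The line is badly encoded: log it and move on to the next one.
            logging.debug("Unicode decode error on %r: %s", raw, err)
            invalid += 1
    return decoded, invalid
```

Choosing a single-byte encoding such as latin-1 makes every line decodable, at the cost of possibly mis-rendering characters, so which option is better depends on the logs.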
Replying to aspectra:
Hi, I got following error during importing log files (apache):
419375 lines parsed, 6906 lines recorded, 57 records/sec
419896 lines parsed, 6976 lines recorded, 0 records/sec
I changed the following constants:
PIWIK_MAX_ATTEMPTS = 9
PIWIK_DELAY_AFTER_FAILURE = 5
The errors are still occurring but the script is able to reconnect and does not exit.
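To make the effect of those two constants concrete, here is a rough sketch of a retry loop. It is not the script's code: `call_with_retries` and its parameters are hypothetical names, with defaults mirroring the values used in the post above.

```python
import time


def call_with_retries(func, max_attempts=9, delay=5, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts failures have occurred.

    Hypothetical helper: max_attempts and delay play the roles of
    PIWIK_MAX_ATTEMPTS and PIWIK_DELAY_AFTER_FAILURE.
    """
    errors = 0
    while True:
        try:
            return func()
        except Exception as exc:
            errors += 1
            if errors >= max_attempts:
                # Too many consecutive failures: give up and let the caller
                # report where the import stopped.
                raise RuntimeError(
                    "giving up after %d attempts: %s" % (errors, exc)
                )
            # Wait before retrying so a struggling server can recover.
            sleep(delay)
```

Raising the limits only hides the underlying errors; each retried request still costs at least `delay` seconds of import time.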
Hi,
I have played around with the logfile importer in Piwik 1.7.2rc8. I was surprised to see a lot of static files in the results, as I had not passed --enable-static on the command line.
I checked the logfile and looked at the importer code, and found that many static files of (at least) Typo3 websites are not recognized when Typo3 suffixes them with ?timestamp, because the importer regex only checks the end of the file name (e.g. typo3temp/javascript_0b12553063.js?1283017207).
Can this be adjusted? Would be great!
Many thanks and a nice weekend
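A possible adjustment, sketched under assumptions: strip the query string before looking at the file extension. `is_static` and the extension list are hypothetical and only illustrate the check; they are not the importer's actual regex or its real list of static extensions.

```python
import posixpath

# Illustrative subset only; the importer's real list may differ.
STATIC_EXTENSIONS = {"css", "js", "gif", "jpg", "jpeg", "png", "ico"}


def is_static(path, delimiter="?"):
    """Return True if the path points to a static file.

    The query string (e.g. a cache-busting timestamp) is removed first,
    so "file.js?1283017207" is still recognized as a .js file.
    """
    path = path.split(delimiter, 1)[0]
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in STATIC_EXTENSIONS
```

With this check, `typo3temp/javascript_0b12553063.js?1283017207` counts as static, while `/index.php?id=5` does not.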
Version 6260 with IIS patch.
Previously decoded line generates error when posting to Piwik.
2012-05-16 14:15:08,670: [DEBUG] Error when connecting to Piwik: 'ascii' codec can't encode characters in position 126-128: ordinal not in range(128)
Raw log line:
2010-09-29 07:00:25 W3SVC3 172.16.65.22 GET /xxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxxx/tabid/129/language/en-US/Default.aspx+Result:++++/+++(++++)
args from def _call_wrapper(self, func, expected_response, *args, **kwargs):
('/piwik.php', {'cdt': '2010-09-29 07:00:25', '_cvar': u'{"1":["Not-Bot","Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Deepnet+Explorer+1.5.0;+.NET+CLR+1.0.3705)"]}', 'apiv': '1', 'cip': u'173.75.247.80', 'urlref': '', 'token_auth': u'b13583ea29ddb846f48d1ea1721a8eac', 'idsite': '3', 'url': u'http://xxxxxxxxxxx.xxx.xx/xxxxxxxxx/xxxxxxxxxxxxxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxxx/tabid/129/language/en-US/Default.aspx+Result:+\xfd\xf2\xee+\xed\xe5+\xf4\xee\xf0\xf3\xec+/+\xe3\xee\xf1\xf2\xe5\xe2\xe0\xff+\xea\xed\xe8\xe3\xe0+(\xeb\xe8\xe1\xee+\xee\xf2\xf1\xf3\xf2\xf1\xf2\xe2\xf3\xe5\xf2+\xef\xee\xe4\xea\xeb\xfe\xf7\xe5\xed\xe8\xe5+\xea+\xe8\xed\xf2\xe5\xf0\xed\xe5\xf2\xf3)?-', 'rec': '1', 'dp': '0'}, {'Content-type': 'application/x-www-form-urlencoded', 'User-Agent': u'Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Deepnet+Explorer+1.5.0;+.NET+CLR+1.0.3705)'})
(In [6268]) Refs #703 Do not crash on encoding errors (thanks invalid_line(line)
Oops, I screwed up my commit message, I meant to say "thanks tiouk" (bad copy/paste). Centralized version control software sucks :)
(In [6269]) Refs #703 Added support for IIS logs, thanks to tiouk.
IIS parsing is now supported in trunk. I had to refactor quite a bit of code, so I strongly suggest that everyone test the script again, as I may have introduced bugs.
The patch was inspired by tiouk's own diff, thanks to him, but the code itself is quite different as I wanted to have a more generic approach. There are no new options, IIS is expected to work just like other log formats.
Nice work, glad to see you've integrated rather than included as an addon.
Seems to work fine so far with logs that I know worked with the old patched version, but I still get the issue from post 160: lines containing extended characters cause the script to choke when they are posted to Piwik.
I was going to throw a pile of logs at it, but my test VM tops out at 25 rec/sec and the DL380G7 I got down to install as a dedicated Piwik server has issues, HP engineer on site tomorrow!
I can't reproduce the error you get in post 160. I copy/pasted the line you specified and saved it in an IIS log file; maybe the resulting file has a different encoding than yours. Could you somehow create a minimal file that exhibits this issue and give me a link to download it? What command line did you use?
Replying to Cyril:
I can't reproduce the error you get in post 160. I copy/pasted the line you specified and saved it in an IIS log file; maybe the resulting file has a different encoding than yours. Could you somehow create a minimal file that exhibits this issue and give me a link to download it? What command line did you use?
Head, tail & sed = http://mike.org.uk/badiis7_log.txt
Forgot the command line:
python /home/piwik/public_html/misc/log-analytics/import_logs.py --url=http://192.168.10.113:88 badiis7.log --idsite=25 --recorders=4 --enable-http-redirects --enable-static --enable-bots --enable-reverse-dns --debug
Hi, we have some issues with the latest build of the import on our IIS 7.5 logs. The following messages are shown and the import stopped:
...
2012-05-17 22:07:07,918: [DEBUG] Error when connecting to Piwik: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)
Fatal error: 'ascii' codec can't encode character u'\xe4' in position 32: ordinal not in range(128)
You can restart the import of "u_ex120513.log" from the point it failed by specifying --skip=214 on the command line.
Attached are lines 213-215, together with the field specification, for review:
Fields:
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
Line 213:
2012-05-13 01:26:04 W3SVC7 XNW-WEB01 46.4.192.150 GET / - 443 - 213.133.113.84 HTTP/1.1 Hetzner+System+Monitoring - www.wolke.com 302 0 0 677 91 31
Line 214:
2012-05-13 01:26:36 W3SVC7 XNW-WEB01 46.4.192.150 GET /imagefilm/deutsch.swf - 80 - 216.246.45.86 HTTP/1.0 gsa-crawler+(Enterprise;+S5-HKQ2CJT3FSJJT;+info@tobaccopeople.com) - www.m600.com 200 0 0 34601 236 468
Line 215:
2012-05-13 01:27:03 W3SVC7 XNW-WEB01 46.4.192.150 GET /de/kontakt/anfahrt - 80 - 95.211.139.1 HTTP/1.0 Mozilla/5.0+(compatible;+AcoonBot/4.10.8;++http://www.acoon.de/robot.asp) - www.wolke.com 200 0 0 21376 193 124
Command Line:
import_logs.py --debug --url=http://statistik.mlr-xnet.de u_ex120513.log --idsite=8 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --enable-reverse-dns
(In [6270]) Refs #703 Fixed an encoding issue with non-ascii paths or referrers.
Thanks for all your work and feedback.
As per the last comments, it seems IIS log parsing is now fully working, which was the last critical open bug.
To all users listening here, is your log format now recognized as expected?
As per the comments in this thread, the last remaining changes to make are:
Todo per comments
These sound like easy changes. Otherwise looks like the script is ready for prime time?
I don't think changing constants is a good idea. If your Piwik install is returning frequent errors, you'd have to find out why and fix it. Increasing the constants is, in my opinion, sweeping dust under the carpet.
I'll have a look at the static files issue, that doesn't seem normal at all, since the query string was supposed to be trimmed anyway.
I'll have a look at the static files issue, that doesn't seem normal at all, since the query string was supposed to be trimmed anyway.
I think the query string is no longer stripped by default (good).
If your Piwik install is returning frequent errors, you'd have to find out why and fix it.
That's true, but PHP errors are sometimes random and can happen frequently, so either we let users change the constants easily or, better, we ship safe defaults that give a happy user experience :)
I don't really like PHP, as you must know, but what do you mean by random errors? What kind of fatal error can be triggered randomly and still not be considered a bug?!
Anyway, change the constants if you think it's safer. But please at least change the logging.debug to something like logging.warning so that we don't silently fail. Otherwise, people may (and will) complain that the script is horribly slow, which is expected if it has to make each request to Piwik multiple times.
Regarding the query string, you're right, I'll try to fix that behaviour.
(In [6271]) Refs #703 Better handling of query strings.
reetz: can you try again with the latest commit? That should be fixed.
The latest build, 6274, works like a charm with the IIS logs. Thanks a lot for the good work.
Replying to Cyril:
reetz: can you try again with the latest commit? That should be fixed.
Hi, unfortunately now it is not working at all.
First I just replaced import_logs.py, but now I have done a complete reinstall with trunk-r6281, and every time I start
python /www/trunk/misc/log-analytics/import_logs.py --url=http://piwik.xxxxx.de /access-log-201203 --idsite=1
I get the following result:
0 lines parsed, 0 lines recorded, 0 records/sec
Parsing log /access-log-201203...
Traceback (most recent call last):
File "/www/trunk/misc/log-analytics/import_logs.py", line 1287, in <module>
main()
File "/www/trunk/misc/log-analytics/import_logs.py", line 1251, in main
parser.parse(filename)
File "/www/trunk/misc/log-analytics/import_logs.py", line 1184, in parse
hit.path, hit.query_string = hit.full_path.split(config.options.query_string_delimiter, 1)
ValueError: need more than 1 value to unpack
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
1 lines parsed, 0 lines recorded, 0 records/sec
[...repeating...]
The same happens if I try to import your example logs in /trunk/misc/log-analytics/tests/logs
Have there been any changes to the command line? With the original 1.7.2-rc8 there are no problems with my logfile or your test files.
(In [6282]) Refs #703 Fixed a bug introduced in the IIS parsing refactoring, thanks reetz.
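The traceback above comes from unpacking the result of `split(delimiter, 1)` when the path contains no delimiter, because the list then has only one element. A minimal sketch of a split that tolerates a missing query string follows. `split_query` and `join_query` are hypothetical helper names, not functions from import_logs.py, and the committed fix may look different.

```python
def split_query(full_path, delimiter="?"):
    """Split a request path into (path, query_string).

    str.partition always returns three parts, so a path without a
    delimiter yields an empty query string instead of raising ValueError.
    """
    path, _, query_string = full_path.partition(delimiter)
    return path, query_string


def join_query(path, query_string, delimiter="?"):
    """Rebuild the full path, adding the delimiter only when needed."""
    if query_string:
        return path + delimiter + query_string
    return path
```

Rejoining through `join_query` also avoids leaving a trailing delimiter on URLs that have no query string.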
Replying to Cyril:
reetz: oops, that should be fixed, thanks for the report.
Hi,
Yes, it's working now. Many thanks.
Just one little thing: all action URLs in "Visitor Log" have a "?" at the end. It doesn't bother me, but perhaps others will find this confusing.
By the way: is there a possibility to exclude certain paths from being imported?
(In [6283]) Refs #703 Do not append the query string delimiter if there's no query string, thanks reetz.
Good catch, the query string delimiter was always appended since the latest refactoring. That should be fixed now.
It's indeed possible to exclude some paths: check out --exclude-path and --exclude-path-from.
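As an illustration of what path exclusion amounts to, here is a small sketch. It assumes shell-style wildcard patterns matched with Python's `fnmatch` module, and `is_excluded` is a hypothetical helper. Check the script's --help output for the exact pattern syntax that --exclude-path and --exclude-path-from accept.

```python
import fnmatch


def is_excluded(path, patterns):
    """Return True if the path matches any of the exclusion patterns.

    Assumption: patterns use shell-style wildcards, where "*" matches
    any run of characters, including "/".
    """
    return any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)
```

For example, the pattern `/typo3temp/*` would exclude `/typo3temp/javascript_0b12553063.js` but leave `/index.php` alone.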
After 17.3M good lines, got another encode error, script version 6283
Fatal error: 'ascii' codec can't encode character u'\u2013' in position 20: ordinal not in range(128)
One line log: http://mike.org.uk/test5_log.txt
(In [6295]) Refs #703 Fixed encoding bug with a non-ASCII user-agent.
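For background on this class of error: "'ascii' codec can't encode" appears when non-ASCII text is converted to bytes with the ASCII codec, which Python 2 presumably did implicitly somewhere while the request was being built. The sketch below shows one way to avoid it by encoding values as UTF-8 explicitly before URL-encoding them. It is written for Python 3, and `encode_params` is a hypothetical helper, not the committed fix.

```python
from urllib.parse import urlencode


def encode_params(params, encoding="utf-8"):
    """Return a URL-encoded request body as bytes.

    Every value is encoded to bytes explicitly, so non-ASCII characters
    (for example in a user-agent or path) are percent-encoded rather
    than passed to the ASCII codec.
    """
    encoded = {key: str(value).encode(encoding) for key, value in params.items()}
    # After percent-encoding the result contains ASCII characters only.
    return urlencode(encoded).encode("ascii")
```

The en dash from the error above (U+2013) ends up as `%E2%80%93` in the request body.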
Doesn't work on Python 3. 2to3 fixes most of the stuff, but there are still some little snags, like the base64 decoder expecting bytes and getting a string.
It's not expected to work with Python 3. It may be supported later, but that's definitely not a requirement for a very first version.
Still, you're welcome to send patches to fix issues with 2to3, as long as they're rather simple (we wouldn't want to add much complexity just to support Python 3, at least not for now).
AFAIK everything that runs on 3 should run on 2.7+, so I'd think also developing against 3 would avoid a later bigger update.
Regarding patches: I'm no python dev and I don't have the resources to take care of possible problems with an interpreter not supported by upstream. I'll be happy to help when python 3 is supported, until then I'll use a python 2 slot for this.
Thanks for the effort though, not being able to import log files has been a major blocker for piwik here :-)
There are only 2-3 days left before release / freeze - is everyone happy with the script for a V1?
I've just updated to the latest version and realised that the addition of the file.seek(0) calls stops you from being able to pipe in logs through stdin. Is it possible to disable that with a flag? It's really, really useful being able to pipe logs straight from apache.
oliverhumpage: seek is actually only used when the log format is autodetected, and I really can't see a way to avoid this (due to IIS). So you have to explicitly specify the format (with --log-format-name) when reading from stdin.
Let me know if that doesn't work (it should).
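Why autodetection and stdin don't mix can be shown with a short sketch. Detection has to read the start of the log and then rewind, and a pipe cannot be rewound. `peek_first_line` is a hypothetical helper for illustration and is not taken from import_logs.py.

```python
def peek_first_line(stream):
    """Return the first line of a log stream and rewind to the start.

    Raises ValueError for streams that cannot seek (such as a pipe on
    stdin), in which case the format has to be given explicitly.
    """
    if not stream.seekable():
        raise ValueError(
            "cannot autodetect the log format on a non-seekable stream; "
            "specify it explicitly, e.g. with --log-format-name"
        )
    first_line = stream.readline()
    # Rewind so that the parser sees the file from the beginning again.
    stream.seek(0)
    return first_line
```

Buffering the first few lines in memory instead of seeking would be another option, at the cost of more bookkeeping in the parser.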
Ah, you're right - if I specify a regex or name then it stops complaining, which is fine. However, there is still a problem that no lines are being read from stdin. If I import from a file with a couple of lines in, I get results (it says lines have been parsed). However, if I specify the file as "-" and copy/paste those same lines, nothing gets logged, not even an error. I switched on --debug and did this for you:
# /path/to/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --config=/path/to/piwik/config/config.ini.php --url='http://piwik.local/' --recorders=1 --enable-static --log-format-name=common_vhost --debug -
2012-05-29 20:56:53,619: [DEBUG] Accepted hostnames: all
2012-05-29 20:56:53,778: [DEBUG] Piwik URL is: http://piwik.local/
2012-05-29 20:56:53,778: [DEBUG] No token-auth specified
2012-05-29 20:56:53,779: [DEBUG] No credentials specified, reading them from "/path/to/piwik/config/config.ini.php"
2012-05-29 20:56:53,780: [DEBUG] Using credentials: (login = admin, password = xxxxxxxxxxxxxxxxxxxxxxxxxx)
2012-05-29 20:56:54,067: [DEBUG] Authentication token token_auth is: xxxxxxxxxxxxxxxxxxxxxxxxxx
2012-05-29 20:56:54,067: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec
2012-05-29 20:56:54,068: [DEBUG] Launched recorder
Parsing log /dev/stdin...
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:19 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
www.domain.co.uk 9.8.7.6 - - [29/May/2012:12:41:21 +0100] "GET /robots.txt HTTP/1.1" 403 212 "-" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)"
0 lines parsed, 0 lines recorded, 0 records/sec
0 lines parsed, 0 lines recorded, 0 records/sec
^C
Logs import summary
-------------------
0 requests imported successfully
0 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
0 requests imported to 0 sites
0 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 6 seconds
Requests imported per second: 0.0 requests per second
As you can see, pasting the lines in did nothing. Importing the exact same lines but from a file gave:
Logs import summary
-------------------
0 requests imported successfully
2 requests were downloads
2 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
2 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Also, --show-progress appears to have got itself switched on even though it's not specified in the command-line (this happens with or without --debug).
Have I got something weird in my installation, or can you reproduce this?
(In [6403]) Fixes #3139 Adding new 'bots' parameter to the Tracking API. When set to 1 Piwik will record the request even if it is made by a bot (currently detected are only Googlebot and some Bing bots)
Refs #703 - Cyril, when --enable-bots is set, can you please make sure the parameter &bots=1 is also sent with the piwik.php request? Only in this case, though. Thanks!
EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!
Replying to matt:
EspadaV8, the bug is my fault: I packaged RC2 with a debug statement. Please try with RC3, it should work OK!
Awesome, RC3 seems to be importing everything nicely :) Thanks
We have now set up a demo of Piwik log analytics.
The demo at: http://demo-log-analytics.piwik.org/ has only 1 day of data for now.
It has 2 websites to show default import mode and full mode (with bots, files, errors, etc.)
I will now close this ticket as it is getting quite long, but I have opened another one to keep track of the next features: #3163
Please post all new bug reports and feature suggestions in this ticket: #3163
(In [6433]) Refs #703 Set bots=1 accordingly.
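In outline, the change requested above amounts to something like the following sketch. `build_tracking_params` is a hypothetical function, not the script's code. The parameter names `rec`, `apiv`, `idsite`, `url`, `cdt` and `bots` are the ones that appear earlier in this ticket; everything else is an assumption.

```python
def build_tracking_params(idsite, url, hit_time, enable_bots=False):
    """Build the parameters for one piwik.php tracking request.

    bots=1 is added only when bot tracking was requested, so that Piwik
    records hits made by bots instead of discarding them.
    """
    params = {
        "rec": "1",
        "apiv": "1",
        "idsite": str(idsite),
        "url": url,
        "cdt": hit_time,
    }
    if enable_bots:
        params["bots"] = "1"
    return params
```

Without --enable-bots the request is unchanged, so Piwik's normal bot filtering still applies.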