In Piwik 1.8 we released a great new feature: importing access logs to generate statistics.
The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!
New features
Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:
After that, bot & crawler detection would be much better.
PERFORMANCE
How do you debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed; it should typically be between 2,000 and 5,000. Without a dry run, the script also inserts the new pageviews and visits by calling the Piwik API.
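For example (the script path, Piwik URL, and log path below are placeholders):

python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://example.com/piwik/ --dry-run /var/log/apache2/access.log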
Other tickets
Attachment: Document the debian vhost_combined format
vhost_combined.patch
Attachment: Force hostname patch
force_hostname.patch
Attachment:
README.apache_log_recorders.patch
Attachment: WinZip compressed file
u_ex120813.zip
Attachment: Sample IIS file for testing variations of c-ip field
test_c-ip_iis_log.log
Attachment: Log Parser README Update with Nginx Log Format for Common Complete
README_nginx_log_format.diff
Attachment: Log for WMS 9.0
WMS_20130523.log
Could I just re-ask an unanswered problem from ticket #703? If instead of specifying a file you do
cat /path/to/log | log_import.py [options] -
then does it work for you, or do you just get 0 lines imported? Because with the latest version I'm getting 0 lines imported, and that means I can't log straight from apache (and hence the README is wrong too).
oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?
Counting Downloads:
In a podcast project I want to count only the downloads of file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.
Another topic in this area is: how are downloads counted? Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on succeeding visits. All together that would be one download.
A possible "solution" (or maybe a workaround): sum up all the "bytes transferred" and divide it by the largest "bytes transferred" for a certain file.
Feature request: support Icecast logs. We currently use AWStats, but it would be great to be able to move to Piwik.
Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. Turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script.
Could you update the readme so the middle of the regex changes from:
\[(?P<date>.*?)\]
to
\[(?P<date>.*?) (?P<timezone>.*?)\]
This will then make it all work.
Thanks,
Oliver.
(In [6471]) Refs #3163 Fixed regexp in README.
I've been fiddling with this tool and it looks really nice. The biggest issue I've found is when using --add-sites-new-hosts.
It's quite difficult in my case (using a control panel) to add the required %v:%p fields to the custom log format.
What I do have is a log for every domain, so being able to specify the hostname manually would do the trick for me.
In the current situation launching this:
python /var/www/piwik/misc/log-analytics/import_logs.py
--url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors
--enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts /var/log/apache2/example.com-combined.log
Just produces this:
Fatal error: the selected log format doesn't include the hostname:
you must specify the Piwik site ID with the --idsite argument
Having a --hostname example.com option that forced the hostname (the same as the filename, in my case), or something like an --idsite-fallback= option, would fix my issues.
I'm not a piwik dev, but what I think you're trying to do is:
For every logfile, get its filename (which is also the hostname), check if a site with that hostname exists in piwik: if it does exist, import the logfile to it; if it doesn't exist, create it, then import the logfile to it.
The way I'd do this is to write an import script which:
http://piwik.org/docs/analytics-api/reference/ gives the various API calls; it looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).
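A minimal sketch of that flow in Python 2 (the URL, token, and JSON field names are assumptions, not the confirmed API response shape; error handling omitted):

import json
import urllib
import urllib2

PIWIK_URL = 'http://your.piwik.install/'
TOKEN_AUTH = 'xxxxxxxxxx'

def api(method, **params):
    # Generic helper: call the Piwik HTTP API and decode the JSON answer.
    params.update(module='API', method=method, format='json',
                  token_auth=TOKEN_AUTH)
    return json.load(urllib2.urlopen(PIWIK_URL + '?' + urllib.urlencode(params)))

def site_id_for(hostname):
    # Reuse an existing site whose URL mentions the hostname, else create one.
    for site in api('SitesManager.getAllSites'):
        if hostname in site.get('main_url', ''):
            return site['idsite']
    # addSite's response carries the new site's ID (exact shape may differ).
    return api('SitesManager.addSite', siteName=hostname,
               urls='http://' + hostname)

You would then run import_logs.py with --idsite set to the returned ID for each logfile.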
HTH (a real piwik person might have a better idea)
Oliver.
Thanks for your answer Oliver, your process is perfectly fine, but I'd rather avoid having to code something that could be handled by slightly extending the functionality of --add-sites-new-hosts.
And thanks for the links too, I'll have a look.
It would be nice to document the standard format (at the moment provided only on Debian/Ubuntu) that gives Piwik the required hostname.
The format is this:
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
You can see the latest version from debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]
See attached a small change to the README file.
After looking at the code I created a patch to add a new option called --force-hostname that expects a string with the hostname.
If it's set, the host value will ALWAYS be the one given by --force-hostname.
This makes it possible to treat logfiles in the ncsa_extended or common formats as if they were complete formats (creating idsites when needed, and so on).
(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.
(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.
Thanks aseques, both your feature request and your patch were fine; I've just committed them. Note: I renamed the option to --log-hostname for consistency with the --log prefix.
Hi,
I'm not sure whether this is the right place for this ticket/problem.
I have a problem importing access logs from my shared webspace. I copied the text from here: http://forum.piwik.org/read.php?2,90313
Hi,
I'm on a shared webspace with SSH support. I'm trying your import script to analyse my Apache logs.
I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without "skip", it's the same "skip line" every time.
Example:
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
Fatal error: Forbidden
You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.
I tried to figure out on which line the script ends with that fatal error, but I can't. If I restart it with "skip=327" it runs to the end and all works fine. The same problem occurs on some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Is it a malformed line in the access log? Which line should I check?
Regards
Hexxer: you're getting a HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.
How do you know that?
It stops every time at the same line, and if I skip that it runs 10 or 15 minutes without a problem (up to this line it needs 2 minutes or so).
Regards
Do you know the exact line that causes a problem? If you put only this line, does it also fail directly? Thanks!
Benaka is implementing bulk tracking in ticket #3134. The Python script will simply have to send a JSON payload:
{"requests":["url1","url2","url3"],"token_auth":"xyz"}
I suppose we can do some basic tests to see which value works best?
Maybe 50 or 100 tracking requests at once? :)
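A rough sketch of what such a bulk request could look like in Python 2 (the URL, token, and query strings are placeholders):

import json
import urllib2

payload = json.dumps({
    'requests': ['?idsite=1&url=http://example.com/a&rec=1',
                 '?idsite=1&url=http://example.com/b&rec=1'],
    'token_auth': 'xyz',
})
# A single HTTP request carries the whole batch of tracking requests.
req = urllib2.Request('http://example.com/piwik/piwik.php', payload,
                      {'Content-Type': 'application/json'})
urllib2.urlopen(req)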
Hi,
.............
Do you know the exact line that causes a problem? If you put only this line, does it also fail directly? Thanks!
.............
No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.
Replying to matt:
I suppose we can do some basic test to see which value works best?
Maybe 50 or 100 tracking requests at once? :)
Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)
Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from apache's customlog :) Because each apache process hangs around waiting to log its request before moving onto the next request, it started jamming the server.
Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.
Suggested patch attached to this ticket.
Hi,
Is there a doc about the regex format for import_logs.py?
We would like to import a file with an AWStats LogFormat:
%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other
Thanks for your help,
Ludovic
I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file; how can I set it to import the log file with yesterday's date on it?
Here is the format of my log files
access.log.%Y-%m-%d.log
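One way to do this from a daily cron job is a date substitution in the filename (GNU date; the paths and URL are placeholders):

python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://example.com/piwik/ --idsite=1 /path/to/logs/access.log.$(date -d yesterday +%Y-%m-%d).log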
Thanks a lot for all your great work!
The server log file analytics works great on my server.
I am using a lighttpd server and added the Accept-Language header to accesslog.format:
accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\""
(see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)
I wonder if it would be possible to add support for the Accept-Language header to import_logs.py?
So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.
Replying to Cyril:
(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.
Thanks for possibilities to import logs and also thanks for the log-hostname patch.
Not sure whether it is the patch or whether it is caused by using --recorders > 1, but on the first run with --add-sites-new-hosts I got 13 sites created for the same hostname.
I'm having a similar problem to Hexxer. When I do a --dry-run I get no errors, but when adding to Piwik it falls over at about the same spot. It's not one offending log file or line of a log file that's causing it. I'll attach the output with debugging on below. I've run the script multiple times, removing the line where the script fell over, removing the log file where it fell over, etc. It always dies around line 9000-10000 in the 3rd log file.
I'm not sure if this is of interest, but when doing a dry run the script does ~600 lines/sec; when importing to Piwik it does ~16.
The output file is here. Akismet was marking the attachment as spam.
(In [6509]) Refs #3163 updated README to suggest increasing the --recorders value.
oliverhumpage: thanks, I've committed your diff.
ludopaquet: no doc yet, I suggest you take a look at the code, taking _COMMON_LOG_FORMAT as example.
lewmat21: I suppose each log line has its own date anyway, so it doesn't matter what the filename is.
sc_: I don't think using Accept-Language to guess the country is a good idea. As the header name says, it's about languages (locales), not countries. First, many languages are spoken in several countries. If the Accept-Language says you accept English, what country would you pick? Second, people can have an Accept-Language that doesn't match their country. I personally surf using English as my Accept-Language, whereas I'm French and live in France.
law: can you reproduce the issue? If so, can you give me the access log as well as the full command line you used?
andrewc: can you edit line 741 and increase the 200 value to something like 10000? It will print the full error message instead of only the first 200 characters, which is not enough to get the Piwik error.
@Cyril:
As far as I know Piwik does the same when the GeoIP plugin isn't used:
http://piwik.org/faq/troubleshooting/#faq_65
The location is then guessed from en-us, fr-fr etc.
But the more important point is that it would be useful for website development to know what languages the people who visit my website use. So it would be great if support for the Accept-Language header could be added.
Sorry for the wrong formatting (the preview didn't work).
Here is the correct link:
@Cyril:
Here's the output file with the full error messages.
Replying to andrewc:
@Cyril:
Here's the output file with the full error messages.
Sorry this is the link https://www.dropbox.com/sh/zat1m6lqphndpny/wH6n4mDaD6/output0907.txt
sc_: OK, I didn't know about this. Considering GeoIP will be integrated into Piwik soon (see #1823), which is a much better solution, I don't think we should modify the import script to use Accept-Language headers.
andrewc: your Piwik install (the PHP part) is returning errors:
Only one usage of each socket address (protocol/network address/port) is normally permitted
You need to find out why and fix it. It's unrelated to the import script.
Thanks for your great work. We've been using log import for some time now and have a few ideas/problems.
I don't know, should I open new tickets or write here?
One major thing is how to bring the number of visitors/unique visitors down to make it more similar to JavaScript tracking and Google Analytics.
I understand that we don't have cookies and other config information to identify visitors.
We've managed to bring the number of pageviews/actions down severalfold (from 5 times to 2 times more than JavaScript tracking), or much more in a few cases (like from 100 times more than JavaScript).
Our ideas and changes include (we assumed that we should get numbers as close as possible to JavaScript tracking):
a few workarounds :)
We ended up with the number of actions (pageviews) at about twice the JavaScript number, without influencing the number of visitors (about 50% bigger than JavaScript).
Our extreme case was 300 views (JavaScript tracking) versus 30,000 views with the import script; after our changes, about 570 views with the import script.
fjohn:
Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:
So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.
What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?
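For example, usage could look like this (the file name and glob patterns are hypothetical):

python import_logs.py --url=http://example.com/piwik/ --idsite=1 --exclude-path-from=common_excluded_paths.txt access.log

where common_excluded_paths.txt contains one pattern per line, e.g. */img_thumb.php* and *minimize_js*.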
Replying to Cyril:
fjohn:
- why shouldn't we count POST requests? HEAD, I can agree, but POSTs are legitimate requests made by regular browsers
But POST is also used by Ajax requests all the time (and those are not what we would count with JS). We've just simplified that to drop anything other than GET.
- what kind of user-agent doesn't have OS data? Aren't they bots anyway?
For me the question is: does a "real user" always send OS data? In our logs there were, for example, curl, Python libs, XRumer, scrapers and many more odd requests that weren't on the bot list.
- limiting actions: that's on the PHP-side, I'll let matt answer this
Yes, it is. But we have a lot of bots that were not on the list; I don't know how it works, but they showed up in the log-import profile, not in the JavaScript profile.
Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:
- it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is
I agree with you. We identified 2 of them (thumbs and minimizers) and we have very universal code for it - for example, (if picture and &w and &h) identifies the 3 most popular thumb scripts (including those in WordPress and osCommerce).
We did it because on an osCommerce shop we had 1000 more page views than with JavaScript - should we accept that?
- there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)
Is it counted with JavaScript tracking? From our tests, no.
- most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).
We have only 2 more "if" statements in the current for loops. Still, you're right, that list can grow :)
So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.
What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?
That could be a good idea. It would be nice to test this on a larger number of websites/scripts; we've tested 5 regular websites and a few other scripts.
Ajax requests by no means use POST all the time. For instance, jQuery (the most popular JavaScript library) uses GET by default:
http://api.jquery.com/jQuery.ajax/
Regarding the rest of the comments: just to make things clear, I wasn't advocating against what you did for your specific site, but against doing this by default in the script. I very much prefer to add options to the import script (it has quite a few already) to allow users to customize it for their own needs rather than try to have sane defaults, which we really can't do as there's too much diversity on the Web :)
Cyril:
About Ajax - that is why we set a limit of 100 page views per visitor. We found a case where one user generated 700 to 1,000 views thanks to Ajax GET requests.
About the whole thing: sure, I understand that. But we wanted to use it for a hosting company, and we are not making any "special case"; we are trying to test log import on as many websites as we can.
So we just wanted to share some of our tests and ideas. In most cases everything works well, but WordPress and osCommerce are very popular.
Showing customers 30k views instead of 300 is not the best way to prove that log import is working fine. On an IPB forum we had 5 times more pageviews; now it's less than twice JS.
@oliverhumpage and to all listening in this ticket: is there any other pending bug or important missing feature in this script?
Are you all happy with it? Note: we are working on performance next.
My Apache log gives hostnames rather than IP addresses. It looks like the import script sends the hostname, which the server side tries to interpret as a numeric IP value, with the result that all hostnames translate to 0.0.0.0. I added a call to socket.gethostbyname() in the import script, but it's undone all the performance gains I got through the bulk request patch.
Is there some simple fix that I'm missing here?
Some IIS logs do the same as bjrubble mentioned in the comment above - for their c-ip section, a host name may be found instead of just an IP address.
This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.
Because Piwik lacks the capability to track news feed subscribers (and I don't want to use FeedBurner), I would like to import that particular information from the Apache logs. All other web requests are tracked successfully by Piwik, and I want the feed user information merged into the same Piwik website. For instance, my news feed is located at www.domain.com/rss.xml; how can I import only that particular information into Piwik?
Hi guys,
We found one odd case.
On 2 servers (one dedicated and one VPS), each new visit = a new idvisitor (despite the same configId).
BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.
Do you have any idea why, and how is it supposed to work? I've spent some time in visit.php: when there is no cookie and the visit is less than 30 minutes old, it's a new idvisitor.
Could you find an example log file showing the problem on both installations, with a few lines (like 3 or 4), to replicate the bug? This would help in finding the fix. Thanks
Yes matt, I will have them tomorrow (day off today), but how should it work? Should log parsing count unique visitors or not?
I have activated log import via an Apache macro to have live stats, but we have 20 sites with high load, and the problem we have now is that access via the URL is blocking (30 or more import_logs.py processes accessing Piwik).
Could we get some direct log import that does not go through the HTTP interface, but directly through a console PHP load?
thx Mario
and keep up the great work
Hi @all
We are testing the Python import_logs.py script. At the moment we are not able to import IIS log files which are compressed with WinZip or 7-Zip. If we unzip the archive before running the script it works quite well.
It seems the Python script is not able to uncompress the files...
Attached is an example archive.
(In [6734]) Refs #3163, add integration tests (in PHP) for log importer.
(In [6737]) Refs #3163, modified log importer to use bulk tracking capability.
Notes:
(In [6739]) Refs #3163 - clarifying this option shouldn't be used by default
(In [6740]) Refs #3163, made size of parsing chunk == to max payload size * recorder count.
(In [6743]) Refs #3163
TODO:
(In [6745]) Fixing build? Refs #3163
(In [6749]) Refs #3163
(In [6756]) Refs #3163, show average records/s along w/ current records/s in log importer.
Replying to jamesvl011:
Some IIS logs do the same as bjrubble mentioned in the comment above - for their c-ip section, a host name may be found instead of just an IP address.
This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.
@bjrubble and james, could you please submit the correct REGEX? We would be glad to commit the fix, thanks.
Adding "Heuristics to not track bot visits" in the ticket description.
If you have a suggestion or request for the script - or any problem or bug, please post a new comment here.
Replying to matt:
@bjrubble and james, could you please submit the correct REGEX? We would be glad to commit the fix, thanks.
Matt -
The regex for c-ip (line 134 of import_logs.py when I looked at svn) ought to be like the line for User-Agent, allowing any text string without spaces:
'c-ip': '(?P<ip>\S+)'
I'm assuming the Piwik API can handle a host name passed in place of the IP address? If not, Python will have to do hostname lookups (preferably with its own mini-cache) as it parses the file.
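A minimal sketch of such a cached lookup (purely illustrative; not the script's actual code):

import socket

_dns_cache = {}

def resolve_ip(host):
    # Cache results so repeated hostnames don't trigger repeated slow DNS lookups.
    if host not in _dns_cache:
        try:
            _dns_cache[host] = socket.gethostbyname(host)
        except socket.error:
            _dns_cache[host] = '0.0.0.0'  # unresolvable: fall back to a null IP
    return _dns_cache[host]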
I'll attach a file to this ticket with an example IIS log file that you can use for testing - it will have four rows, three with host names in the c-ip field and one with an IP address.
I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).
When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.
I'm sure this didn't happen in the older 1.7.2 version.
Can you replicate, and if so, is there an easy fix?
Thanks.
(In [6824]) Refs #3163, fix concurrency bug in import script where sites get created more than once when --add-sites-new-hosts is used.
Replying to oliverhumpage:
I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).
When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.
I'm sure this didn't happen in the older 1.7.2 version.
Can you replicate, and if so, is there an easy fix?
Just committed a fix for this bug. Can you use the file in svn?
(In [6826]) Refs #3163, added more integration tests for log importer & removed some unnecessary xml files.
Replying to capedfuzz:
Just committed a fix for this bug. Can you use the file in svn?
Perfect, that's fixed it - thank you.
Oliver.
I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.
When I restart it with the skip parameter, it would not record any more lines and fail again a few lines down (see output below)
C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.
The format we are using is the W3C Extended Log File Format, and we are tracking extended properties such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to be attached (20 MB even when zipped). Can I send it by some other means?
Thanks a lot!
-Jo
Hi,
Nice module we're currently assessing.
I have 2 questions:
1/ We have several load-balanced servers. Each server generates its own log files, but for the same FQDN. How can we process and aggregate the log files within the same website, given that the log files need to be ordered by date?
2/ The log files contain consumed bandwidth. Is it conceivable to enhance this module to parse and log this information? Or, if we need this information, should we consider creating a plugin?
Thanks for your feedback.
The import_logs.py script should be able to handle and order the dates of your different logs when computing statistics. That's the main purpose of the "invalidate" function within this script.
The best would be to import all your logs at once and then to run the archive job so that it can compute statistics for the "invalidated" dates.
Hi,
I'm trying to use "import_logs.py" to parse the Java Play framework's log; a log file sample follows:
15.185.97.217 127.0.0.1 - - Sep 04 18:28:38 PDT 2012 "/facedetect?url_pic=http%3A%2F%2Ffarm4.staticflickr.com%2F3047%2F2699553168_325fb5509b.jpg" 200 345 "" "Jakarta Commons-HttpClient/3.1" 5683 ""
But Python threw: "invalid log lines".
Actually, the Java Play log file is similar to lighttpd's access.log. Is there any easy way to adapt this Python file to parse other log files?
It was suggested by Matt that I add my issue to this ticket:
I'm running Piwik 1.8.3 on IIS 7. I've installed the GeoIP plugin, and also tweaked based on http://forum.piwik.org/read.php?2,71788. It is working. However, my installation is only tracking US-based visits.
My IIS instance archives its log hourly. I've attached one recent log for review, on the chance that it will contain clues as to why I'm only seeing US-based visits.
Attached log file is named u_ex12091212.log.
I have a log with the following format where www.website.com represents the hostname of the web hosts hosted on the server. I get an error that the log format doesn't include the hostname.
188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"
I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.
Thanks
To everyone with questions in this ticket, thank you for your bug reports. You can try to modify the Python script to make it work for your log files. It's really simple code at the start of the script.
If you are stuck and need help, Piwik experts can help with any issue related to the log import script. Contact them at: http://piwik.org/consulting/
Otherwise, we may fix some of these requests posted here, but it might take a while..
We hope you enjoy Log Analytics!
Replying to jason:
I have a log with the following format where www.website.com represents the hostname of the web hosts hosted on the server. I get an error that the log format doesn't include the hostname.
188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"
I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.
Thanks
Last time, I adapted the code base of "import_logs.py", especially for Java Play log file parsing, successfully. I think you should hard-code removing the hostname pattern with the "http://" prefix, or string-replace it.
Had a very minor problem with the script today:
I have daily log rotation enabled, and when no user visits a site on a given day, the log file for that day will be empty. This means the log format guessing fails, leading to an error.
Preferably, when a log file is empty, one would like to skip the file without throwing an error. This is easily achieved by changing the line that checks for log file existence to also check if the log file has contents:
`if not os.path.exists(filename) or os.path.getsize(filename) == 0:`
(In [7313]) Refs #3163 Adding libwww in excluded user agents, since libwww-perl is a common bot
As reported in: http://forum.piwik.org/read.php?3,95844
(In [7382]) Refs #3163: Log Parser README Update with Nginx Log Format for Common Complete, thanks to phikai.
(In [7383]) Refs #3163: don't fail to autodetect the format for empty files.
Hey guys, there have been many updates to the script in trunk. Please let us know if your suggestion or report hasn't yet been committed.
Kudos Cyril for the updates!
edit: Check also this ticket: #3558
For the record, with the current trunk, I can sustain 2000 requests/second in dry-run mode on a Xeon 2.7 GHz, and 1000 requests/second without dry-run, with --recorders=10 and the default payload (Piwik is installed on another server, 4 cores).
Not to say that you should get the same numbers as it depends on a LOT of factors (raw processing power, number of recorders, payload, PHP configuration, log files, network, etc.), but if you only get 50 requests/second and you have a strong machine, something is probably wrong.
Running with --dry-run is a good way to know how fast the Python script can go without really importing to Piwik, which already excludes many factors.
I am running Piwik 1.9.2 on a RHEL 5.7 server running Apache.
I am trying to implement the Apache CustomLog that directly imports into Piwik as described in the README (https://github.com/piwik/piwik/blob/master/misc/log-analytics/README). I am not sure if I have a problem with my configuration or if there is a potential bug in the Piwik import_logs.py script. After some poking around on the command line, it seems that the script works perfectly when it is given an entire file, but when you try to feed it a single line from a log file it crashes. I have included my cmd output below for you to view. Any help would be greatly appreciated. Also, if you need any additional information please let me know!!
Firstly let me pull the first line of my logfile to show its syntax:
[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log
boarddev-beta.teradyne.com 131.101.52.31 - - [12/Nov/2012:11:16:24 -0500] "GET /boarddev/ HTTP/1.1" 200 10541 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"
Now when I run the file as the Apache configuration suggests I get the following (note: if I do not put the "-" at the end of the command, the line from the logfile is ignored and the script simply outputs the README file):
[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log | ./import_logs.py --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' -
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
Traceback (most recent call last):
File "./import_logs.py", line 1462, in <module>
main()
File "./import_logs.py", line 1426, in main
parser.parse(filename)
File "./import_logs.py", line 1299, in parse
file.seek(0)
IOError: [Errno 29] Illegal seek
And finally, if I run the file itself through the script, I get the following, showing that it loves processing the logfile as long as it gets an entire file fed to it all at once:
[katonj@mimir2:log-analytics ] $ ./import_logs.py --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' boarddev-beta.teradyne.com.log
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log boarddev-beta.teradyne.com.log...
Purging Piwik archives for dates: 2012-11-12
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.
Logs import summary
-------------------
8 requests imported successfully
0 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
8 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 0 seconds
Requests imported per second: 24.01 requests per second
ottodude125: log detection + reading from stdin is actually not supported; you have to pick one. I'll fix the bug later on though.
When you set up the Apache CustomLog you are piping the log messages into the script as soon as they appear. This is the same as stdin, right? I was just trying to simulate that process by running head -1 on a log file to get a log message and piping that into the script.
Since auto format detection relies on having several lines to decode, it doesn't work on stdin (it tries to seek to points in the file, hence the "bug" - seek obviously fails on stdin).
When using stdin as the log source you have to use either --log-format-name or --log-format-regex flags on the command line to force a particular format. You might find --log-format-name="common_vhost" is what you want.
So you are completely right. Adding --log-format-name='common_vhost' to the command now allows a logfile to be read from stdin on the command line. So running the following command works great from the command line:
[katonj@mimir2:applications ] $ head -8 babyfat | /hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -
As a side note, I've tried the common_complete name and I tried using the --log-format-regex included in the README, and neither of them had any magical side effects either.
Unfortunately, porting that exact same thing into the Apache httpd.conf file does not work. I have the configuration below, and while the logfile "babyfat" gets populated, Piwik doesn't seem to process any input.
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" baby
CustomLog "|/hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -" baby
CustomLog logs/babyfat baby
Lastly, the output logfile junk.log gets data when the command is run from the command line, but the only time it gets populated from Apache is when you add several -d flags to the CustomLog command and restart Apache, and then you get:
2012-11-13 15:44:12,517: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:12,517: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:12,517: [DEBUG] No token-auth specified
2012-11-13 15:44:12,517: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:12,520: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:13,249: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:13,249: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:13,249: [DEBUG] No token-auth specified
2012-11-13 15:44:13,249: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:13,251: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:14,341: [DEBUG] Authentication token token_auth is: 582b588b9568840fa6f1e208a8702b93
2012-11-13 15:44:14,342: [DEBUG] Resolver: dynamic
2012-11-13 15:44:14,342: [DEBUG] Launched recorder
I have the same issue as ottodude125. Piping one single line from the access.log into import_logs.py works, but using the same command directly from Apache, nothing gets logged.
EDIT: I noticed the log messages appear in the import_logs log when I restart Apache. So it seems like this triggers either Apache to send the messages to stdin or import_logs to read from stdin.
2nd EDIT: CustomLog with rotatelogs works. So the issue must be in import_logs.py.
I noticed in ottodude125's customlog, there's no path to the config file and no auth token: that would explain the errors shown in junk.log. You need to specify one or the other so that import_logs.py can authenticate itself to the piwik PHP scripts.
I'm wondering if the same problem is happening for elm's logs too? @elm, if that doesn't fix it, could you paste your customlog section here too?
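For instance, a CustomLog along these lines (paths and URL are placeholders) should let the script authenticate:

CustomLog "|/path/to/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --config=/path/to/piwik/config/config.ini.php --url='http://example.com/analytics' --log-format-name=common_vhost --output=/tmp/piwik_import.log -" baby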
There was another user in the forums reporting an error: view post
Could we explain the bug when it happens, and fail with a relevant error/notice message?
Here is my CustomLog line (line breaks for better reading):
CustomLog "|/var/www/piwik.skweez.net/piwik/misc/log-analytics/import_logs.py
--url=http://piwik.skweez.net/ --add-sites-new-hosts
--output=/var/www/update.skweez.net/logs/piwik.log --recorders=4
--log-format-name=common_vhost -dd -" vhost_combined
Here is the log that is generated:
...
2012-11-23 22:35:07,759: [DEBUG] Launched recorder
2012-11-23 22:35:07,761: [DEBUG] Launched recorder
2012-11-23 22:35:07,762: [DEBUG] Launched recorder
2012-11-23 22:35:07,763: [DEBUG] Launched recorder
2012-11-24 06:30:01,375: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,378: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,633: [DEBUG] Accepted hostnames: all
2012-11-24 06:30:01,633: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-24 06:30:01,633: [DEBUG] No token-auth specified
2012-11-24 06:30:01,633: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-24 06:30:01,648: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-24 06:30:02,065: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-24 06:30:02,709: [DEBUG] Site ID for hostname update.skweez.net: 7
Purging Piwik archives for dates: 2012-11-23 2012-11-24
2012-11-24 06:30:02,935: [DEBUG] Authentication token token_auth is: ...
2012-11-24 06:30:02,935: [DEBUG] Resolver: dynamic
2012-11-24 06:30:02,936: [DEBUG] Launched recorder
2012-11-24 06:30:02,938: [DEBUG] Launched recorder
2012-11-24 06:30:02,940: [DEBUG] Launched recorder
2012-11-24 06:30:02,941: [DEBUG] Launched recorder
Logs import summary
-------------------
5 requests imported successfully
14 requests were downloads
15 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
1 HTTP errors
0 HTTP redirects
14 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
5 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 28495 seconds
Requests imported per second: 0.0 requests per second
2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,724: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:03,104: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,136: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,141: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,372: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:03,372: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:03,372: [DEBUG] No token-auth specified
2012-11-25 06:33:03,372: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:03,373: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:03,492: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:03,492: [DEBUG] Resolver: dynamic
2012-11-25 06:33:03,493: [DEBUG] Launched recorder
2012-11-25 06:33:03,494: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
Purging Piwik archives for dates: 2012-11-25 2012-11-24
Logs import summary
-------------------
9 requests imported successfully
42 requests were downloads
42 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
3 HTTP errors
0 HTTP redirects
39 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
9 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 86580 seconds
Requests imported per second: 0.0 requests per second
Logs import summary
-------------------
0 requests imported successfully
0 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
0 requests imported to 0 sites
0 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 12 seconds
Requests imported per second: 0.0 requests per second
2012-11-25 06:33:16,016: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:16,016: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:16,016: [DEBUG] No token-auth specified
2012-11-25 06:33:16,016: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:16,017: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:16,156: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:16,156: [DEBUG] Resolver: dynamic
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder
So it is getting the logs when apache is reloading, which it does at night after logrotate.
Hi,
I would be glad if you could add a new option to the script. It should import only the log lines that include a specified path - exactly the opposite of the --exclude-path-from option. As far as I understand, we could just copy/paste the def check_path part and swap the "True" and "False" return values. I posted the part with the changes below.
def check_path(self, hit):
    # Keep the hit only if its path matches one of the included patterns.
    for included_path in config.options.included_paths:
        if fnmatch.fnmatch(hit.path, included_path):
            return True
    return False
Unfortunately I don't know where to modify the script to add this option.
Many thanks for your help.
Hi all,
I am new to Piwik. I installed Piwik on an Apache web server and tried to import a log file from a Tomcat web server, but I get the following error:
Fatal error: Cannot guess the logs format. Please give one using either the --log-format-name or --log-format-regex option
This is the command that I used:
python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://192.168.1.100/piwik/ /home/user/app1/catalina.2012-12-10.log --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots
And this is what the log file contains:
Dec 10, 2012 12:02:50 AM org.apache.catalina.core.StandardWrapperValve invoke
INFO: 2012-12-10 00:02:50,000 - DEBUG InOutCallableStatementCreator#<init> - Call: AdminReports.GETAPPLICATIONINFO(?)
I tried googling it but didn't find much. I also tried the Piwik forum, but the same. Can you help me? What parameter shall I use with the --log-format-name or --log-format-regex option?
In trunk, when I Ctrl+C the script, it does not exit directly; it takes 5-10 seconds before the software stops running and then outputs the log. I think it is a recent regression?
Suggestion - Bandwidth Usage
I used to see it in AWStats...
http://forum.piwik.org/read.php?2,98279,98330#msg-98330
There is no size information in these logs, but I guess AWStats checks the accessed files from the logs and counts them.
For piwik.php performance improvements and asynchronous data imports, see #3632
Has anyone found a solution to this yet? I'm having the same problem with my IIS logs not importing.
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log Z:\logs\W3SVC14\u_ex121218.log...
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 43 lines recorded, 14 records/sec (avg), 43 records/sec (current)
1648 lines parsed, 43 lines recorded, 10 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "Z:\logs\W3SVC14
\u_ex121218.log" from the point it failed by specifying --skip=3 on the command
line.
Replying to unaidswebmaster:
I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.
When I restart it with the skip parameter, it would not record any more lines and fail again a few lines down (see output below)
C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.
The format we are using is W3C Extended Log File Format and we are tracking extended properties, such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to be attached (20Mb even when zipped). Can I send it by some other means?
Thanks a lot!
-Jo
Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.
I am running into the same problem as Jo as well. Please let me know if there are any suggestions or possible solutions. We have been trying to diagnose the problem for a couple of days but still have not found a solution. Thanks.
Replying to wpballard:
Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.
One thing I've noticed is that --dry-run works perfectly. That might help narrow down where the problem is. Likely in the code that commits the changes to the DB.
Hey Folks,
Glad to see there is good interest in the log file processing.
The first feature I would like to see added is the opposite of --exclude-path: an --include-path option.
In our architecture we have MANY web assets under a single domain, and web logs are done by domain. This is out of our control. These assets include multiple applications, APIs, and web services. It would be nice to process the log files by including only the paths we want. The exclusion route is just cumbersome, as each call would require 5-10 excludes instead of a single include.
The second feature I would like to see is support for the XFERLOG format (http://www.castaglia.org/proftpd/doc/xferlog.html) for handling FTP logs.
Much of our business is based on downloading data and files via FTP, so these types of stats and analysis are valuable.
The third feature I would like to see added is the ability to process log files rotated on a monthly basis. I know this goes contrary to the recommendations; however, in our business we do not manage the IT infrastructure, only the line-of-business services and apps on top of that infrastructure.
Currently I am handling this by way of a bash script (a Python sketch of the same idea follows below). Before I process the log file I count the number of lines (using wc -l), then I store that in a loglines.log file. The next time I run the script I use tail on loglines.log, grab the last line count, and use that to populate the --skip param.
To capture the monthly log rotation: if the current wc -l is less than the count in loglines.log, I set --skip to zero (0).
It is crude, but it works. Having this built in native Python would be fairly straightforward and would allow support for monthly rotation.
The added bonus is that the same log file can be processed multiple times in a day, even for daily rotated logs. This is a happy compromise between real-time JavaScript tracking and daily log processing, especially for high-volume sites with huge log files.
Cron is handy for this.
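A minimal sketch of that bookkeeping in native Python (file names, URL, and site ID are placeholders):

import os
import subprocess

STATE_FILE = 'loglines.log'
ACCESS_LOG = '/var/log/httpd/access_log'

def count_lines(path):
    # Python equivalent of `wc -l`.
    with open(path) as f:
        return sum(1 for _ in f)

previous = int(open(STATE_FILE).read()) if os.path.exists(STATE_FILE) else 0
current = count_lines(ACCESS_LOG)

# If the file shrank since the last run, it was rotated: start over at line 0.
skip = previous if current >= previous else 0

subprocess.call(['python', 'import_logs.py', '--url=http://example.com/piwik/',
                 '--idsite=1', '--skip=%d' % skip, ACCESS_LOG])

# Remember how far we got for the next run.
with open(STATE_FILE, 'w') as f:
    f.write(str(current))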
Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload that single line(s) rather than a big file. The skip value will help you find that line.
dsampson: agree for the --include-path suggestion. I'll add it later.
FTP logs: that's definitely not something that should be included in Piwik. You can define your own log format with a regexp; have you tried?
Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?
The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.
See comments inline...
Replying to Cyril:
dsampson: agree for the --include-path suggestion. I'll add it later.
Thanks for this. Appreciated
FTP logs: that's definitely not something that should be included to Piwik. You can define your own log format with a regexp, have you tried?
For those of us in the big data business, a FOSS solution offering all the features of Piwik for FTP would be great. An unlikely fork, so I thought it could be a possible feature.
Working on the regex for XFERLOG. I'm having trouble building a new regex group based on the values of other groups. For instance, the date field is not a clean YYYY-MM-DD, so I need to figure out how to create a regex group based on the values of three other regex groups. I am a regex greenhorn for sure.
Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?
I do it by comparing the last line count to the new one. For instance, #linesyesterday will be greater than #linestoday if the logfile has been rotated. I have done logfiles in Python using just regular text files in the past. They get big, but the head can be severed when it gets too big. A NoSQL DB approach or a data object could also work.
The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.
This would be a good alternative with some hit on performance.
Thanks again for the reply
Did either of these features make it into the latest 1.10.1 release?
Replying to dsampson:
See comments inline...
Replying to Cyril:
dsampson: agree for the --include-path suggestion. I'll add it later.
Log Rotation: The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.
Working on the regex for XFERLOG.
Here is my first cut; however, the DATE field will not be recognized. Dates in XFERLOG are not like those in Apache logs, and I am not sure how to concatenate these groups based on other named groups.
I included some test strings. Yes, I used the public Google DNS IPs for privacy reasons.
I captured everything I could according to the XFERLOG documentation. Perhaps overkill, but it was the best way I knew to work through the expression. Manpage for XFERLOG here (http://www.castaglia.org/proftpd/doc/xferlog.html).
I also provided the example script call and the output from the script.
Looks like the issue is the DATE group, no surprise. But again, I am not sure how to construct it from the input.
Any thoughts are appreciated
--------------TEST STRINGS-------------------
Mon Nov 1 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250kdem/026/026a.zip b o a User@ ftp 0 *
Thu Nov 10 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250kdem/026/026a.zip b o a User@ ftp 0 * c
Tue Jan 1 14:12:36 2013 1 8.8.4.4 88048 /pub/cantopo/250k_tif/MCR201001.tif b o a ftp@example.com ftp 0 * i
Tue Jan 1 14:15:57 2013 4 8.8.4.4 8769852 /pub/geott/ess_pubs/211/211354/gscof_3759r_b_2000mn01.pdf b o a googlebot@google.com ftp 0 * c
Tue Jan 1 16:06:49 2013 11 8.8.4.4 7198877 /pub/toporama/50k_geo_tif/095/d/toporama_095d02geo.zip b o a user@server.com ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m2.tif b o a googlebot@google.com ftp 0 * c
Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m2.tif b o a googlebot@google.com ftp 0 * c
Tue Jan 1 06:59:59 2013 1 8.8.4.4 1679 /pub/geott/eo_imagery/gcdb/W073/N60/N60d50mW073d40m1.summary b o a googlebot@google.com ftp 0 * c
Tue Jan 1 07:02:53 2013 1 8.8.4.4 168087 /pub/geott/eo_imagery/gcdb/W108/N50/N50d58mW108d28m3.tif b o a googlebot@google.com ftp 0 * c
Tue Jan 1 07:04:39 2013 1 8.8.4.4 16958 /pub/geott/cli_1m/e00pro/fcomfins.gif b o a googlebot@google.com ftp 0 * c
--------------REGEX Expression-----------------
(?x)
(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s
(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s
(?P<day>[\d]{1,})\s
(?P<time>[\d+:]+)\s
(?P<year>[\d]{4})\s
(?P<unknown>[\d]+)\s
(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s
(?P<length>[\d]{,})\s
(?P<path>/[\w+/]+)/
(?P<file>[\w\d-]+.\w+)\s
(?P<type>[a|b])\s
(?P<action>[C|U|T|_])\s
(?P<direction>[o|i|d])\s
(?P<mode>[a|g|r])\s
(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s
(?P<service>[\w]+)\s
(?P<auth>[0|1])\s
(?P<userid>[*])\s
(?P<status>[c|i])
(?P<stuff>)
----------------Script Call----------------
./misc/log-analytics/import_logs.py --url=http://PIWIKSERVER --token-auth=AUTHSTRING --output=proclogs/procFtpPiwik.log --enable-reverse-dns --idsite=17 --skip=0 --dry-run --log-format-regex="(?x)(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s(?P<day>[\d]{1,})\s(?P<time>[\d+:]+)\s(?P<year>[\d]{4})\s(?P<unknown>[\d]+)\s(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s(?P<length>[\d]{,})\s(?P<path>/[\w+/]+)/(?P<file>[\w\d-]+.\w+)\s(?P<type>[a|b])\s(?P<action>[C|U|T|])\s(?P<direction>[o|i|d])\s(?P<mode>[a|g|r])\s(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s(?P<service>[\w]+)\s(?P<auth>[0|1])\s(?P<userid>[*])\s(?P<status>[c|i])(?P<stuff>)"-
-------------Script output ----------------------
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log logs/ftpLogsJunco/xferlog2...
Traceback (most recent call last):
File "./misc/log-analytics/import_logs.py", line 1411, in <module>
main()
File "./misc/log-analytics/import_logs.py", line 1375, in main
parser.parse(filename)
File "./misc/log-analytics/import_logs.py", line 1299, in parse
date_string = match.group('date')
IndexError: no such group
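One way around the missing <date> group (a sketch, not a tested fix) is to capture the whole xferlog timestamp as a single group named date instead of five separate groups; import_logs.py looks up a group with exactly that name. The script would still need to understand this date string, since it expects Apache-style dates by default, so this only removes the IndexError. A quick standalone check with Python's re module:

import re

# One of the sample xferlog lines from above.
line = ("Mon Nov 1 04:18:56 2012 4 8.8.4.4 1628134 "
        "/pub/geobase/official/cded/250kdem/026/026a.zip b o a User@ ftp 0 *")

# One <date> group spanning weekday, month, day, time and year.
pattern = (r'(?P<date>\w{3} \w{3} +\d{1,2} [\d:]+ \d{4}) '
           r'(?P<duration>\d+) (?P<ip>[\d.]+) (?P<length>\d+) (?P<path>\S+)')

m = re.match(pattern, line)
print(m.group('date'))  # Mon Nov 1 04:18:56 2012
print(m.group('path'))  # /pub/geobase/official/cded/250kdem/026/026a.zip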
@ottodude125 and @elm: I have the same issue and reported it as a separate ticket: #3757
How can I exclude the visits of more than 150 users on a site?
Replying to Cyril:
Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload just that line (or lines) rather than a big file. The skip value will help you find that line.
My web logs have additional fields logged. Some of these resolve/transfer over when using AWStats; others are excluded in AWStats with %other% values. I tried to exclude the additional field data by creating new but unused lines in the import IIS format section, but was not able to get past the error "'IisFormat' object has no attribute 'regex'". Forum/web searches bring this up as a common problem, but I haven't found a fix. Any suggestions? Sample log file inline.
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-02-23 00:00:01
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-02-23 00:00:01 192.168.1.202 GET /pages/AllItems.aspx - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 200 0 0 499
2013-02-23 00:00:01 192.168.1.202 GET /pages/logo.jpg - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 304 0 0 312
Piwik Log Analytics is now being used by hundreds of users and seems to be working well! We are always interested in new feature requests and suggestions. You can post them here, and if you are a developer, please consider opening a pull request.
Hi,
The log analytics script does not accept any time argument.
Is it therefore assumed that the log files to be processed have already been filtered to a timestamp range, in order to avoid duplicate processing?
Thanks.
Hi
I've been trying to import some logs from a tomcat/valve access log.
According to this: http://tomcat.apache.org/tomcat-5.5-doc/config/valve.html, my app's server.xml defines
<Valve className="org.apache.catalina.valves.AccessLogValve" directory="/sillage/logs/performances" pattern="%h %l %u %t %r %s %b %D Referer=[%{Referer}i]" prefix="access." resolveHosts="false" suffix=".log"/>
Here are a couple of lines from one of my access-datetime.log files:
10.10.40.85 - - [08/Apr/2013:11:02:49 +0200] POST /...t.do HTTP/1.1 200 39060 629 Referer=[http://.....jsp]
10.10.40.60 - - [08/Apr/2013:11:02:49 +0200] GET /...e&typ_appel=json HTTP/1.1 200 2895 2 Referer=[-]
10.10.40.85 - - [08/Apr/2013:11:02:48 +0200] POST /...r.jsp?cmd=tracer HTTP/1.1 200 90 63 Referer=[http://....jsp]
In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometimes unavoidable option is necessary. Having a simple table mapping the usual placeholders, such as
%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?
(guessed from reading the README examples...) would help. Maybe...
Replying to lyrrr:
In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometimes unavoidable option is necessary. Having a simple table mapping the usual placeholders, such as
%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?
(guessed from reading the README examples...) would help. Maybe...
If you're using --log-format-regex on the command line then I don't think the escaping is necessary. It's only if you're piping directly to piwik via (in my case) apache's ability to send logs to programmes that you need to work out how to do the multiple-escape thing.
I'll try tomorrow, but I'm skeptical: I copied the \ stuff from the README.md example.
I've just double-checked the README.md, and the only time I can see that weird escaping is in the bit I wrote called "Apache configuration source code". It's meant to be apache config, not CLI - apologies if that's not clear.
You may need to put a bit of escaping in depending on your shell, but nowhere near the amount that apache requires (since you've got to escape the initial parsing of the config file, then the shell escaping as it runs the command, and still be left with backslashes).
I think if you single quote it's mostly OK, i.e. with tcsh or bash
--log-format-regex='(?P<host>[\w...])'
would pass the regex in unscathed, or with my copy of ancient sh you just need one extra backslash, i.e.
--log-format-regex='(?P<host>[\\w...])'
etc.
HTH
Maybe we are missing a few examples in the doc of how to call the script. Would you mind sharing your examples, if you're reading this? We will add such help text to the README.
Okay, finally this worked:
python misc/log-analytics/import_logs.py --url=http://localhost/piwik log_analysis/access.2013-04-02.log --idsite=1 --log-format-regex='(?P<ip>\S+) (?P<host>\S+) (?P<user_agent>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<query_string>\S*) (?P<path>\S+) HTTP\/1\.1 (?P<status>\S+) (?P<length>\S+) (?P<time>.*?) (?P<referrer>.*?)'
This would be an interesting example for your doc, I guess.
I now have to play with piwik to ponder the relevance of the tool in my use case (analyzing clients' calls to a server managing schedules, client information, etc.; to get a better idea, a big picture of topics like network/database/CPU).
I guess I'm not being very clear, and I am twisting piwik outside its intended "web analysis" usage. Any suggestion on this topic is welcome.
Last technical thing for this post: my time field is in milliseconds, not seconds. How do I specify that?
Thanks for the help!
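On the milliseconds question, a hedged guess: the bundled nginx_json format feeds a generation_time_milli field, so if your version of import_logs.py also accepts that group name in custom regexes, renaming the <time> group may be all that is needed, e.g.:

python misc/log-analytics/import_logs.py --url=http://localhost/piwik log_analysis/access.2013-04-02.log --idsite=1 --log-format-regex='(?P<ip>\S+) (?P<host>\S+) (?P<user_agent>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<query_string>\S*) (?P<path>\S+) HTTP\/1\.1 (?P<status>\S+) (?P<length>\S+) (?P<generation_time_milli>\d+) (?P<referrer>.*?)'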
I have set this up on a varnish server that is logging through varnishncsa. However, the requests that varnish logs include the host name as the "request":
123.456.78.9 - - [23/Apr/2013:07:05:51 -0400] "GET http://asite.org/thing/471 HTTP/1.1" 200 13970 "http://www.google.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
When I import this with import_logs.py, piwik registers hits at http://asite.org/http://asite.org/thing/471, so I worked around it by using the --log-format-regex parameter:
--log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ https?://asite\.org(?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
It would be great if this (varnishncsa tracking through import_logs.py) were more directly supported and documented. I suspect my method isn't ideal for situations where more than one site is being cached with varnish, and visitors to those sites are being logged by piwik; this method probably only works with one domain.
Hi bangpound,
I'm not a piwik dev so I can't comment on including a varnishncsa in the import_logs.py itself, but if you change your regex slightly to replace
https?://asite\.org
with
(?P<host>https?://[^/]+)
then that will pick up the hostname of the site and therefore work well with multiple vhosts (either define them in piwik in advance, or use --add-sites-new-hosts to add them automatically).
Hope that helps.
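For reference, bangpound's full regex with that substitution applied would look something like this (untested; whether Piwik wants the scheme inside the host group is worth checking):

--log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<host>https?://[^/]+)(?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'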
Similar to cdgraff's request:
Feature Request: Support WMS (Windows Media Services) logs. Currently we use Awstats, but it would be great to be able to move to Piwik.
I have attached a sample of WMS version 9.0 log file: WMS_20130523.log
I've noticed that the local time for imported logs is not set correctly. Is this expected, or am I doing something wrong?
It seems as if Piwik is using the timezone of the web server that created the logs to set the local visitor time. I don't know if this is part of the importer or part of Piwik itself, but I would like to see the local visitor time derived from the timezone the visitor is actually in, based on their IP (GeoIP). It should be possible either by approximation from longitude and latitude or by using a database like GeoNames.
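A rough illustration of the longitude approximation idea (every 15 degrees of longitude is about one hour of UTC offset; purely a sketch, not anything Piwik does today):

from datetime import datetime, timedelta

def approximate_local_time(utc_time, longitude):
    # 15 degrees of longitude ~= 1 hour of UTC offset.
    offset_hours = int(round(longitude / 15.0))
    return utc_time + timedelta(hours=offset_hours)

# A hit logged at 12:00 UTC from a visitor near longitude -73 (New York):
print(approximate_local_time(datetime(2013, 5, 23, 12, 0), -73))
# -> 2013-05-23 07:00:00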
Hey Folks,
Thought I would inform this thread that I have been working on a batch-loading script for those of us who need some extra features, such as remembering how many lines of a log have been processed. The major use case is people running scripts through cron jobs on log files rotated monthly, who want to run the stats daily or more frequently than monthly.
You can check out the development branch of batch-loader.py for piwik here:
https://github.com/drsampson/piwik/tree/batch-loader/misc/log-analytics/batch-loader
I would love some testers and feedback. Read the readme here for an overview:
https://github.com/drsampson/piwik/blob/batch-loader/misc/log-analytics/batch-loader/readme.md
Developer notes:
This work is a branch of a forked version of piwik. My goal is to someday make a pull request to integrate it into piwik, so piwik developers are encouraged to comment so I can prepare.
dsampson: I've had a very quick look at your script. The core feature, which is keeping track of already imported log lines, should be done in Piwik itself, as detailed by Matt on this ticket. Using a local SQLite database is an inferior solution.
Your Python code could be better. A few suggestions:
Thanks for the feedback.
As for developing in Piwik: Python is the extent of this geographer's hacking skills. I thought that since this was not being done within Piwik, I would create a homebrew solution. Then I convinced myself to offer it back to the community for those who could use it.
Perhaps it will inspire someone to do it the right way within piwik, which would be awesome. Right now it keeps me out of the piwik internals, which is probably best for everyone (smile).
String formatting was a general tip to avoid multiple concatenations. Indeed, it should NOT be used for SQL requests with unfiltered input.
As for having a proper solution to your problem, you might try harassing Matt so that he implements it into Piwik :) Just kidding, but I would LOVE to have it!
Thanks for your submission of this tool, which enhances the log analytics use cases.
As for the particular "log line skip" feature: why in core? Because if several servers call Piwik, you are in trouble with a local SQLite database. Better to re-use the Piwik datastore to keep track of dupes :)
Here is my updated proposal implementation.
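To make the idea concrete, a minimal sketch of line-level dedup by hashing; the in-memory set stands in for a Piwik-side table, and note that identical log lines can legitimately repeat, so a real implementation would also want the file name/offset in the key:

import hashlib

def line_key(line):
    # Stable fingerprint for a raw log line.
    return hashlib.sha1(line.strip().encode('utf-8')).hexdigest()

class DedupFilter(object):
    # Skips lines whose fingerprint has been seen before. 'store' stands in
    # for Piwik's datastore (e.g. a MySQL table keyed on the hash).
    def __init__(self, store=None):
        self.store = store if store is not None else set()

    def is_new(self, line):
        key = line_key(line)
        if key in self.store:
            return False
        self.store.add(key)
        return True

dedup = DedupFilter()
for raw in ['GET /a', 'GET /b', 'GET /a']:
    print('%s -> %s' % (raw, dedup.is_new(raw)))  # the repeated line prints False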
Matt,
I agree with you that getting it into core would be best. Having this solution means I could possibly dissolve my forked project. Again if I was a PHP and MySQL developer I would love to help. As a geographer, scripting is done on the side to handle special use cases.
For clarification of the use case for this script: it is launched independently of piwik. By that I mean the script will likely reside on a log server somewhere, not the PIWIK server, and will likely be called through a cron job. Since only a single instance of the script runs on any server, you won't run into collisions with multiple servers using it. If you need multiple instances, each will have an independent SQLite DB. That is why I used SQLite: only one client accesses the database at any one time.
Let me know when these features are added to core and I will dissolve my fork.
Good luck.
Request for support of "X-Forwarded-For" in cases where a load balancer is placed in front of the web server when importing logs.
The Apache log format is as follows:
LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cplus
Sample Log:-
smartstore.oomph.co.id 10.159.117.216, 202.70.56.129 - - +0700 "GET /index.php/nav/get_menu/1/ HTTP/1.1" 200 2391 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
As you can see, there are two IPs for the remote host (this is the X-Forwarded-For field): the first IP is the virtual/local IP, and the second is the proxy used on the mobile network.
The regular expression used when importing the log is as follows:
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
This works for regular log lines where there is only one IP address.
My current workaround is to add an extra field to import_logs.py for the proxy and run the import again with a new regex:
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<proxy>\S+), (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
It would be nice if X-Forwarded-For were supported directly instead.
If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.
Correct me if I am wrong... I'm pretty new to piwik; I used awstats previously.
That would be possible if these were not older logs. We are talking about IMPORTING logs, not live tracking; it makes little sense to me to ask users to use mod_rpaf when their aim is to import older logs created before it was in place.
The aim of the import is older logs; current tracking can already be done by piwik itself.
Replying to Cyril:
If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.
Assuming you want the last IP in the list (and also that you trust the last IP in the list - this is why mod_rpaf is the best idea since you can prevent clients spoofing IPs):
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?:\S+?, )*(?P<ip>\S+) ...'
If you want to capture proxy information, I don't think piwik supports that, so you'd need to set up a separate site with an import regex that captures the first IP in the list instead.
I think the main point here is to IMPORT existing logs. For new logs, it can be implemented easily as it is all done in JavaScript.
As for "I don't get why that won't work with a custom regexp?": any idea what the regexp could be? Sorry, I am no expert at regex, which is why I ended up having to process the log twice and modify the Python script.
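For the regexp question, a quick standalone check (plain Python re, nothing Piwik-specific) showing that a non-capturing (?:\S+, )* prefix skips any leading X-Forwarded-For entries and leaves the last IP in <ip>:

import re

# Skip any comma-separated leading IPs; capture the last one as <ip>.
pattern = r'(?P<host>[\w\-\.]*) (?:\S+, )*(?P<ip>\S+) - -'

for line in [
    'smartstore.oomph.co.id 10.159.117.216, 202.70.56.129 - -',
    'smartstore.oomph.co.id 202.70.56.129 - -',
]:
    m = re.match(pattern, line)
    print(m.group('ip'))  # 202.70.56.129 both times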
Hi,
I'm testing the import, and ran the python script twice on the same log file.
It looks like the same log file was processed twice.
Does that mean I have to handle the log file history on my own?
In other words, can you confirm that the piwik log processor does not remember the starting and ending dates of the log files?
Thanks,
Axel
In other words, can you confirm that the piwik log processor does not remember the starting and ending dates of the log files?
Correct. We would like to add this feature at some point. If you can sponsor it, get in touch!
There was a patch submitted to keep track of imported files.
Hi,
my box won't properly process log entries passed to stdin of import_logs.py. When I read the exact same entries from a file, everything works great. I am using nginx_json formatted entries. I have tried dry-run mode and normal; each time I read from stdin I get the following output (nothing imported). Can anyone get this setup to work via stdin?
Thank you for your help!
Test data:
{"ip": "41.11.12.41","host": "www.mywebsite.com","path": "/","status": "200","referrer": "http://"www.mywebsite.com/previous","user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/32.0.1700.107 Chrome/32.0.1700.107 Safari/537.36","length": 3593,"generation_time_milli": 0.275,"date": "2014-03-12T22:41:23+01:00"}
Python script parameters:
--url=http://piwik.mywebsite.com
--idsite=1
--recorders=1
--enable-http-errors
--enable-reverse-dns
--enable-bots
--log-format-name=nginx_json
--output
2014-03-12 23:29:37,251: [DEBUG] Accepted hostnames: all
2014-03-12 23:29:37,252: [DEBUG] Piwik URL is: http://piwik.mywebsite.com
2014-03-12 23:29:37,252: [DEBUG] No token-auth specified
2014-03-12 23:29:37,252: [DEBUG] No credentials specified, reading them from "the config file"
2014-03-12 23:29:37,374: [DEBUG] Authentication token token_auth is: a really beautiful token :)
2014-03-12 23:29:37,375: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-03-12 23:29:37,532: [DEBUG] Launched recorder
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 requests imported successfully
0 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
0 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Total time: 10 seconds
Requests imported per second: 0.0 requests per second
Jadeham,
Try setting --recorder-max-payload-size=1. I remember having issues myself when testing with very small data sets (e.g. just one line).
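For reference, a full invocation along those lines might be (URL, site id and file name are placeholders):

python misc/log-analytics/import_logs.py --url=http://piwik.example.com --idsite=1 --recorders=1 --recorder-max-payload-size=1 --log-format-name=nginx_json - < one_line.log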
I have a similar problem to Jadeham.
I have configured nginx to log in JSON format and created the following script, which reads from access.log (in JSON format) and passes every line to stdin:
import sh
from sh import tail

# Build the import_logs.py command once; each .bake() call adds an argument.
run = sh.Command("/usr/bin/python")
run = run.bake("/var/www/piwik/misc/log-analytics/import_logs.py")
run = run.bake("--output=/home/XXX/piwik_live_importer/piwik.log")
run = run.bake("--url=http://X.X.X.X:8081/piwik/")
run = run.bake("--idsite=1")
run = run.bake("--recorders=1")
run = run.bake("--recorder-max-payload-size=1")
run = run.bake("--enable-http-errors")
run = run.bake("--enable-http-redirects")
run = run.bake("--enable-static")
run = run.bake("--enable-bots")
run = run.bake("--log-format-name=nginx_json")
run = run.bake("-")  # read the log line from stdin

# Follow the JSON access log and feed each new line to the importer's stdin.
for line in tail("-f", "/var/log/nginx/access_json.log", _iter=True):
    run(_in=line)
The problem I'm having is that every record seems to be saved, but if I go to the main panel, today's history is not shown. This is the output when saving each line:
Parsing log (stdin)...
Purging Piwik archives for dates: 2014-04-16
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.
Logs import summary
-------------------
1 requests imported successfully
2 requests were downloads
0 requests ignored:
0 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
1 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 0 seconds
Requests imported per second: 44.04 requests per second
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Besides that, when running archive.php it's slower than parsing the default nginx log format, and a lot of lines are marked as invalid:
Logs import summary
-------------------
94299 requests imported successfully
145340 requests were downloads
84140 requests ignored:
84140 invalid log lines
0 requests done by bots, search engines, ...
0 HTTP errors
0 HTTP redirects
0 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname
Website import summary
----------------------
94299 requests imported to 1 sites
1 sites already existed
0 sites were created:
0 distinct hostnames did not match any existing site:
Performance summary
-------------------
Total time: 1147 seconds
Requests imported per second: 82.21 requests per second
Is there any way to know why these records are not shown, and which records are being marked as invalid?
Ok, I figured out the cause of the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of Unicode characters.
To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.
Ok, I figured out the cause of the invalid requests: the user_agent contained a strange character. So maybe the script should be aware of Unicode characters.
Sure, please create a new ticket for this bug and attach a log file with 1 line that showcases the bug. Thanks
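As a sketch of the kind of tolerant decoding that would avoid this (this is not how import_logs.py reads files today; access.log and parse() are placeholders):

import io

def parse(line):
    # Stand-in for the real log-line parser.
    return line.rstrip('\n')

# Replace undecodable bytes instead of raising, so one odd user_agent
# byte sequence does not invalidate the whole line.
with io.open('access.log', encoding='utf-8', errors='replace') as f:
    for line in f:
        parse(line)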
Replying to Hexxer:
Hi,
.............
Do you know the exact line that causes a problem? If you put only this line, does it also fail directly? Thanks!
.............No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.
Replying to matt:
I suppose we can do some basic tests to see which value works best? Maybe 50 or 100 tracking requests at once? :)
Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)
Wow. 23 months have passed, and still no solution to this problem???
I'm getting the same error, and there's no docco anywhere to tell me how to fix it:
The URL is correct (I copy and paste it into my browser, and it gives me the Piwik login screen), and the Apache error logs show nothing from today. Here's my console output:
$ ./import_logs.py --url=https://www.mysite.com/pathto/piwik/ /var/log/apache/access.log --debug
2014-04-28 00:10:29,205: [DEBUG] Accepted hostnames: all
2014-04-28 00:10:29,205: [DEBUG] Piwik URL is: http://www.mysite.com/piwik/
2014-04-28 00:10:29,205: [DEBUG] No token-auth specified
2014-04-28 00:10:29,205: [DEBUG] No credentials specified, reading them from ".../config/config.ini.php"
2014-04-28 00:10:29,347: [DEBUG] Authentication token token_auth is: REDACTED
2014-04-28 00:10:29,347: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:29,349: [DEBUG] Launched recorder
Parsing log [...]/log/apache/access.log...
2014-04-28 00:10:29,350: [DEBUG] Detecting the log format
2014-04-28 00:10:29,350: [DEBUG] Check format icecast2
2014-04-28 00:10:29,350: [DEBUG] Format icecast2 does not match
2014-04-28 00:10:29,350: [DEBUG] Check format iis
2014-04-28 00:10:29,350: [DEBUG] Format iis does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common
2014-04-28 00:10:29,351: [DEBUG] Format common does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common_vhost
2014-04-28 00:10:29,351: [DEBUG] Format common_vhost matches
2014-04-28 00:10:29,351: [DEBUG] Check format nginx_json
2014-04-28 00:10:29,351: [DEBUG] Format nginx_json does not match
2014-04-28 00:10:29,351: [DEBUG] Check format s3
2014-04-28 00:10:29,352: [DEBUG] Format s3 does not match
2014-04-28 00:10:29,352: [DEBUG] Check format ncsa_extended
2014-04-28 00:10:29,352: [DEBUG] Format ncsa_extended does not match
2014-04-28 00:10:29,352: [DEBUG] Check format common_complete
2014-04-28 00:10:29,352: [DEBUG] Format common_complete does not match
2014-04-28 00:10:29,352: [DEBUG] Format common_vhost is the best match
2014-04-28 00:10:29,424: [DEBUG] Site ID for hostname www.mysite.com not in cache
2014-04-28 00:10:29,563: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:31,612: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:33,657: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
Fatal error: Forbidden
You can restart the import of "[...]/var/log/apache/access.log" from the point it failed by specifying --skip=5 on the command line.
And of course, trying with --skip=5 produces the same error.
I have googled, I have searched the archives, the bug tracker contains no clue. Would really appreciate some kind soul taking mercy on me here.
Piwik: HTTP Error 403: Forbidden
Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).
Replying to matt:
Piwik: HTTP Error 403: Forbidden
Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).
Apache error log shows only a restart once every hour. I am unable to configure Apache directly, as I am running Piwik on Gandi.net's "Simple Hosting" service. I have repeatedly begged gandi support to look into this matter, but their attitude is (and not unreasonably) that their job is not to support user installation issues like this. If you can give me ammunition that shows it really is Gandi's fault, then maybe we can move forward here.
Or maybe it's just a Piwik bug. Or I'm doing something wrong. I don't know.
@foobard I suggest you create a new ticket for your particular issue, and we will try to help you troubleshoot it (maybe we need access to the server to reproduce and investigate). Cheers!
Please do not comment on this ticket anymore. Instead, create a new ticket and assign it to the component "Log Analytics (import_logs.py)".
Here is the list of all tickets related to Log Analytics improvements: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)
Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues
refs #7163