import_logs.py and IIS/w3c date format #6968

Closed
kevinjc opened this issue Jan 9, 2015 · 8 comments

@kevinjc

kevinjc commented Jan 9, 2015

I am using import_logs.py and feeding it IIS W3C logs via STDIN from my syslog-ng collector. syslog-ng is configured to send only the message to the wrapper script.
The wrapper script passes the Python script a regex to match the format, since the logs do not contain every field, only the important ones. I think the w3cextended class in the Python script looks for the header fields by seeking in the file, so STDIN would probably have to use a regex anyway.
However, the problem is that the Python script does not seem to recognize the date as one of its known date formats. Here is my regex pattern and the debug output:

--log-format-regex='(?P^\d+[-\d]+\s\d+[:\d+]+) (\S+) (?P.?) (?P<query_string>\S) (?P\S+) (?P[\d_.]) (?P<user_agent>.?) (?P._?) ((?P[\w-.]*)(?::\d+)?) (?P\d+) (?P\S+) (?P<generation_time_secs>\d+)' \

2015-01-09 08:19:09,286: [DEBUG] Invalid line detected (invalid date): 2015-01-07 19:47:50 GET /tenbanana/rifd/_scripts/showGrid.js _=1420660073409 - 192.168.86.240 Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 https://mytestserv2.local/oranges/rifd/?fuseaction=planData.incView&selectedInc=10&nav_type=E&nav_link=manf_d mytestserv1.local 200 8602 14

@diosmosis
Member

The regex fails because it does not use named groups. The script doesn't know which group is the 'date' group (or any of the other required groups).
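To illustrate the named-group point, here is a minimal sketch using Python's standard `re` module. It is not the importer's code: the pattern covers only the first few fields of the sample line, and the group names `method` and `path` are hypothetical (only `date` and `query_string` appear in this thread).

```python
import re
from datetime import datetime

# Illustrative pattern only; it covers the first four fields of the sample line.
# "date" and "query_string" are names used in this thread; "method" and "path"
# are hypothetical names chosen for the example.
LINE_RE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'(?P<method>\S+) '
    r'(?P<path>\S+) '
    r'(?P<query_string>\S+)'
)


def parse_line(line):
    """Return the named fields of a log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Because the group is named, the date can be looked up by name and validated.
    fields['date'] = datetime.strptime(fields['date'], '%Y-%m-%d %H:%M:%S')
    return fields


sample = ('2015-01-07 19:47:50 GET /tenbanana/rifd/_scripts/showGrid.js '
          '_=1420660073409 - 192.168.86.240')
parsed = parse_line(sample)
print(parsed['date'], parsed['method'], parsed['path'])
```

With unnamed groups such as `(\S+)`, the same match would succeed, but a consumer would have no way to tell which captured value is the date.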

I'll run some tests to see if the script will work when supplying logs via stdin. If not, I'll see if I can get it to work.

@diosmosis diosmosis self-assigned this Jan 10, 2015
@mattab mattab added this to the Piwik 2.11.0 milestone Jan 12, 2015
@mattab mattab added the Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change. label Jan 12, 2015
@diosmosis
Member

@kevinjc I modified the importer so logs in the W3C extended log file format can be imported from stdin. To do this, run the script with the --log-format-name=w3c_extended option, e.g.:

cat myw3clogs.log | ./misc/log-analytics/import_logs.py --url=http://localhost/ --log-format-name=w3c_extended -

@kevinjc
Author

kevinjc commented Jan 12, 2015

Copied the latest version and ran a quick test, which failed with the following error:
Parsing log (stdin)...
Traceback (most recent call last):
  File "/tools/piwik/scripts/import_logs.py.new", line 1916, in <module>
    main()
  File "/tools/piwik/scripts/import_logs.py.new", line 1887, in main
    parser.parse(filename)
  File "/tools/piwik/scripts/import_logs.py.new", line 1719, in parse
    match = format.match(line)
  File "/tools/piwik/scripts/import_logs.py.new", line 178, in match
    self.matched = self.regex.match(line)
AttributeError: 'NoneType' object has no attribute 'match'

Using the following configuration in the wrapper script:
exec python /tools/piwik/scripts/import_logs.py.new
-snipped-
-dd \

--log-format-name=w3c_extended --w3c-time-taken-millisecs \

@diosmosis
Member

@kevinjc I recognize the error and it shouldn't occur w/ the code in master... Can you provide an example log file w/ one or two log lines (please include the #Fields: line)?

@kevinjc
Author

kevinjc commented Jan 12, 2015

I cannot count on the #Fields line to be present when it runs since it is a constant stream from the syslog-ng collector. This is one of the reasons why I was thinking I needed to pursue the regex option. I'm happy not to (use regex) if that is possible!

Note: I have turned off a few of the fields but I can re-enable them if it is necessary:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2015-01-12 04:30:32
#Fields: date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken
2015-01-12 04:30:32 GET /bananas otherfruits+vegetables - 192.168.86.240 Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 https://test2.testserver.local/INTRO/? my.testserver.local 200 4522 968

@kevinjc
Author

kevinjc commented Jan 12, 2015

I now have a snippet of logs with the commented lines to use for testing. It gets closer, but now I am getting a 500 error on import from my Apache server running Piwik. Is this because it has not been updated to 2.10.0? Is this version required? I thought import_logs.py was a little more independent of the version on the server, but that may just have been an assumption on my part.
The error on the server is:
[Mon Jan 12 15:02:03 2015] [error] [client 192.168.99.30] PHP Fatal error: Class 'Piwik\DataTable\Renderer\Json2' not found in /application/piwik/core/DataTable/Renderer.php on line 203

@diosmosis
Member

> I cannot count on the #Fields line to be present when it runs since it is a constant stream from the syslog-ng collector. This is one of the reasons why I was thinking I needed to pursue the regex option. I'm happy not to (use regex) if that is possible!

The #Fields line is necessary in order to build the regex used to parse log lines... You could get away w/ supplying a regex, but this approach is more error prone; small mistakes in the regex can cause problems that are hard to diagnose. I think I can create a middle ground, however. I'll add a new option --w3c-fields so you can specify the fields format in the log importer command.
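As a rough illustration of why the '#Fields:' line matters, the sketch below builds a regex with named groups from such a line and applies it to the sample log line posted earlier in this thread. It is an assumption-laden sketch, not the importer's implementation: the `FIELD_TO_GROUP` mapping and the `regex_from_fields` helper are hypothetical, and the real importer's field table and patterns may differ.

```python
import re

# Hypothetical mapping from W3C field names to regex group names.
# The real importer's table may differ.
FIELD_TO_GROUP = {
    'date': 'date',
    'time': 'time',
    'cs-uri-stem': 'path',
    'cs-uri-query': 'query_string',
    'c-ip': 'ip',
    'cs(User-Agent)': 'user_agent',
    'cs(Referer)': 'referrer',
    'cs-host': 'host',
    'sc-status': 'status',
    'sc-bytes': 'length',
    'time-taken': 'generation_time',
}


def regex_from_fields(fields_line):
    """Build a compiled regex from a '#Fields: ...' header line."""
    names = fields_line.replace('#Fields:', '', 1).split()
    parts = []
    for name in names:
        group = FIELD_TO_GROUP.get(name)
        if group is None:
            # Unmapped field: consume the token without capturing it.
            parts.append(r'\S+')
        else:
            parts.append(r'(?P<%s>\S+)' % group)
    # Fields in these logs are separated by whitespace.
    return re.compile(r'\s+'.join(parts) + r'\s*$')


fields = ('#Fields: date time cs-method cs-uri-stem cs-uri-query cs-username '
          'c-ip cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken')
line = ('2015-01-12 04:30:32 GET /bananas otherfruits+vegetables - 192.168.86.240 '
        'Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 '
        'https://test2.testserver.local/INTRO/? my.testserver.local 200 4522 968')

match = regex_from_fields(fields).match(line)
print(match.groupdict())
```

Without the field list, the position of each value in the line is ambiguous, which is why the list has to come from either the log header or the command line.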

> Is this because it has not updated to 2.10.0? Is this version required? I thought the import_logs.py was a little more independent of the version on the server, but that just may have been an assumption on my part.

That is the intent. However, changes made to Piwik's reporting and tracking APIs (both of which the log importer depends on) can create incompatibilities between log importer and Piwik versions. Updating to the newest available version is of course recommended, but you may be able to work around this specific error by changing line 1028 to 'format' : 'json', though this is not guaranteed: the bugs this change fixed in Piwik core may cause other problems for the newest importer.

diosmosis pushed a commit that referenced this issue Jan 13, 2015
…3C extended log file format can be imported from stdin w/o a '#Fields:' line being present.
@diosmosis
Member

@kevinjc You should be able to import logs w/o the '#Fields:' line w/ the following options:

--log-format-name=w3c_extended --w3c-time-taken-millisecs --w3c-fields='#Fields: date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken'

flodrwho pushed a commit to flodrwho/piwik that referenced this issue Jan 15, 2015
…files in W3C extended log file format can be imported from stdin w/o a '#Fields:' line being present.