import_logs.py ignores lines after a line with http 200 status is processed #5161

Closed
vspiliop opened this issue May 13, 2014 · 12 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc. duplicate For issues that already existed in our issue tracker and were reported previously.

Comments

@vspiliop

Hello to all!

I am using Piwik for a customer and just found the following very serious issue.

I am using the latest piwik (2.2.2), php 5.4.26 and Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) v.1500 32 bit (Intel) on win32.

PROBLEM:

All lines in the web log after a line with HTTP status 200 are ignored! In the following example, only the first entry is included in both Visits and Actions. This happens both before and after archiving, so archiving is irrelevant.

I just import (access.log : file with just 2 lines):

66.249.76.11 - - +0100 "GET /id/resource/013541589 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.11 - - +0100 "GET /doc/resource/007667232 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

via command:

python import_logs.py --url=http://localhost:83/analytics/ access.log --idsite=1 --recorders=2 --enable-http-errors --enable-http-redirects --enable-static --enable-bots

Result:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log access_006_bl.services.tso.co.uk.2014.05.12.log...
Purging Piwik archives for dates: 2014-05-11
To re-process these reports with your new update data, execute the following command:

piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/

Logs import summary

2 requests imported successfully
0 requests were downloads
0 requests ignored:

    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

2 requests imported to 1 sites

    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 3.29 requests per second

Kind Regards,
Vassilis

@vspiliop
Author

Hi again,

After some investigation I found the following:

The only difference between the two HTTP calls to the Piwik server (each one corresponds to one of the lines in the file) is:

'action_name': u'303/URL=http%3A%2F%2Fwww.british-library.co.uk%2Fid%2Fresource%2F013541591'

Specifically, 'action_name' is only included if the hit (i.e. the line in the file) is an error or a redirect. If it is not included (i.e. HTTP 200 status), the line is ignored completely by the Piwik server.

Is this behavior a bug? Am I missing something? If I alter the import_logs.py script to always include 'action_name' for all lines, then all lines are included in the results of the Piwik server. But I am not sure whether doing so could cause other issues.

Looking forward to a reply from a programmer more familiar with Piwik. :-)

Thanks in advance!

In detail:

{
    'apiv': '1',
    'cvar': u'{"1": ["HTTP-code", "303"]}',
    'action_name': u'303/URL=http%3A%2F%2Fwww.british-library.co.uk%2Fid%2Fresource%2F013541591',
    'cdt': '2014-05-11 22:59:59',
    'urlref': '',
    'dp': '1',
    'url': 'http://www.british-library.co.uk/id/resource/013541591',
    'cip': u'66.249.76.11',
    'idsite': '1',
    '_cvar': u'{"1": ["Bot", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html"]}',
    'rec': '1',
    'bots': '1',
    'ua': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html'
},
{
    'apiv': '1',
    'cvar': u'{"1": ["HTTP-code", "200"]}',
    'cdt': '2014-05-11 05:18:14',
    'urlref': '',
    'dp': '1',
    'url': 'http://www.british-library.co.uk/doc/resource/GBB138900',
    'cip': u'88.198.56.239',
    'idsite': '1',
    '_cvar': u'{"1": ["Not-Bot", "Mozilla/5.0 (compatible; Windows; U; Windows NT 6.2; WOW64; en-US; rv:12.0) Gecko/20120403211507 Firefox/12.0"]}',
    'rec': '1',
    'bots': '1',
    'ua': 'Mozilla/5.0 (compatible; Windows; U; Windows NT 6.2; WOW64; en-US; rv:12.0) Gecko/20120403211507 Firefox/12.0'
}
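
The difference between the two payloads above can be sketched roughly as follows. This is only an illustration in Python 3, not the code in import_logs.py: the function name `build_tracking_args`, the shape of the `hit` dict, the `always_send_action_name` flag, and the rule "status starts with 3, 4 or 5" are all assumptions made for the example. The `action_name` value for a 200 line is also assumed, since it is not shown in the output above.

```python
import json
from urllib.parse import quote


def build_tracking_args(hit, always_send_action_name=False):
    """Build tracking parameters for one parsed log line.

    `hit` is a hypothetical dict with 'url', 'status', 'ip' and
    'user_agent' keys. Illustration only, not import_logs.py itself.
    """
    args = {
        'rec': '1',
        'apiv': '1',
        'idsite': '1',
        'url': hit['url'],
        'cip': hit['ip'],
        'ua': hit['user_agent'],
        'cvar': json.dumps({'1': ['HTTP-code', hit['status']]}),
    }
    # Assumption: redirects and errors are the 3xx, 4xx and 5xx statuses.
    is_error_or_redirect = hit['status'][0] in ('3', '4', '5')
    if is_error_or_redirect or always_send_action_name:
        # Same "<status>/URL=<percent-encoded url>" shape as in the dump above.
        args['action_name'] = '%s/URL=%s' % (
            hit['status'], quote(hit['url'], safe=''))
    return args


redirect_hit = {
    'url': 'http://www.british-library.co.uk/id/resource/013541591',
    'status': '303',
    'ip': '66.249.76.11',
    'user_agent': 'example-agent',
}
ok_hit = dict(redirect_hit, status='200')

print('action_name' in build_tracking_args(redirect_hit))  # True
print('action_name' in build_tracking_args(ok_hit))        # False
print('action_name' in build_tracking_args(ok_hit, always_send_action_name=True))  # True
```

With the flag off, the 200 hit carries no 'action_name', which matches the second payload above. Turning the flag on mimics the workaround of always sending it.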

@mattab
Member

mattab commented May 14, 2014

I just import (access.log : file with just 2 lines)

What happens is that these two lines are for the same pageview, in the same second. By design Piwik tracks a given pageview only once per second. So could you try to set the other pageview 1 or 2 seconds later and try again?
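
To check whether two log lines fall within the same second, a small helper along these lines may be useful. It is a minimal Python 3 sketch, not part of import_logs.py: the names `parse_timestamp` and `seconds_apart` are made up for the example, the regex assumes an Apache-style bracketed timestamp such as [14/May/2014:23:59:50 +0100], and the sample lines are shortened illustrations. Parsing the month name with `%b` assumes an English (or C) locale.

```python
import re
from datetime import datetime

# Bracketed timestamp of an Apache-style access log line.
TIMESTAMP_RE = re.compile(r'\[([^\]]+)\]')


def parse_timestamp(line):
    """Return the timezone-aware datetime found in a log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError('no bracketed timestamp found in line')
    return datetime.strptime(match.group(1), '%d/%b/%Y:%H:%M:%S %z')


def seconds_apart(first_line, second_line):
    """Return the absolute time difference between two log lines, in seconds."""
    delta = parse_timestamp(second_line) - parse_timestamp(first_line)
    return abs(delta.total_seconds())


line_a = '66.249.76.11 - - [14/May/2014:23:59:50 +0100] "GET /a HTTP/1.1" 303 - "-" "bot"'
line_b = '66.249.76.11 - - [14/May/2014:23:59:59 +0100] "GET /b HTTP/1.1" 200 - "-" "bot"'

print(seconds_apart(line_a, line_b))  # 9.0
```

A result of 0.0 means both lines carry the same timestamp. A line without a bracketed timestamp raises ValueError instead of being silently skipped.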

@vspiliop
Author

Hi Matt,

Tried (clean Piwik installation) with the following lines:

66.249.76.11 - - +0100 "GET /id/resource/013541581 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.11 - - +0100 "GET /doc/resource/007667231 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

And still the result is the same. Also, the requests are for different resources in both cases, so I assumed that they should not be counted as a single page view even if the timestamp was exactly the same.

Any ideas?

Thanks,
Vassilis

@mattab
Member

mattab commented May 15, 2014

Attachment:
two requests imported.png

@mattab
Member

mattab commented May 15, 2014

I created test-log.log as follows:

66.249.76.11 - - [14/May/2014:23:59:50 +0100] "GET /id/resource/013541589 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.11 - - [14/May/2014:23:59:59 +0100] "GET /doc/resource/007667232 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Then imported it with:
$ ./misc/log-analytics/import_logs.py --url=localhost/piwik-master test-log.log --idsite=1 --enable-bots --enable-http-redirects

Then I see both requests in the visitor log, see screenshot: http://issues.piwik.org/attachments/5161/two%20requests%20imported.png

So please upgrade to the latest beta (http://piwik.org/faq/how-to-update/faq_159/) and let me know if you still have the problem.

@vspiliop
Author

Hi Matt,

I followed the instructions at http://piwik.org/faq/how-to-update/faq_159/.

I selected "The latest beta release" under "When checking for new version of Piwik, always get:", but when I press "Check for updates" I get "You are using the latest piwik version: 2.2.2".

Can you please help?

Is there a direct link to the latest Piwik beta?

Thanks again!
Vassilis

@vspiliop
Author

I found this one: http://builds.piwik.org/piwik-2.2.3-b4.zip, which seems to be the latest. I will try it and come back to you.

@vspiliop
Author

OK, it works fine with 2.2.3-b4!

Thanks for the help Matt!

When do we expect the stable 2.2.3?

Do you think I could go live with 2.2.3-b4?

I think the best solution is a patch, as the bug is quite serious. Is there a patch I could apply to 2.2.2?

Regards,
Vassilis

@vspiliop
Author

Attachment: two_lines_import
piwik_problem.jpg

@vspiliop
Author

Hi again,

Sorry for the wrong feedback.

It still does not work with 2.2.3-b4. After a fresh install I imported the two lines, and only the 303 is shown.

Is the latest beta 2.2.3-b4?

Thanks,
Vassilis

@vspiliop vspiliop added this to the 2.5.0 - Piwik 2.5.0 milestone Jul 8, 2014
@mattab mattab removed the P: normal label Aug 3, 2014
@mattab
Member

mattab commented Dec 19, 2014

Hi @vspiliop, do you still experience the issue with 2.9.1? Please let us know, thanks.

@mattab
Member

mattab commented Mar 12, 2015

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163

@mattab mattab closed this as completed Mar 12, 2015
@mattab mattab added the duplicate For issues that already existed in our issue tracker and were reported previously. label Mar 12, 2015