Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behavior during import apache logs #10134

Closed
type0lang opened this issue May 9, 2016 · 19 comments
Closed

Strange behavior during import apache logs #10134

type0lang opened this issue May 9, 2016 · 19 comments
Labels
answered For when a question was asked and we referred to forum or answered it. Bug For errors / faults / flaws / inconsistencies etc. Regression Indicates a feature used to work in a certain way but it no longer does even though it should.

Comments

@type0lang
Copy link

type0lang commented May 9, 2016

Hi
I am using the Piwik (2.16) tracking API with a script to import the actions of historical data. The strange behavior can be summarized in this image:
Two actions separated in two visits even if 1 minute has passed
image_1
Same problem different view.
image_2

A second experiment (after purging the db) with same actions, aggregated into different views...
image_3

Is there something wrong with the aggregation engine ? What is the aggregation logic related to action-> visit ? Are there some parameter to tune this logic ?

@type0lang
Copy link
Author

type0lang commented May 10, 2016

Update. It seems that this behavior happens when log data are unsorted.
Do you know a way to mitigate this behavior (apart from sorting the data)?

@tsteur
Copy link
Member

tsteur commented May 10, 2016

@type0lang is it possible that you send us an example log file with a few lines so we can reproduce?

@type0lang
Copy link
Author

The file access.txt contains access logs (in apache format) that are imported into piwik.
The file access_json.txt contains the same logs json-ized to be put into the paiload of the piwik api http post.

The file sort_apace_py.txt is an utility python script that can be used to sort the logs; it is useful if you want to test piwik behaviour when receiving sorted logs.

access.txt
access_json.txt
sort_apache_py.txt

@tsteur
Copy link
Member

tsteur commented May 12, 2016

I tried to reproduce and it seems to work fine for me I think. Output is eg as follows:

image

Are you sending a token_auth along the request?

@type0lang
Copy link
Author

Yes I am. How can I make a more accurate test in order to collect more data for debugging such behaviour ?

@type0lang
Copy link
Author

@tsteur I am watching your output and actually it still seems not the correct behavior. With the logs I provided I expect to have 1 visit for all the day (until midnight) and not 3 separate visits... isn't it ? The last action of the first visit should be 30 mins distant to the first action of the second visit.

@type0lang
Copy link
Author

Hi here's an example of a "sequentialized logs" import
ordered_logs

Moreover everytime I import the logs I have got a different graph (visits->actions) on piwik dashboard

@type0lang
Copy link
Author

any news ?

@tsteur
Copy link
Member

tsteur commented May 26, 2016

Not so far as we're quite busy. There might be multiple visits as I tried it several times. I will label it as a bug

@tsteur tsteur added the Bug For errors / faults / flaws / inconsistencies etc. label May 26, 2016
@mattab
Copy link
Member

mattab commented May 27, 2016

when log analytics import script detects that the log is unsorted (maybe when detecting a certain % threshold of unsorted lines), we could maybe display a warning or even error, to clarify to users as early as possible that log files must be sorted?

@tsteur
Copy link
Member

tsteur commented May 27, 2016

If they have to be sorted, we could maybe look into the API and debug to find out why it is not working and maybe there's a chance to fix it (if there's a chance to fix it). Tracking requests not coming in the right order is also a problem for javascript tracking etc when one request is faster than another. Alternatively bulk request could sort requests by date maybe, + on top log analytics would need to do this as well as it is sending many bulk requests

@type0lang
Copy link
Author

I add some considerations:

  1. It may happen that even if the data are sorted, the aggregation (visits->actions) may fail (check my last upload). Probably due to high throughputh ?
  2. IMHO piwik should take into account the fact that tracking request may arrive in non sorted way (js tracking, or bulk insert).

@FewKinG
Copy link

FewKinG commented Jun 8, 2016

I am dealing with the same problem right now. I am tracking app users via JavaScript and using an intermediate service that collects the users tracking records, queues them up and sends them in bulk to the Piwik API.
Even though I already took precautions to try to keep records in order and limited the queue worker to just 1 (so no race conditions should occur) it still seems to fail quite often at times, resulting in a lot of new visits where there shouldn't be any:

image

... and thereby in wild changes in the relation between unique visitors and total number of visits with the visits count being much too high on some days:

image

We started noticing this behavior after April 22nd this year, though I can't tell anymore if we updated piwik on that day (may well be the case, so maybe this relates to a change being introduced in Piwik shortly before that).
Sending the records in bulk seems not to be the culprit as this is a change we made later.

I am also correctly sending token_auth and record timestamp (cdt) and couldn't find anything else wrong with the records itself. And as seen in the graph, the system used to work perfectly before. This is quite annoying and for optimal ease of use the order in which the records are sent to the tracking API shouldn't matter when the timestamp is set, which would be the expected behavior anyway.

@mattab
Copy link
Member

mattab commented Jul 7, 2016

Even though I already took precautions to try to keep records in order and limited the queue worker to just 1 (so no race conditions should occur) it still seems to fail quite often at times, resulting in a lot of new visits where there shouldn't be any:

@FewKinG Could you please attach a sample log file and the command you use, that we can use to reproduce the issue? if there is a data/log file sample that lets us reproduce the issue, it will be possible for us to investigate and solve the issue.

@mattab mattab added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Jul 14, 2016
@mattab mattab added this to the 2.16.x (LTS) milestone Jul 14, 2016
@mattab mattab added Regression Indicates a feature used to work in a certain way but it no longer does even though it should. and removed Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. labels Jul 14, 2016
@FewKinG
Copy link

FewKinG commented Jul 20, 2016

I'm not doing log file imports, this behavior shows during normal operation. I get tracking events from the clients, collect them in a queue until I have a certain amount of events stored, then send them to piwik in a bulk request. The situation is no different when I'm not doing bulk requests but send the tracking events individually.

Our tracking system cannot and should not have to always reliably ensure that events are processed and sent to piwik in order. But even if I try to enforce it (only handle the next event, when the previous HTTP request to piwik is finished which is usually not practical in normal operation), the data in piwik shows above weirdness even though it may be less extreme.

Because our client applications don't immediately send their tracking events to our backend but collect data in batches and transmit when there are a certain amount of events collected and internet connection is available, we include a timestamp in the tracking API request (cdt) to specifically set the event to be recorded at the correct time.

I don't think the issue is easily reproducible. I did not find any relation between the data being tracked and false visits being recorded, if any there is a correlation with the amount of events being tracked per time (the more events tracked at once/in parallel, the more likely false visits occur). Maybe @type0lang can give some more hints or sample data.
Here are some sample parameters included in our tracking requests (will be transformed to a URL parameter string when being sent to piwik). I replaced some information with 'X' due to privacy reasons. The third sample shows an event that should explicitly trigger a new visit.

rec=1, idsite=10, url=http://ti.app/event/st-pauli-nachtmarkt-4, _id=X, token_auth=X, name=Event/st-pauli-nachtmarkt-4, cvar=undefined, dimension1=Android, dimension2=5.1.1, dimension3=HUAWEI P7-L10, dimension5=360x640@2, dimension4=1.2.1, cdt=2016-07-20 09:54:00, cip=X, ua=Mozilla/5.0 (Linux; Android 5.1.1; HUAWEI P7-L10 Build/HuaweiP7-L10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Crosswalk/16.45.421.19 Mobile Safari/537.36, lang=de-de

rec=1, idsite=10, url=http://ti.app/, _id=X, token_auth=X, name=Events/Event/Like, e_c=Events, e_a=Event/Like, e_n=s/preview-star-trek-beyond-3dov, e_v=undefined, cvar=undefined, cdt=2016-07-20 09:54:02, cip=X, ua=Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 (5358564944), lang=de-de
rec=1, idsite=10, url=http://ti.app/, _id=X, token_auth=X, new_visit=1, name=system/started, e_c=system, e_a=started, e_n=undefined, e_v=undefined, cvar=undefined, dimension1=Android, dimension2=5.1.1, dimension3=HUAWEI P7-L10, dimension5=360x640@2, dimension4=1.2.1, cdt=2016-07-20 09:53:45, cip=X, ua=Mozilla/5.0 (Linux; Android 5.1.1; HUAWEI P7-L10 Build/HuaweiP7-L10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Crosswalk/16.45.421.19 Mobile Safari/537.36, lang=de-de

@FewKinG
Copy link

FewKinG commented Oct 14, 2016

Had a quick look at the piwik code. Seems that on a tracking request, piwik tries to find an existing visit inside a certain time window to which to append the new request onto. If it doesn't find one, or finds one that is too long ago, it will count the request towards a new visit. Makes sense generally.

But consider I want to track some requests belonging to a new visit with the first of those requests actually forcing new_visit=1. And also consider these requests to be out of order, so the second requests arrives first. Piwik now has no way to find an existing visit to append the new request onto and thus will create a new visit. It can't know in advance that I'm later going to supply a request that actually still happened before that. The visitor lookup using a time window considers this by also looking into the future BUT it only does that if new_visit is not forced. Otherwise it will count the request arriving second as another new visit.

This is only one theory I have which might explain all or at least some of the behavior I was seeing. And this scenario of course only applies because I regularly manually force new_visit. Don't know if this also is the case for you @type0lang ?!
Also I don't yet know what to do about this behavior, I don't like the idea of somehow ensuring the requests always come in ordered.

@type0lang
Copy link
Author

I have attached my data in the previous post. But you can reproduce my case by just having a log with same urls http://mywebsite.com/foo/bar and date of visits each second.
Then you can send those log to piwik with an high throughput.. IMHO that's the point: high throughput or high parallelism. It happens that it will just doesn't put the new actions in the same visit even if they are 1 second after.

@mattab
Copy link
Member

mattab commented Nov 3, 2016

And also consider these requests to be out of order, so the second requests arrives first.

fyi: Piwik assumes visits/tracking requests are sent in chronological order. if this assumption is not enforced, the data will be incorrect in Piwik.

@mattab
Copy link
Member

mattab commented Jun 19, 2017

It seems the problem is caused by data imported to Piwik in the wrong chronological order, which by design we cannot (at this time) deal with correctly. Solution should be to ensure all data comes to piwik in the right chronological order.

@mattab mattab closed this as completed Jun 19, 2017
@mattab mattab added the answered For when a question was asked and we referred to forum or answered it. label Jun 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
answered For when a question was asked and we referred to forum or answered it. Bug For errors / faults / flaws / inconsistencies etc. Regression Indicates a feature used to work in a certain way but it no longer does even though it should.
Projects
None yet
Development

No branches or pull requests

4 participants