Strange behavior during import apache logs #10134
Comments
Update. It seems that this behavior happens when log data are unsorted. |
@type0lang could you send us an example log file with a few lines so we can reproduce this? |
The file The file |
Yes, I am. How can I run a more accurate test to collect more data for debugging this behaviour? |
@tsteur I looked at your output, and it still does not seem to be the correct behavior. With the logs I provided, I would expect 1 visit for the whole day (until midnight), not 3 separate visits, right? For a second visit to be created, the last action of the first visit should be 30 minutes apart from the first action of the second visit. |
Any news? |
Not so far, as we're quite busy. There might be multiple visits because I tried it several times. I will label it as a bug. |
When the log analytics import script detects that the log is unsorted (for example, when the share of out-of-order lines exceeds a certain percentage threshold), we could display a warning or even an error, to make it clear to users as early as possible that log files must be sorted. |
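The detection idea in this comment could be sketched as follows. This is a minimal illustration, not code from the import script: the function name `unsorted_ratio`, the regular expression, and the assumption that timestamps use the Apache common/combined format (`[10/Oct/2016:13:55:36 +0200]`) are all hypothetical choices made for the example.

```python
import re
from datetime import datetime

# Assumed timestamp layout: Apache common/combined log format.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def unsorted_ratio(lines):
    """Return the fraction of timestamped lines that are older than
    the latest timestamp seen so far (hypothetical helper)."""
    latest = None
    total = 0
    out_of_order = 0
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        total += 1
        if latest is not None and ts < latest:
            out_of_order += 1
        else:
            latest = ts
    return out_of_order / total if total else 0.0


sample = [
    '1.2.3.4 - - [10/Oct/2016:13:00:00 +0000] "GET /foo/bar HTTP/1.1" 200 10',
    '1.2.3.4 - - [10/Oct/2016:12:59:00 +0000] "GET /foo/bar HTTP/1.1" 200 10',
    '1.2.3.4 - - [10/Oct/2016:13:01:00 +0000] "GET /foo/bar HTTP/1.1" 200 10',
]
ratio = unsorted_ratio(sample)
```

A caller could compare `ratio` against a threshold of its choosing (the right value is an open question in this thread) and print a warning or abort the import.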
If they have to be sorted, we could look into the API and debug it to find out why it is not working; there may be a chance to fix it. Tracking requests that do not arrive in the right order are also a problem for JavaScript tracking, for example when one request is faster than another. Alternatively, the bulk request handler could sort requests by date. On top of that, log analytics would need to sort as well, since it sends many bulk requests. |
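On the client side, sorting one batch by date before sending it could look like the sketch below. It assumes the bulk tracking payload is a JSON object with a `requests` list of query strings and a `token_auth` field, and that each event carries `cdt` as a `YYYY-MM-DD HH:MM:SS` string; the helper name `build_bulk_payload` and the sample events are hypothetical. The payload is only built here, not sent.

```python
import json
from datetime import datetime
from urllib.parse import urlencode


def build_bulk_payload(events, token_auth):
    """Sort events by their cdt timestamp and build a bulk payload
    (hypothetical helper; payload shape is an assumption)."""
    ordered = sorted(
        events,
        key=lambda e: datetime.strptime(e["cdt"], "%Y-%m-%d %H:%M:%S"),
    )
    requests = ["?" + urlencode(e) for e in ordered]
    return json.dumps({"requests": requests, "token_auth": token_auth})


events = [
    {"idsite": 1, "rec": 1, "action_name": "second", "cdt": "2016-05-10 12:01:00"},
    {"idsite": 1, "rec": 1, "action_name": "first", "cdt": "2016-05-10 12:00:00"},
]
payload = json.loads(build_bulk_payload(events, "hypothetical-token"))
```

Sorting inside a single batch does not help when an older event lands in a later batch, which is why the comment notes that log analytics would need its own ordering across bulk requests.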
I add some considerations:
|
I am dealing with the same problem right now. I am tracking app users via JavaScript and using an intermediate service that collects the users' tracking records, queues them, and sends them in bulk to the Piwik API. ... and thereby in wild changes in the ratio of unique visitors to total visits, with the visit count being much too high on some days: We started noticing this behavior after April 22nd this year, though I can no longer tell whether we updated Piwik on that day (that may well be the case, so this may relate to a change introduced in Piwik shortly before then). I am also correctly sending token_auth and the record timestamp (cdt), and I couldn't find anything else wrong with the records themselves. As seen in the graph, the system used to work perfectly before. This is quite annoying. For ease of use, the order in which records are sent to the tracking API shouldn't matter when the timestamp is set, which would be the expected behavior anyway. |
@FewKinG Could you please attach a sample log file and the command you use, so that we can reproduce the issue? If there is a data/log file sample that lets us reproduce it, we will be able to investigate and solve the issue. |
I'm not doing log file imports; this behavior shows up during normal operation. I get tracking events from the clients, collect them in a queue until a certain number of events is stored, then send them to Piwik in a bulk request. The situation is no different when I send the tracking events individually instead of in bulk. Our tracking system cannot, and should not have to, reliably ensure that events are always processed and sent to Piwik in order. But even if I try to enforce it (only handling the next event once the previous HTTP request to Piwik has finished, which is usually not practical in normal operation), the data in Piwik shows the weirdness described above, even though it may be less extreme. Because our client applications don't send their tracking events to our backend immediately, but collect data in batches and transmit once a certain number of events has been collected and an internet connection is available, we include a timestamp (cdt) in the tracking API request to record the event at the correct time. I don't think the issue is easily reproducible. I did not find any relation between the data being tracked and the false visits being recorded. If anything, there is a correlation with the number of events tracked per unit of time (the more events tracked at once or in parallel, the more likely false visits are to occur). Maybe @type0lang can give some more hints or sample data.
|
I had a quick look at the Piwik code. It seems that on a tracking request, Piwik tries to find an existing visit within a certain time window to which it can append the new request. If it doesn't find one, or finds one that is too old, it counts the request towards a new visit. That generally makes sense. But suppose I want to track some requests belonging to a new visit, with the first of those requests forcing new_visit=1. Suppose also that these requests are out of order, so the second request arrives first. Piwik now has no existing visit to append that request to and thus creates a new visit. It can't know in advance that I will later supply a request that actually happened earlier. The visitor lookup accounts for this by also looking into the future within the time window, BUT it only does that if new_visit is not forced. Otherwise it counts the request arriving second as another new visit. This is only a theory, which might explain all or at least some of the behavior I was seeing. And this scenario of course only applies because I regularly force new_visit manually. I don't know if this is also the case for you, @type0lang? |
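The theory in this comment can be illustrated with a toy model. The sketch below is not Piwik's code: `count_visits`, the 30-minute `VISIT_WINDOW`, and the matching rule (attach to any visit with an action within the window, past or future, unless a new visit is forced) are simplifying assumptions made only to show the effect of arrival order.

```python
from datetime import datetime, timedelta

VISIT_WINDOW = timedelta(minutes=30)  # assumed visit timeout


def count_visits(requests):
    """Toy model: requests is a list of (timestamp, force_new_visit)
    tuples in ARRIVAL order. Returns the number of visits created."""
    visits = []  # each visit is a list of action timestamps
    for ts, force_new in requests:
        match = None
        if not force_new:
            # Look both backwards and forwards in time for a nearby visit.
            for visit in visits:
                if any(abs(ts - t) <= VISIT_WINDOW for t in visit):
                    match = visit
                    break
        if match is None:
            visits.append([ts])  # forced, or nothing close enough: new visit
        else:
            match.append(ts)
    return len(visits)


t0 = datetime(2016, 5, 10, 12, 0, 0)
first = (t0, True)                           # forces new_visit=1
second = (t0 + timedelta(minutes=1), False)  # follow-up action

in_order = count_visits([first, second])
out_of_order = count_visits([second, first])
```

In this model, the in-order sequence produces one visit, while the out-of-order sequence produces two: the follow-up action opens a visit because nothing exists yet, and the forced request that arrives afterwards opens another.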
I attached my data in the previous post. You can also reproduce my case with a log that contains the same URL, http://mywebsite.com/foo/bar, with visits dated one second apart. |
FYI: Piwik assumes visits/tracking requests are sent in chronological order. If this assumption does not hold, the data in Piwik will be incorrect. |
It seems the problem is caused by data being imported into Piwik out of chronological order, which by design we cannot (at this time) handle correctly. The solution is to ensure all data reaches Piwik in chronological order. |
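One way to ensure chronological order before an import is to sort the log lines by timestamp first. The following is a minimal sketch under the assumption that the log uses Apache common/combined timestamps; `parse_timestamp` and `sort_log_lines` are hypothetical helper names, and the whole file is held in memory, which may not suit very large logs.

```python
import re
from datetime import datetime

# Assumed timestamp layout: Apache common/combined log format.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def parse_timestamp(line):
    """Extract the timezone-aware timestamp from one log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in line: {line!r}")
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def sort_log_lines(lines):
    """Return the lines in chronological order. The sort is stable, so
    lines with identical timestamps keep their original relative order."""
    return sorted(lines, key=parse_timestamp)


unsorted_lines = [
    '1.2.3.4 - - [10/Oct/2016:13:00:02 +0000] "GET /foo/bar HTTP/1.1" 200 10',
    '1.2.3.4 - - [10/Oct/2016:13:00:00 +0000] "GET /foo/bar HTTP/1.1" 200 10',
    '1.2.3.4 - - [10/Oct/2016:13:00:01 +0000] "GET /foo/bar HTTP/1.1" 200 10',
]
sorted_lines = sort_log_lines(unsorted_lines)
```

The sorted lines can then be written to a new file and passed to the import as usual.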
Hi
I am using the Piwik (2.16) tracking API with a script to import actions from historical data. The strange behavior can be summarized in this image:
Two actions are split into two visits even though only 1 minute has passed.
The same problem in a different view.
A second experiment (after purging the DB) with the same actions, aggregated into different visits...
Is there something wrong with the aggregation engine? What is the logic for aggregating actions into visits? Are there any parameters to tune this logic?