@sebalis opened this Issue on March 16th 2018

If the import script is called on the same logfile more than once, entries that were imported in the first run and are still present in the file during the second run are imported twice (at least when I tested it on Apache logs). This creates two problems: running the script has to be tied to the web server rotating the file, and new entries cannot be seen in Matomo until the log has been rotated and the importer run again.

It would be much better if it were possible to run the importer frequently, maybe even every minute. In order to do this, the importer needs to be able to know which entries have already been imported. I don’t know how to implement this in detail. I suppose it should be reasonable to expect that the log entries have time data with one-second granularity. So by remembering the time of the latest entry already imported and perhaps adding some logic for re-identifying entries from that last second, it would become feasible to run the importer as often as one would like without messing up the data. What do you think? For me this might make the difference between choosing Matomo or some other log analysis software.
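The idea above (remember the time of the latest imported entry, plus enough information to re-identify entries from that last second) could be sketched roughly as follows. This is only an illustration and not part of the importer: `parse_ts`, `select_new` and the checkpoint layout are hypothetical names, the regular expression assumes Apache common/combined log timestamps, and the logic assumes entries appear in non-decreasing time order, which real logs do not strictly guarantee.

```python
import hashlib
import re
from collections import Counter
from datetime import datetime

# Timestamp as written by Apache common/combined logs,
# e.g. [16/Mar/2018:10:15:32 +0100] (assumed format).
TS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def parse_ts(line):
    """Return the entry's timestamp as an aware datetime, or None."""
    m = TS_RE.search(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")


def digest(line):
    return hashlib.sha256(line.encode("utf-8")).hexdigest()


def select_new(lines, checkpoint=None):
    """Return (new_lines, new_checkpoint).

    checkpoint is None or {"ts": datetime, "counts": Counter}, where
    counts holds how often each line digest was already imported for
    the latest second seen so far.
    """
    last_ts = checkpoint["ts"] if checkpoint else None
    counts = Counter(checkpoint["counts"]) if checkpoint else Counter()
    skip = Counter(counts)  # already-imported entries of that last second
    new_lines = []
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # unparsable line: ignored in this sketch
        if last_ts is not None and ts < last_ts:
            continue  # older than the checkpoint: imported earlier
        key = digest(line)
        if last_ts is not None and ts == last_ts and skip[key] > 0:
            skip[key] -= 1  # same second, already imported once
            continue
        if last_ts is None or ts > last_ts:
            # A later second starts: the bookkeeping begins afresh.
            last_ts, counts, skip = ts, Counter(), Counter()
        counts[key] += 1
        new_lines.append(line)
    if last_ts is None:
        return new_lines, None
    return new_lines, {"ts": last_ts, "counts": counts}
```

Counting digests rather than storing a plain set means that two identical requests within the same second are both kept. Between runs the checkpoint would have to be stored somewhere, for example as a small JSON file next to the log.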

@fdellwing commented on March 16th 2018 Contributor

There is no other open source analytics software with this many features, but I'm with you nonetheless.

Personally I'm importing access.log.1 every day at 0:00, and log rotation happens at 5:00. So I have 2-day-old logs, but no duplicates.

@sebalis commented on March 18th 2018

Matomo looks very good, I would like to use it – although for privacy reasons I will restrict myself to the log importer, which reduces the difference to other products. Also this makes it all the more important to get as much accuracy and timeliness out of the importer as possible.

@fdellwing commented on March 19th 2018 Contributor

To respect privacy you should imho use the JS tracker, because it respects DNT and other tracking blockers, while the log import does not.

@sebalis commented on March 19th 2018

Using JS trackers is out of the question – I do see your point about DNT but let’s not even begin to discuss that. And with my concerns I do of course anonymise my logs (by zeroing the final two bytes of the IP).
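For readers wondering what zeroing the final two bytes of an IP looks like in practice, here is a minimal sketch. It is not taken from the thread: the function name `anonymise_ipv4` is hypothetical, and it covers IPv4 addresses only, since nothing was said about IPv6.

```python
import ipaddress


def anonymise_ipv4(ip):
    """Zero the last two bytes of an IPv4 address given as a string."""
    addr = ipaddress.IPv4Address(ip)
    # Keep the upper 16 bits and clear the lower 16 bits.
    masked = int(addr) & 0xFFFF0000
    return str(ipaddress.IPv4Address(masked))
```

With this, `anonymise_ipv4("203.0.113.77")` gives `"203.0.0.0"`.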

@mackuba commented on August 4th 2018

I see that the import_logs script has an --exclude-older-than option (added in December) - would that work, with some kind of "last import" flag that's kept in a file and updated whenever the log is parsed, and then passed to that option? Anyway, I'm planning to set it up this way myself :)
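A wrapper along those lines might look like the sketch below. It is an assumption-laden illustration, not a tested setup: the helper names and the state file are made up, and the date format passed to `--exclude-older-than` (`YYYY-MM-DD hh:mm:ss +0000`) should be checked against the script's `--help` output before relying on it.

```python
import os
import tempfile
from datetime import datetime

# Assumed date format for --exclude-older-than; verify with --help.
DATE_FMT = "%Y-%m-%d %H:%M:%S %z"


def read_last_import(state_path):
    """Return the stored cutoff as an aware datetime, or None on first run."""
    try:
        with open(state_path, encoding="utf-8") as fh:
            return datetime.strptime(fh.read().strip(), DATE_FMT)
    except FileNotFoundError:
        return None


def write_last_import(state_path, when):
    """Store the cutoff atomically so a crash cannot leave half a file."""
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(when.strftime(DATE_FMT))
    os.replace(tmp, state_path)


def build_command(script, matomo_url, logfile, last_import):
    """Build the argument list for one importer run."""
    cmd = [script, "--url=" + matomo_url]
    if last_import is not None:
        cmd.append("--exclude-older-than=" + last_import.strftime(DATE_FMT))
    cmd.append(logfile)
    return cmd
```

A cron job could call `read_last_import`, run the resulting command with `subprocess.run(cmd, check=True)`, and call `write_last_import` only after the importer exits successfully. Storing the timestamp of the last log entry rather than the wall-clock time of the run is probably safer, and entries that fall exactly on the cutoff second may still be duplicated or dropped depending on how the option treats the boundary.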
