@sebalis opened this Issue on March 16th 2018

If the import script is called on the same logfile more than once, entries imported in the first run and still present in the file during the second run are imported twice (at least when I tested it on Apache logs). This creates a few problems: running the script has to be tied to the web server rotating the file, and new entries can not be seen in Matomo until the log has been rotated and the importer run again.

It would be much better if it were possible to run the importer frequently, maybe even every minute. In order to do this, the importer needs to be able to know which entries have already been imported. I don’t know how to implement this in detail. I suppose it should be reasonable to expect that the log entries have time data with one-second granularity. So by remembering the time of the latest entry already imported and perhaps adding some logic for re-identifying entries from that last second, it would become feasible to run the importer as often as one would like without messing up the data. What do you think? For me this might make the difference between choosing Matomo or some other log analysis software.

@fdellwing commented on March 16th 2018 Contributor

There is no other open source analystic software with this many features, but I'm with you non the less.

Personally I'm importing access.log.1 every day at 0:00 and log rotating is happening at 5:00. So I have 2 day old logs, but no duplicates,

@sebalis commented on March 18th 2018

Matomo looks very good, I would like to use it – although for privacy reasons I will restrict myself to the log importer, which reduces the difference to other products. Also this makes it all the more important to get as much accuracy and timeliness out of the importer as possible.

@fdellwing commented on March 19th 2018 Contributor

To respect privacy you should imho use the JS tracker because he respects DNT and other tracking blockers while log import will not do that.

@sebalis commented on March 19th 2018

Using JS trackers is out of the question – I do see your point about DNT but let’s not even begin to discuss that. And with my concerns I do of course anonymise my logs (by zeroing the final two bytes of the IP).

@mackuba commented on August 4th 2018

I see that the import_logs script has an --exclude-older-than option (added in December) - would that work, with some kind of "last import" flag that's kept in a file and updated whenever the log is parsed, and then passed to that option? Anyway, I'm planning to set it up this way myself :)

@sebalis commented on November 27th 2018

Sorry for the late response. I havent’t tested it as my workaround was to set up a job to import the ‘.1’ log file (the first ‘rotated’ one). This was possible since the rotation at this site takes place at regular intervals. But it does seem that this option would work. I might be interested in using it for another case where the logs do not rotate so regularly but don’t know when I will get round to it. Feel free to close the issue.

@sebalis commented on November 27th 2018

One minor quibble: --exclude-older-than t₁ would appear to restrict the import to records with a time t ≥ t₁. If I have imported a logfile I know the time t₀ of the latest record I have imported, so it would be convenient to restrict to t > t₀. It seems like I will have to calculate t₁ = t₀ + 1s in order to use this option. Something like --only-newer-than would be slightly better.

@mackuba commented on November 27th 2018

I've actually implemented a PR doing something like this in the meantime: https://github.com/matomo-org/matomo-log-analytics/pull/232

And yes, I tried to use --exclude-older-than first until I noticed I need to do > and not ≥ 😉

My first approach was to find the last timestamp using a separate script in Ruby, but then I realized that the import_logs.py is already going through all lines and parsing dates and stuff from them, so it makes more sense if it finds the timestamp during the import.

Powered by GitHub Issue Mirror