log import: avoid importing duplicates #264

sebalis · 2018-03-16T12:43:57Z

If the import script is called on the same logfile more than once, entries imported in the first run and still present in the file during the second run are imported twice (at least when I tested it on Apache logs). This creates a few problems: running the script has to be tied to the web server rotating the file, and new entries cannot be seen in Matomo until the log has been rotated and the importer run again.

It would be much better if it were possible to run the importer frequently, maybe even every minute. In order to do this, the importer needs to be able to know which entries have already been imported. I don’t know how to implement this in detail. I suppose it should be reasonable to expect that the log entries have time data with one-second granularity. So by remembering the time of the latest entry already imported and perhaps adding some logic for re-identifying entries from that last second, it would become feasible to run the importer as often as one would like without messing up the data. What do you think? For me this might make the difference between choosing Matomo or some other log analysis software.
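The idea above (remember the time of the latest imported entry, then re-identify entries from that last second) could be sketched roughly as follows. This is only an illustration, not how import_logs.py works: the function names `parse_time` and `select_new_entries` are hypothetical, and the sketch assumes Apache-style timestamps in square brackets and an English/C locale for month names.

```python
from datetime import datetime

# Assumed Apache timestamp layout, e.g. [16/Mar/2018:12:43:57 +0000]
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_time(line):
    """Return the timestamp found between the first '[' and the next ']'."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], APACHE_TIME_FORMAT)


def select_new_entries(lines, last_time=None, seen_in_last_second=frozenset()):
    """Return (new_lines, newest_time, lines_seen_at_newest_time).

    last_time and seen_in_last_second are the values returned by the
    previous run; pass nothing on the very first run.
    """
    new_lines = []
    newest = last_time
    seen = set(seen_in_last_second)
    for line in lines:
        t = parse_time(line)
        if last_time is not None:
            if t < last_time:
                continue  # strictly older than the last import: already done
            if t == last_time and line in seen_in_last_second:
                continue  # same second and same text: already imported
        new_lines.append(line)
        if newest is None or t > newest:
            newest = t
            seen = {line}  # a newer second starts a fresh "seen" set
        elif t == newest:
            seen.add(line)
    return new_lines, newest, seen
```

One known limitation of this sketch: if a byte-identical line is logged again in the same second after a run has finished, the next run would wrongly skip it. Counting occurrences instead of storing a set would handle that case.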

fdellwing · 2018-03-16T16:01:10Z

There is no other open source analytics software with this many features, but I'm with you nonetheless.

Personally, I import access.log.1 every day at 0:00, and log rotation happens at 5:00. So I have 2-day-old logs, but no duplicates.

sebalis · 2018-03-18T22:35:01Z

Matomo looks very good, I would like to use it – although for privacy reasons I will restrict myself to the log importer, which reduces the difference to other products. Also this makes it all the more important to get as much accuracy and timeliness out of the importer as possible.

fdellwing · 2018-03-19T07:39:06Z

To respect privacy you should imho use the JS tracker, because it respects DNT and other tracking blockers, while the log import does not.

sebalis · 2018-03-19T08:48:56Z

Using JS trackers is out of the question – I do see your point about DNT but let’s not even begin to discuss that. And with my concerns I do of course anonymise my logs (by zeroing the final two bytes of the IP).

mackuba · 2018-08-04T21:13:30Z

I see that the import_logs script has an --exclude-older-than option (added in December) - would that work, with some kind of "last import" flag that's kept in a file and updated whenever the log is parsed, and then passed to that option? Anyway, I'm planning to set it up this way myself :)
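A minimal sketch of that "last import" flag, assuming the timestamp is kept as plain text in a small state file and handed to `--exclude-older-than` on the next run. The helper names and the state file path are hypothetical, and the other options a real import_logs.py call needs (such as the Matomo URL) are left out.

```python
from pathlib import Path


def read_last_import(state_file):
    """Return the stored timestamp string, or None if nothing is stored yet."""
    path = Path(state_file)
    if not path.exists():
        return None
    text = path.read_text(encoding="utf-8").strip()
    return text or None


def write_last_import(state_file, timestamp):
    """Store the timestamp of the newest imported entry for the next run."""
    Path(state_file).write_text(timestamp + "\n", encoding="utf-8")


def build_import_args(log_file, last_import):
    """Build an argument list for the importer (other options omitted)."""
    args = ["import_logs.py"]
    if last_import is not None:
        # Only pass the option once a previous import has been recorded.
        args.append("--exclude-older-than=" + last_import)
    args.append(log_file)
    return args
```

The list returned by `build_import_args` could then be passed to `subprocess.run` together with the Python interpreter and the remaining options, and `write_last_import` called only after the import has succeeded.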

sebalis · 2018-11-27T21:51:58Z

Sorry for the late response. I haven’t tested it as my workaround was to set up a job to import the ‘.1’ log file (the first ‘rotated’ one). This was possible since the rotation at this site takes place at regular intervals. But it does seem that this option would work. I might be interested in using it for another case where the logs do not rotate so regularly but don’t know when I will get round to it. Feel free to close the issue.

sebalis · 2018-11-27T22:10:44Z

One minor quibble: --exclude-older-than t₁ would appear to restrict the import to records with a time t ≥ t₁. If I have imported a logfile I know the time t₀ of the latest record I have imported, so it would be convenient to restrict to t > t₀. It seems like I will have to calculate t₁ = t₀ + 1s in order to use this option. Something like --only-newer-than would be slightly better.
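The t₁ = t₀ + 1s calculation is small enough to sketch. The function name `next_second` is hypothetical, and the timestamp layout is an assumption taken from the `date` format string quoted elsewhere in this thread.

```python
from datetime import datetime, timedelta

# Assumed layout: "YYYY-MM-DD hh:mm:ss +zzzz"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def next_second(t0_text):
    """Return t1 = t0 + 1 second, formatted the same way as the input."""
    t0 = datetime.strptime(t0_text, TIMESTAMP_FORMAT)
    return (t0 + timedelta(seconds=1)).strftime(TIMESTAMP_FORMAT)


print(next_second("2018-11-27 23:59:59 +0100"))
# prints: 2018-11-28 00:00:00 +0100
```

Note that entries logged within the same second as t₀ but after the previous import finished would be dropped by this approach, which is the one-second granularity problem raised at the top of the thread.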

mackuba · 2018-11-27T23:51:16Z

I've actually implemented a PR doing something like this in the meantime: #232

And yes, I tried to use --exclude-older-than first until I noticed I need to do > and not ≥ 😉

My first approach was to find the last timestamp using a separate script in Ruby, but then I realized that the import_logs.py is already going through all lines and parsing dates and stuff from them, so it makes more sense if it finds the timestamp during the import.

tsteur · 2020-05-24T20:20:41Z

This is a duplicate of #144 I think.

sandrocantagallo · 2023-10-31T10:21:59Z

There is a parameter in the script:

--exclude-older-than

So if you run the cron job once a day, you can pass yesterday's date and time as the parameter to exclude older log data (for example, because the log only restarts every 7 days).

To print yesterday's date on Linux: date -d "yesterday" +"%Y-%m-%d %H:%M:%S %z"
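For setups that drive the importer from Python rather than from a shell, roughly the same value can be produced as below. This is only a sketch: the function name `yesterday_stamp` is hypothetical, and it subtracts exactly 24 hours, which may differ from `date -d "yesterday"` across a daylight saving change.

```python
from datetime import datetime, timedelta


def yesterday_stamp(now=None):
    """Format the moment 24 hours before `now` (default: current local time)."""
    if now is None:
        # astimezone() attaches the local UTC offset so that %z is not empty.
        now = datetime.now().astimezone()
    return (now - timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S %z")


print(yesterday_stamp())
```

Passing an explicit timezone-aware `now` makes the result reproducible, which is handy when testing a cron wrapper.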

Findus23 transferred this issue from matomo-org/matomo May 24, 2020

tsteur closed this as completed May 24, 2020

tsteur added the duplicate label May 24, 2020
