Invalidating the reports when importing visits: dead code + use case issue #7372

Closed
cbay opened this issue Mar 5, 2015 · 12 comments
Labels
answered For when a question was asked and we referred to forum or answered it.

Comments

@cbay
Contributor

cbay commented Mar 5, 2015

As I understand it, since commit fb91155, archived reports are automatically invalidated when importing visits.

First, there's dead code in misc/log-analytics/import_logs.py that declares the (now unused) --invalidate-dates option.

Second, since this option is gone, I now have an issue with my use case. Every few months, I run the archiving process (core:archive), and then purge the archive_* tables.

For example, say I do that after I have imported the logs up to March 3rd (timezone CET). Once the archive_* tables are empty, I have to re-import the logs from March 3rd and NOT invalidate the reports, because Piwik (I assume) works in UTC and would otherwise invalidate the reports for both March 2nd and 3rd, as my log files span two days (UTC).

Now that --invalidate-dates is gone, I'm stuck. I suggest adding an --ignore-dates option to the import_logs.py script, which would just ignore the specified dates (in UTC!).

If that solution is considered worthwhile, I can code it (as well as removing the dead code).
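
To illustrate the overlap described above, here is a minimal Python sketch that lists the UTC calendar dates touched by a 24-hour log file cut at midnight CET. It is only an illustration: the helper name `utc_dates_touched` is made up, the year 2015 is assumed from the date of this issue, and CET is modelled as a fixed UTC+1 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET is modelled as a fixed UTC+1 offset (no daylight saving).
CET = timezone(timedelta(hours=1), "CET")


def utc_dates_touched(local_day_start, hours=24):
    """Return the sorted UTC calendar dates overlapped by a log window."""
    start = local_day_start.astimezone(timezone.utc)
    # Subtract one microsecond so that a window ending exactly at midnight
    # does not count as touching the following day.
    end = start + timedelta(hours=hours) - timedelta(microseconds=1)
    dates = []
    current = start.date()
    while current <= end.date():
        dates.append(current)
        current += timedelta(days=1)
    return dates


# A log file covering March 3rd, midnight to midnight CET (year assumed).
window_start = datetime(2015, 3, 3, 0, 0, tzinfo=CET)
for day in utc_dates_touched(window_start):
    print(day.isoformat())
```

Under these assumptions the window starts at 23:00 UTC on March 2nd, so two UTC dates are touched by a single day of logs.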

@mattab
Member

mattab commented Mar 6, 2015

Hi @cbay !

Good to see you here :-)

I've now removed the unused code. If you still need to invalidate the old data, which should now be necessary only in edge cases, you can do so by calling the API as documented in this FAQ: https://piwik.org/faq/how-to/faq_155/

Regarding UTC vs. other timezones: technically, Piwik should be smart and invalidate data correctly based on the timezone of the website. It's possible this is not working correctly and there's a bug, but if there is, we should fix that timezone bug... I will close this issue, as I think the FAQ linked above will let you do what you want 👍

mattab closed this as completed Mar 6, 2015
mattab added the answered label Mar 6, 2015
@cbay
Contributor Author

cbay commented Mar 6, 2015

Hi @mattab :)

The FAQ tells me how to invalidate old data, but I was asking for a way to NOT invalidate old data. I used to be able to do that with the --invalidate-dates option, but now it's gone.

Let me try to explain my use case again:

  • I import logs every day with import_logs.py
  • I need to purge the log_* tables (I said archive_* in my previous message, that was a mistake) once in a while, so I first run the archiving for all websites and then TRUNCATE those tables
  • when I import the logs for the next day, Piwik will automatically invalidate reports for that day. BUT! since the log_* tables were emptied, they only contain the logs for that one day (24h period), but Piwik will invalidate reports for up to 2 days (because of timezone issues)

What I used to do was to first re-import the log for the day before (which I had already imported before the log_* tables were purged), setting --invalidate-dates to nothing so that Piwik doesn't invalidate anything. And THEN import the log for the current day. Piwik will invalidate reports for that day and the day before, but since the log_* tables now contain both days, that's OK.

@cbay
Contributor Author

cbay commented Mar 7, 2015

Actually, I don't think ignoring dates in import_logs.py would help. An option not to invalidate reports is still needed, I guess.

@mattab
Member

mattab commented Mar 8, 2015

> they only contain the logs for that one day (24h period), but Piwik will invalidate reports for up to 2 days (because of timezone issues)

It sounds like it may be a bug if Piwik invalidates dates incorrectly. By design, Piwik should only invalidate reports for the days (in the site's timezone) for which new data was imported.

This should work independently of the log timezone (whether the logs are in UTC or already in the site's timezone)...

I don't think we need a new option to "not invalidate" reports, because it should 'just work', assuming you only import data for whole days. Maybe you imported a few lines from earlier days, which would explain why those reports were invalidated?

Or maybe there is a bug in Piwik; if you can create another issue with steps to reproduce, we will take a look 👍

@cbay
Contributor Author

cbay commented Mar 8, 2015

> assuming you only import data for whole days

Well, that's exactly my point. What's a full day? I import logs for full days CET (UTC+1). And those logs contain thousands of sites that have a lot of different timezones. That's why it won't work with the current implementation: the log_* tables will not contain full days and the reports are invalidated.

And that's why my workaround was to reimport the previous day WITHOUT invalidating reports, just to fill the log_* tables and make sure we DO have full days after purging the tables and continuing the imports. But I cannot do it anymore because the --invalidate-dates option is gone.

@mattab
Member

mattab commented Mar 9, 2015

OK, I now understand that the challenge is that your log files cover many websites with lots of different timezones. Adding back the --invalidate-dates option would not actually fix this issue, because the data invalidation is now done automatically on the server side when tracking data is collected.

Because it is an edge case (I think?), here is a manual "hack" solution: you can delete the rows that hold the status of which websites/dates should be invalidated.
If you run this query after importing the logs and before running the core:archive command, it should skip the archive invalidation for all dates/sites. Maybe this would work?

DELETE FROM piwik_option WHERE option_name IN ('InvalidatedOldReports_DatesWebsiteIds');

@cbay
Contributor Author

cbay commented Mar 9, 2015

I agree it could be considered an edge case (basically, hosting providers). However, I think it also impacts users with a single website in their access log, as long as the access log "timezone" (24h span) is different from the Piwik website timezone.

Let me try with an example. Say you are in New Zealand, in UTC+12, but your web hosting provider is in the UK (UTC). They provide you with 24h log files that each contain a whole day, from midnight to midnight (UTC). It means that for you, each time you import a log file, you actually import from noon to noon (UTC+12).

So when you purge the log_* tables, you will necessarily delete half a day (from midnight to noon). And when you import the next log file, the reports will be invalidated, but the next time archiving runs, it will miss half a day.

I'm afraid that use case is not an edge case at all.

I know we cannot re-add the old --invalidate-dates option, but I have a suggestion. How about a --dont-invalidate option, that would simply add a new GET parameter to the HTTP requests made by import_logs.py, and Piwik (PHP) would NOT invalidate the reports when that GET parameter is found?
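
The New Zealand example above can be checked with a short Python sketch that counts how many hours of a UTC midnight-to-midnight log file fall on each calendar date in the site's timezone. The function name `hours_per_local_date` and the concrete date are assumptions made for illustration, and UTC+12 is treated as a fixed offset as in the example.

```python
from datetime import datetime, timedelta, timezone


def hours_per_local_date(window_start_utc, site_offset_hours, hours=24):
    """Map each site-local date to the number of log hours that fall on it."""
    site_tz = timezone(timedelta(hours=site_offset_hours))
    covered = {}
    for h in range(hours):
        # Convert the start of each logged hour to the site's local time.
        local = (window_start_utc + timedelta(hours=h)).astimezone(site_tz)
        key = local.date().isoformat()
        covered[key] = covered.get(key, 0) + 1
    return covered


# One log file covering a whole UTC day (the date itself is illustrative).
log_start = datetime(2015, 3, 3, 0, 0, tzinfo=timezone.utc)

# Site in UTC+12: the file covers local noon to local noon.
print(hours_per_local_date(log_start, 12))

# Site in UTC: the file covers exactly one local day.
print(hours_per_local_date(log_start, 0))
```

For the UTC+12 site, each of the two local dates gets only 12 of its 24 hours from this file, which is the half-day gap described above once the log_* tables have been emptied.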

@mattab
Member

mattab commented Mar 11, 2015

> So when you purge the log_* tables, [...]
> I'm afraid that use case is not an edge case at all.

What I mean is that manually deleting the log_* tables is an edge case; I haven't heard of anyone doing this before. Generally, users delete log_* data using the built-in feature in Piwik, which usually prevents this issue naturally, since they only delete data that is e.g. 30+ days old and don't re-import overlapping days.

I still think that running the following SQL query after your log import is the best solution: DELETE FROM piwik_option WHERE option_name IN ('InvalidatedOldReports_DatesWebsiteIds'); - Note that the system is being rewritten in #7377 and this SQL will still be valid after the rewrite. What do you think?

@diosmosis
Member

May have to delete options like 'report_to_invalidate_%' as well.
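
As a rough sketch of how the two deletions suggested in this thread could be scripted together, the Python below runs them through a DB-API connection. Several things here are assumptions: the function name `clear_pending_invalidations` is made up, an in-memory SQLite table with a simplified schema stands in for Piwik's MySQL option table, the "piwik_" prefix is taken from the query quoted above, and the sample option names other than 'InvalidatedOldReports_DatesWebsiteIds' are invented for the demo.

```python
import sqlite3


def clear_pending_invalidations(conn, prefix="piwik_"):
    """Delete the option rows that record which reports to invalidate.

    `prefix` is interpolated into the SQL, so pass only a trusted value.
    Returns the number of rows deleted.
    """
    table = prefix + "option"
    cur = conn.cursor()
    cur.execute(
        "DELETE FROM " + table + " WHERE option_name = "
        "'InvalidatedOldReports_DatesWebsiteIds'"
    )
    deleted = cur.rowcount
    # Pattern as suggested in the thread. Note that "_" is itself a LIKE
    # wildcard, so this can match slightly more than the literal prefix.
    cur.execute(
        "DELETE FROM " + table
        + " WHERE option_name LIKE 'report_to_invalidate_%'"
    )
    deleted += cur.rowcount
    conn.commit()
    return deleted


# Demo against a stand-in table (simplified schema, hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE piwik_option (option_name TEXT PRIMARY KEY, option_value TEXT)"
)
conn.executemany(
    "INSERT INTO piwik_option VALUES (?, ?)",
    [
        ("InvalidatedOldReports_DatesWebsiteIds", "placeholder"),
        ("report_to_invalidate_example", "placeholder"),
        ("unrelated_option", "placeholder"),
    ],
)
print(clear_pending_invalidations(conn))
```

Whether the second pattern is actually needed depends on the Piwik version, as the comment above only says it "may" be required; test against a copy of the database first.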

@cbay
Contributor Author

cbay commented Mar 11, 2015

OK, I see why it would be a use case. In my case (billions of rows), TRUNCATE is pretty much instantaneous, while DELETE may take days.

Regarding the query you're suggesting: what if a user logs in Piwik during the import? Wouldn't that trigger the archiving (with incomplete data)?

@diosmosis
Member

> Regarding the query you're suggesting: what if a user logs in Piwik during the import? Wouldn't that trigger the archiving (with incomplete data)?

Not to take the conversation away from @mattab, but if browser triggered archiving is enabled, then yes, it would trigger archiving. (relevant code is in core\Archive.php)

@cbay
Contributor Author

cbay commented Mar 11, 2015

Thanks. I guess I'll have to use Piwik to purge log_* to avoid this issue rather than TRUNCATE'ing the tables myself.
