Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL query to find websites with traffic since last successful archiving can take 6+ hours #8571

Merged
merged 2 commits into from Aug 31, 2015

Conversation

tsteur
Copy link
Member

@tsteur tsteur commented Aug 13, 2015

refs #8066

Previous behaviour:

  • We either processed specifically specified sites
  • We processed all sites if specified
  • or by default we checked which sites had visits since last archiver run and which sites needed to be reprocessed because of invalidation etc. The got the list of all sites that had queries since last archiving run by doing one big query
  • We checked whether there were visits for a rather long timeframe

Difference to before:

  • Instead of executing the slow query each time an archiver starts we only execute one small query for just one site before a site is actually about to be processed.
  • We do not have to execute this query if we have to archive the site anyway because of other reasons: eg if websiteDayHasFinishedSinceLastRun or if isOldReportInvalidatedForWebsite
  • We check for visits in a shorter timeframe (since midnight in website timezone or since last archiving, whatever is smaller)
  • The log output of core archiver can be bigger since we by default archive all websites and log each site that can be skipped (eg if it had no visits)

We might try to run the archiver on a large Piwik system tmrw. I'm not really sure how to test it and to make sure we actually do not regress anything. Maybe we can deploy it on all our Piwik demos as well

@tsteur tsteur added not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. Needs Review PRs that need a code review labels Aug 13, 2015
@tsteur tsteur added this to the 2.15.0 milestone Aug 13, 2015
@mattab
Copy link
Member

mattab commented Aug 13, 2015

Follow up note:

  • Maybe in Piwik 3.0 we can delete the API SitesManager.getSitesIdWithVisits since:
    • it's not used in core,
    • it's not tested,
    • it's slow when many websites.

@tsteur
Copy link
Member Author

tsteur commented Aug 14, 2015

Marked method as deprecated

@tsteur
Copy link
Member Author

tsteur commented Aug 14, 2015

FYI: We're running the archiver on a big instance and the big query is definitely gone, the archiver starts immediately

@quba
Copy link
Contributor

quba commented Aug 14, 2015

FYI: it breaks MetaSites because there's no raw data for them.

@tsteur
Copy link
Member Author

tsteur commented Aug 14, 2015

Added more log output. Re meta sites I need to have a look how it was worked around there. In theory it should not change behaviour

@tsteur
Copy link
Member Author

tsteur commented Aug 14, 2015

Think I found it already...

@tsteur
Copy link
Member Author

tsteur commented Aug 17, 2015

To make sure we're compatible with MetaSites plugin I added https://github.com/PiwikPRO/plugin-MetaSites/pull/17

@tsteur
Copy link
Member Author

tsteur commented Aug 17, 2015

Can someone please have a look again?

@tsteur tsteur assigned tsteur and unassigned tsteur Aug 17, 2015
@quba
Copy link
Contributor

quba commented Aug 18, 2015

Let's test it tomorrow.

@tsteur
Copy link
Member Author

tsteur commented Aug 21, 2015

Rebased :) just FYI

@@ -21,6 +21,9 @@ This is a changelog for Piwik platform developers. All changes for our HTTP API'
### Internal Change
* The option `branch` of the console command `development:sync-system-test-processed` was removed as it is no longer needed.

### APIs
* There is a new event `CronArchive.getIdSitesNotUsingTracker` that allows you to set a list of idSites that do not use the Tracker API to make sure we archive these sites if needed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that important, but I think CronArchive.getIdSitesToForceArchivingFor would be a better name (since it is explicit about what the site IDs will be used for). getIdSitesNotUsingTracker I think requires knowing how CronArchive works.

Also, I don't think this event is enough to force archiving for the sites, since if there are no visits archiving will be skipped for them (see: https://github.com/piwik/piwik/blob/master/core/ArchiveProcessor/Loader.php#L121). This is a problem for plugins that display 3rd party data, for example. Should be noted here and the event docs.

OR we could call this event in the core archiver to force archiving even if there are no visits for that site... though perhaps this clashes w/ MetaSites. In this case, a new event could be created. Anyway, this hypothetical new event isn't required for this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it something like getIdSitesToForceArchivingFor (before I squashed them) but found it's not the right name as forcing might be misleading etc. Eg when someone is using skip-sites or force-sites this site might be not archived and therefore not forced. Also it might be not forced if there are no visits see your link. I found the word force too strong. I think from a 3rd party developer point of view it's maybe better understandable to have it getIdSitesNotUsingTracker. It only requires knowing whether a sites is using the Tracker or not. Maybe getIdSitesNotUsingTrackerAPI would be clearer?

I added it to make it work for MultiSites but haven't tested it for any actual 3rd data that only uses it's own plugin archiver (like MobileAppStore maybe does?). Do you think it's easy to fix? One problem that I'm not sure about is that it looks like we're working with multiple idSites there and not just one site see https://github.com/piwik/piwik/blob/master/core/ArchiveProcessor/Loader.php#L133 $this->params->getIdSites() . Probably it's always only one site though. I reckon that's what you suggested and I now understand what you mean with it clashes with MetaSites. It does sound like a new event would be needed but that's kinda too complex. Would it be something like getIdSitesNotUsingLogTables? It's maybe a bit too technical as many developers won't know the log tables.

  • MetaSites is not using Tracker API but Log tables
  • MobileAppStore is not using Tracker API and not Log tables (I think)

Do you have an idea for that event name? I could maybe implement it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe getIdSitesNotUsingTrackerAPI would be clearer?

In general, the issue I have w/ these names is that it's not clear what the intent is. The event is used to force CronArchive to send API requests to initiate archiving, but getIdSitesNotUsingTrackerAPI doesn't convey that information. I think it might be very easy for a plugin developer to overlook the event entirely. Maybe CronArchive.getIdSitesToForceArchivingRequestsFor? Or if you think getIdSitesNotUsingTrackerAPI might be re-used in a different context, it would be ok.

Do you have an idea for that event name? I could maybe implement it

How about Archiver.getIdSitesDisplayingThirdPartyData or Archver.getIdSitesDisplayingNonPiwikData? This isn't needed for this PR, but I think it would be a good stop-gap measure for 2.15. Since 3.0 won't be released for say at least 6 months or so, this event would help plugin developers in the meantime. What do you think? If you agree, I'll create a new issue.

@diosmosis
Copy link
Member

Created follow up issue re: the archiving event here: #8631

@diosmosis
Copy link
Member

@tsteur do you have anything else to say regarding the event name or shall I merge?

@tsteur
Copy link
Member Author

tsteur commented Aug 31, 2015

Nah not really. I wouldn't wanna use any event that contains force in archiving since it's simply not really forcing and there would be multiple events needed that contain the word force. getIdSitesDisplayingNonPiwikData or getIdSitesDisplayingThirdPartyData is not really correct either see eg MultiSites plugin that actually uses this event currently. It's not third party data that is shown. The only difference to normal behaviour is that it doesn't use the tracker.

I think after refactoring the archiver we might not need an event anymore at all, hopefully. And if so, we can maybe find a better name or provide a method for it somehow.

@diosmosis
Copy link
Member

Ok, will merge then.

diosmosis added a commit that referenced this pull request Aug 31, 2015
Fixes #8066, do not use SQL query in CronArchive to find websites with traffic since last successful archiving since it can take 6+ hours, instead use fast query for each individual website and use shorter timeframe for visits check.
@diosmosis diosmosis merged commit 1b91a51 into master Aug 31, 2015
@diosmosis diosmosis deleted the 8066 branch August 31, 2015 18:19
diosmosis pushed a commit that referenced this pull request Sep 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants