SQL query to find websites with traffic since last successful archiving can take 6+ hours #8066
Comments
Tentatively adding to 2.14.0 in case something can be quickly done about this issue. |
mysqldumpslow output for this query on one client instance shows how slow it is (on average the query takes one hour).
(AFAIK the query was executed 291 times over a 3-month period, i.e. roughly 3 times a day) |
Is there an instance to reproduce this and to test different queries / possible fixes? |
@tsteur Let's figure this out next week in the office, where we will have access to the instance to reproduce |
It seems to be using the correct index etc., and I couldn't figure out how to make it faster. It might be faster to run a separate query for each site just before starting the archiver for that site, instead of one big query to fetch them all. This would also be more accurate, as there might be new visits by the time the archiver actually gets to a specific site. |
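The per-site idea above can be sketched as follows. This is a minimal illustration, not Piwik's actual implementation: the in-memory table and the `site_has_traffic_since` helper are hypothetical, though the column names (`idsite`, `visit_last_action_time`) follow the real `log_visit` schema. The point is that one cheap, indexed `LIMIT 1` lookup per site replaces the single huge query across all sites.

```python
import sqlite3

# Hypothetical stand-in for Piwik's log_visit table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_visit (idsite INTEGER, visit_last_action_time TEXT)"
)
conn.executemany(
    "INSERT INTO log_visit VALUES (?, ?)",
    [(1, "2015-06-10 09:00:00"), (2, "2015-06-01 12:00:00")],
)

def site_has_traffic_since(conn, idsite, since):
    # One small indexed lookup per site, run right before archiving that
    # site; LIMIT 1 lets the database stop at the first matching visit.
    row = conn.execute(
        "SELECT 1 FROM log_visit "
        "WHERE idsite = ? AND visit_last_action_time > ? LIMIT 1",
        (idsite, since),
    ).fetchone()
    return row is not None

print(site_has_traffic_since(conn, 1, "2015-06-09 00:00:00"))  # True
print(site_has_traffic_since(conn, 2, "2015-06-09 00:00:00"))  # False
```

Checking each site lazily also addresses the accuracy point: the visit data is read at the moment that site's archiving starts, not once up front.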
@tsteur +1 I wonder if it's really needed to go X days back? I noticed that if no archiving process has been running for a few days, this query goes back to the timestamp saved in the piwik_option table. Maybe it's possible to simplify it a little, e.g. doing a select with |
But this approach requires some investigation, e.g. how will this work with multi-threaded archiving? Should we assume that it iterates through all the websites (same as --force-all-websites)? |
are you saying if |
Re multi-threading: If |
We have [...] Regarding the second question: in the query provided by Matt there is a WHERE clause checking a few days back. I'm not an expert, but I think it makes no sense in this case. |
Just FYI: it should use this index no matter whether it's for a single site or for all of them. Whether we have to look a few days back or not can probably only be answered by @mattab |
I thought about this and I think our current logic can be improved (this is OPTION A). The current logic is to select all websites that had visits since the last run of archiving. AFAIK it is not necessary to select all sites with traffic "since the last run". Instead we could "simply" select all websites with traffic "today" in the website's timezone. This would be good enough because all websites with traffic "yesterday or earlier" would anyway be processed by
If this is faster, then +1 to try this approach (this is OPTION B). So I guess we have two options to try: A) restrict the SQL query and make the timestamp range smaller; B) change the logic and instead of |
Option A will still be slow. I checked, and it was slow even when going back only 3 hours. So I'd personally try option B. |
I will go for both options, A and B. A) should make the query a bit faster. I made some changes so far here, but it is really hard not to regress / to make sure everything still works: https://github.com/piwik/piwik/compare/8066?expand=1 |
Here is an example query from one of our Enterprise Piwik instances:
The goal of this issue is to change this implementation so that `core:archive` is fast even when there are thousands of websites. For example, we could use a more efficient SQL query, or break this query down into thousands of smaller ones, one per website.