The sql queries used in the archiving: https://github.com/piwik/piwik/blob/master/plugins/Actions/Actions.php#L135
are not bounded, ie. they select all the rows, and then loops over all the result set and processes the php array containing all the data. If the default limit values haven't been changed in the config file (see explanation in the faq) then the archiving limits the number of rows in the php array. The problem is that this is done too late: millions of rows could have been processed from which we would only keep the top 5000 pages.
Unfortunately, this is by design: Piwik keeps 500 rows for the first level table, and keeps 100 rows for each of the children tables (ie. when you click on a row to expand it). Reference: https://github.com/piwik/piwik/blob/master/config/global.ini.php#L138
For a given day for a given website, you could basically record in the database up to 500 * (100 * 100 * ...), ie. millions of rows. This could be fine, but on websites where this is actually happening, this is way too much data to be handled by a single server. The initial SQL query which tries to return these millions of rows will have to swap and block on IO for minutes or hours to try and deliver this much data, temporarily LOCKing the tables involved.
The solution would be to introduce a new "hard limit" on the number of rows returned by the SQL, and have it hardcoded in the sql archiving queries. Because the queries sort by count descending, we could simply have a LIMIT 0, 50000 for example. This way, this would ensure a bounded data set and predictable behavior. This sounds like a rather simple fix which should just work...
Everything was working great on 0.4.5. After updating to 0.5.1 the archive script runs for hours and hours. I managed to get it to archive all daily stats yesterday but that took at least an hour to run. Last night's cron was still running at 11AM when I checked it (9 hours). This is on a quad core Xeon server. Here are the number of rows in the main tables:
This wouldn't be a big deal if it didn't LOCK TABLES and require extensive use of temp tables. It completely locks out any use of Piwik resulting in issues for anyone accessing a site that uses Piwik or anyone trying to view stats. It also puts a high load on the server slowing general operation.
Here's a sample of some of the slow queries from my logs:
# Time: 091212 11:03:02 # User<a class='mention' href='https://github.com/Host'>@Host</a>: piwik[piwik] @ [10.2.1.239] # Query_time: 60 Lock_time: 0 Rows_sent: 6 Rows_examined: 44481023 SELECT name, type, count(distinct t1.idvisit) as nb_visits, count(distinct visitor_idcookie) as nb_uniq_visitors, count(*) as nb_hits FROM (piwik_log_visit as t1 LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit)) LEFT JOIN piwik_log_action as t3 ON (t2.idaction_url = t3.idaction) WHERE visit_server_date = '2009-12-12' AND idsite = '7' GROUP BY t3.idaction ORDER BY nb_hits DESC; # Time: 091212 11:08:39 # User<a class='mention' href='https://github.com/Host'>@Host</a>: piwik[piwik] @ [10.2.1.239] # Query_time: 36 Lock_time: 0 Rows_sent: 3 Rows_examined: 26729656 SELECT name, type, count(distinct t1.idvisit) as nb_visits, count(distinct visitor_idcookie) as nb_uniq_visitors, count(*) as nb_hits FROM (piwik_log_visit as t1 LEFT JOIN piwik_log_link_visit_action as t2 USING (idvisit)) LEFT JOIN piwik_log_action as t3 ON (t2.idaction_name = t3.idaction) WHERE visit_server_date = '2009-12-12' AND idsite = '39' GROUP BY t3.idaction ORDER BY nb_hits DESC;
I don't think the LIMIT 0, 50000 (for example) will actually solve the issue. Mysql will still have to select the full result set, sort it, and then limit it. Maybe it will be necessary to