@mattab opened this Issue on April 13th 2015 Member

During core:archive cron execution, it can happen that Mysql server fails and then the core:archive script itself will die because the connection to database was cut. What this happens the script has to be restarted manually or it will wait until the next cron run. This causes complications to Piwik administrators and we want to make the script smarter, so it tries to re-establish connection to Mysql in case the db server had only a temporary problem.

Issue context

Imagine a long running core:archive script run. It triggers a lot of requests to the API to pre-process websites. Often a Mysql server may fail and return an error such as Mysql server has gone away. When this happens, the API request will fail and this failure be displayed in the core:archive script output, which will proceed with the next request. This works fine and the script continues with the remaining websites..

The problem occurs when the console command connection to Mysql has itself been broken. then core:archive will die any remaining websites and segments will not be pre-processed.


To reproduce: during a core:archive run, restart the Mysql server.

the core:archive script will error with an error:

INFO CoreConsole[2015-04-13 04:28:00] Archived website id = 83, period = month, 0 visits in last last52 months, 0 visits this month, Time elapsed: 0.447s
INFO CoreConsole[2015-04-13 04:28:00] Will pre-process for website id = 83, year period, the following 1 segments: { pageUrl!=xx } 
ERROR CoreConsole[2015-04-13 04:28:01] Error unserializing the following response from http://localhost/piwik-master/index.php?module=API&method=API.get&idSite=83&period=year&date=last7&format=php&token_auth=removed&trigger=archivephp: SQLSTATE[HY000] [2003] Can&<a href='/039'>#039</a>;t connect to MySQL server on &<a href='/039'>#039</a>;<a href='/039'>#039</a>; (111) 
INFO CoreConsole[2015-04-13 04:28:01] Archived website id = 83, period = year, 0 visits in last last7 years, 0 visits this year, Time elapsed: 0.469s
ERROR CoreConsole[2015-04-13 04:28:01] SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted

  SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted  

Here is an error that happened in production Piwik:

2015-03-21 07:27:54 INFO CoreConsole[2015-03-21 07:27:54] [4d9c4] Archived website id = 9, period = week, 40731912 visits in last last4 weeks, 7821120 visits this week, Time elapsed: 195.531s

2015-03-21 16:06:26 INFO CoreConsole[2015-03-21 16:06:26] [4d9c4] Archived website id = 9, period = month, 157882146 visits in last last4 months, 31272773 visits this month, Time elapsed: 31111.368s

2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

Proposed solution

The goal is to prevent such error and proceeding with archiving of remaining websites.

  • when an error occured for a request, do not re-try it: it's important to skip the failed request. For example if the archiving failed for a website because memory exhaust, it would fail again the next time. by skipping we ensure the remaining sites will be processed.
  • when a DB breaks temporarily, the script should be smart and wait a bit to retry.
    • could a retry mechanism re-connect to database after 1s and retry N times, logging a warning or so, be implemented in the command?
    • how to detect such errors? error codes 70100 and 2006 are a starting point
@SR-mkuhn commented on July 10th 2015

It seems like, for every site two mysql connections are established.
The first one sits idle and the other one does some work.
If the first one reaches wait_timeout everything fails.

One solution could be eliminating the idle connection.

@nekohayo commented on September 7th 2015

So as I understand it, the error is "harmless" in the sense that the user can just repeatedly re-run the console archive command until the thing succeeds? There will be no incomplete/corrupt data?

As it stands, the official technique involving "console core:archive" does not work at all on shared hosts like Hostgator's; on mine, I am unable to process piwik tracking logs on the commandline through SSH even for some fairly low/medium traffic websites (the piwik instance I'm currently looking at only has 38 thousand rows of tracking log tables, weighting about 6.4 MB). From what I understood of my discussions with HG's support staff, MySQL processes get killed when they use too much CPU time for too long.

@nekohayo commented on September 8th 2015

FWIW, I received this additional piece of info/suggestion from Hostgator today:

Since the script is being killed due to CPU limitations on our shared and reseller servers, the suggested solution would be to configure the script to perform its operations in batches instead of attempting to perform all of the work during a single process and MySQL connection. If the script can be configured to spawn separate processes to handle the work in batches, you can avoid the issue of a single process hitting the CPU limits and being killed

I don't know if that is a valid suggestion, and if it is, if it should be filed as a separate issue?

@mattab commented on September 8th 2015 Member

@nekohayo The core:archive already will trigger a separate process for each of the archiving requests. so this should already be implemented

@mattab commented on September 18th 2015 Member

as @SR-mkuhn pointed out, the common workaround to solve this issue (in most cases) is to increase wait_timeout in Mysql configuration.

@nekohayo commented on September 18th 2015

You can't actually do that as a user of a shared host, no?

@SR-mkuhn commented on September 18th 2015

@mattab, but wouldn't it be better to work with one mysql connection instead of two AND requirement of increasing wait_timeout?
One connection seems to gather data and at the end, the other idle one writes something.
If the first connection has to do a lot - the first send command by the other connection exceeds the timeout limit.
If those database connections would be merged - a increase of wait_timeout should be absolutly unnecessary.

@bartek85 commented on October 10th 2015

Maybe I can give you some more data to help you? I have a problem after upgrade from 2.12.0 to 2.14.3, it goeas away after archiving one website and switching to another.

wait_timeout=28800 (it's not the case)
max_allowed_packet_size=512MB (not the case, changing doeasn't made any difference, on 2.12.0 - 128MB - same data)
connection_timeout=28800 - that one helped a bit, but on 2.12.0 it's set to "10" and everything is working.

It's new archive (droped archive tables) with 4,4k websites and 40+GB piwik_log_visit.

@monkz commented on January 9th 2016

i think i might found a solution to this problem, altough i have no patch.

As described by @SR-mkuhn , core:archive seems to use two mysql connections.
This is because the singleton in https://github.com/piwik/piwik/blob/master/core/Db/Adapter/Pdo/Mysql.php loses its context.
CliMulti is the reason for it - it spawns a seperate process for a request and this process doesn't have any information about the mysql connection in the parent process.
While waiting for those worker processes to finish, there is no action in the mysql connection of the parent, so we run into the wait_timeout threshold and everything fails.

My recommendation is:
While waiting on child processes in https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php , execute a NOOP query once or twice during the period of wait_timeout inside the parent.
Select 1;, SET <a class='mention' href='https://github.com/ignore_me'>@ignore_me</a> = 0; or DO 0; should be sufficient. http://lists.mysql.com/mysql/142004
Like every PDO gets a noop function and we can call it from CliMulti.


@SR-mkuhn commented on March 10th 2016

Looks good.

@twanmeijerink commented on April 7th 2016

Hi @monkz,
we seem to have the same problem as you described. However, we do not really understand your solution. Can you describe in more detail where you exactly added the NOOP query?

Thanks in advance, Twan

@monkz commented on April 7th 2016

Hi @twanmeijerink ,
so far the concept of a noop is currently not implemented and i'm not a php developer.
It seems like, if a child process needs longer, the parent dies due to inactivity.
A NOOP command might be introduced here: https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php#L113

My solution is just a recommendation/suggestion for @mattab and @tsteur .

@geek-merlin commented on February 4th 2018

I solved a similar problem recently.
Bottom line:

  • Doing regular "NOP" sql queries needs complex code changes. Same for a reconnect logic.
  • Increasing wait_timeout generally as suggested here is bad, this should not be done for ordinary web requests (DOS!)
  • Increasing wait_timeout for the current session is imho the way to go and as simple via SQL as (say) SET SESSION wait_timeout = 900.
@Jon-IB commented on November 13th 2018

This is not just a matter of wait_timeout. I'm trying to archive 125 days of traffic on a lot of low-traffic sites, because I'm migrating to a new Piwik instance (long story). This causes MySQL to kill the connection. If the number of days is set to 125, then I usually get a failure within 30 seconds of the start of the archiving process.

I haven't figured out if the error is because of the size of the received packet, or perhaps even the number of rows to insert is too large.

From https://dev.mysql.com/doc/refman/5.7/en/gone-away.html :

An INSERT or REPLACE statement that inserts a great many rows can also cause these sorts of errors.

My wait_timeout appears to be set to 8 hours. I see the failures when the archiver is working on sites with more traffic, but the sites in question only have ~ 3000 visits in the last couple months. That doesn't seem like a lot of data to process at a time.

Regardless of the cause, it would be great if the archiver recovered from this error.

@mattab commented on May 4th 2020 Member

Thanks everyone for your feedback. We will very likely not implement this in the next few months so I'm closing this issue as wontfix for now.

This Issue was closed on May 4th 2020
Powered by GitHub Issue Mirror