
When the core:archive script dies with Mysql server has gone away, can we reconnect to DB automatically instead of erroring? #7669

Closed
mattab opened this issue Apr 13, 2015 · 15 comments
Labels
Task: Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
wontfix: If you can reproduce this issue, please reopen the issue or create a new one describing it.

Comments

@mattab
Member

mattab commented Apr 13, 2015

During core:archive cron execution, it can happen that the MySQL server fails and the core:archive script itself dies because its connection to the database was cut. When this happens, the script has to be restarted manually, or the remaining work waits until the next cron run. This causes complications for Piwik administrators, so we want to make the script smarter: it should try to re-establish the connection to MySQL in case the DB server had only a temporary problem.

Issue context

Imagine a long-running core:archive run. It triggers a lot of requests to the API to pre-process websites. Occasionally the MySQL server may fail and return an error such as MySQL server has gone away. When this happens, the API request fails, the failure is displayed in the core:archive script output, and the script proceeds with the next request. This works fine and the script continues with the remaining websites.

The problem occurs when the console command's own connection to MySQL has been broken: core:archive then dies, and any remaining websites and segments will not be pre-processed.

Reproduce

To reproduce: during a core:archive run, restart the MySQL server.

The core:archive script will then fail with an error such as:

INFO CoreConsole[2015-04-13 04:28:00] Archived website id = 83, period = month, 0 visits in last last52 months, 0 visits this month, Time elapsed: 0.447s
INFO CoreConsole[2015-04-13 04:28:00] Will pre-process for website id = 83, year period, the following 1 segments: { pageUrl!=xx } 
ERROR CoreConsole[2015-04-13 04:28:01] Error unserializing the following response from http://localhost/piwik-master/index.php?module=API&method=API.get&idSite=83&period=year&date=last7&format=php&token_auth=removed&trigger=archivephp: SQLSTATE[HY000] [2003] Can't connect to MySQL server on '127.0.0.1' (111) 
INFO CoreConsole[2015-04-13 04:28:01] Archived website id = 83, period = year, 0 visits in last last7 years, 0 visits this year, Time elapsed: 0.469s
ERROR CoreConsole[2015-04-13 04:28:01] SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted



  [Exception]                                                               
  SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted  

Here is an error that happened in production Piwik:

2015-03-21 07:27:54 INFO CoreConsole[2015-03-21 07:27:54] [4d9c4] Archived website id = 9, period = week, 40731912 visits in last last4 weeks, 7821120 visits this week, Time elapsed: 195.531s

2015-03-21 16:06:26 INFO CoreConsole[2015-03-21 16:06:26] [4d9c4] Archived website id = 9, period = month, 157882146 visits in last last4 months, 31272773 visits this month, Time elapsed: 31111.368s

2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

2015-03-21 16:06:26 ERROR CoreConsole[2015-03-21 16:06:26] [4d9c4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

Proposed solution

The goal is to prevent such errors and to proceed with archiving the remaining websites.

  • When an error occurred for a request, do not re-try it: it is important to skip the failed request. For example, if archiving failed for a website because memory was exhausted, it would fail again the next time; by skipping it we ensure the remaining sites are processed.
  • When the DB breaks temporarily, the script should be smart and wait a bit before retrying (a sketch of such a wrapper follows after this list).
    • Could a retry mechanism that re-connects to the database after 1s and retries N times, logging a warning or so, be implemented in the command?
    • How to detect such errors? Error codes 70100 and 2006 are a starting point.
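
For illustration, a minimal sketch of what such a retry wrapper could look like (this is not existing Piwik code; the $connect callable and the exact error codes checked are assumptions):

<?php
// Sketch only: a query helper that reconnects and retries when the connection was
// lost. $connect is any callable returning a fresh PDO instance, standing in for
// however the command actually (re)creates its connection. Assumes the connection
// uses PDO::ERRMODE_EXCEPTION so failures raise PDOException.

function isRetryableDbError(PDOException $e)
{
    // Codes mentioned in this issue: SQLSTATE 70100 ("Query execution was
    // interrupted") and MySQL driver error 2006 ("MySQL server has gone away").
    $sqlState   = (string) $e->getCode();
    $driverCode = isset($e->errorInfo[1]) ? $e->errorInfo[1] : null;
    return $sqlState === '70100' || in_array($driverCode, array(2006, 2013), true);
}

function queryWithReconnect(PDO $db, callable $connect, $sql, $maxRetries = 3)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $db->query($sql);
        } catch (PDOException $e) {
            if ($attempt > $maxRetries || !isRetryableDbError($e)) {
                throw $e; // not a connection problem, or too many attempts: give up
            }
            error_log("WARNING: DB connection lost, retrying in 1s (attempt $attempt of $maxRetries)");
            sleep(1);
            $db = $connect(); // re-establish the connection before retrying
        }
    }
}

Skipping the failed request itself (first bullet) stays the caller's responsibility; the wrapper above only deals with lost connections.
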
@mattab mattab added the Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change. label Apr 13, 2015
@SR-mkuhn

It seems like, for every site, two MySQL connections are established.
The first one sits idle while the other one does the work.
If the first one reaches wait_timeout, everything fails.

One solution could be to eliminate the idle connection.

@nekohayo

nekohayo commented Sep 7, 2015

So as I understand it, the error is "harmless" in the sense that the user can just repeatedly re-run the console archive command until the thing succeeds? There will be no incomplete/corrupt data?

As it stands, the official technique involving "console core:archive" does not work at all on shared hosts like Hostgator's; on mine, I am unable to process Piwik tracking logs on the command line through SSH, even for some fairly low/medium traffic websites (the Piwik instance I'm currently looking at only has 38 thousand rows in its tracking log tables, weighing about 6.4 MB). From what I understood of my discussions with HG's support staff, MySQL processes get killed when they use too much CPU time for too long.

@nekohayo

nekohayo commented Sep 8, 2015

FWIW, I received this additional piece of info/suggestion from Hostgator today:

Since the script is being killed due to CPU limitations on our shared and reseller servers, the suggested solution would be to configure the script to perform its operations in batches instead of attempting to perform all of the work during a single process and MySQL connection. If the script can be configured to spawn separate processes to handle the work in batches, you can avoid the issue of a single process hitting the CPU limits and being killed

I don't know if that is a valid suggestion, and if it is, if it should be filed as a separate issue?

@mattab
Member Author

mattab commented Sep 8, 2015

@nekohayo core:archive already triggers a separate process for each of the archiving requests, so this should already be implemented.

@mattab
Member Author

mattab commented Sep 18, 2015

As @SR-mkuhn pointed out, the common workaround for this issue (in most cases) is to increase wait_timeout in the MySQL configuration.

@nekohayo

You can't actually do that as a user of a shared host, no?

@SR-mkuhn

@mattab, but wouldn't it be better to work with one MySQL connection instead of two, and without the requirement of increasing wait_timeout?
One connection seems to gather the data, and at the end the other, idle one writes something.
If the first connection has a lot to do, the first command later sent over the other connection already exceeds the timeout limit.
If those database connections were merged, an increase of wait_timeout should be absolutely unnecessary.

@bartek85

Maybe I can give you some more data to help? I have a problem after upgrading from 2.12.0 to 2.14.3: the connection goes away after archiving one website and switching to another.

wait_timeout=28800 (not the cause)
max_allowed_packet_size=512MB (not the cause; changing it doesn't make any difference; on 2.12.0 it was 128MB with the same data)
connection_timeout=28800 - that one helped a bit, but on 2.12.0 it is set to "10" and everything works.

It is a new archive (dropped archive tables) with 4.4k websites and a 40+GB piwik_log_visit table.

@monkz

monkz commented Jan 9, 2016

Hi,
I think I might have found a solution to this problem, although I have no patch.

As described by @SR-mkuhn, core:archive seems to use two MySQL connections.
This is because the singleton in https://github.com/piwik/piwik/blob/master/core/Db/Adapter/Pdo/Mysql.php loses its context.
CliMulti is the reason for it - it spawns a separate process for each request, and this process doesn't have any information about the MySQL connection in the parent process.
While waiting for those worker processes to finish, there is no activity on the parent's MySQL connection, so we run into the wait_timeout threshold and everything fails.

My recommendation is:
While waiting on child processes in https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php , execute a NOOP query once or twice within the wait_timeout period inside the parent.
SELECT 1;, SET @ignore_me = 0; or DO 0; should be sufficient. http://lists.mysql.com/mysql/142004
For example, every PDO adapter could get a noop function that we call from CliMulti; a rough sketch of the idea follows below.
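
To illustrate the concept only (this is not the real CliMulti code; the polling loop and the childProcessesStillRunning() helper are made up for the example):

<?php
// Keep-alive sketch: while the parent waits for its child processes, ping the
// otherwise idle connection so the server does not close it once wait_timeout
// is reached.

$db = new PDO('mysql:host=127.0.0.1;dbname=piwik', 'user', 'password'); // placeholder credentials

$lastPing     = time();
$pingInterval = 60; // seconds; keep this well below the server's wait_timeout

while (childProcessesStillRunning()) { // hypothetical helper standing in for CliMulti's own check
    sleep(1);

    if (time() - $lastPing >= $pingInterval) {
        $db->query('SELECT 1'); // DO 0; or SET @ignore_me = 0; would work just as well
        $lastPing = time();
    }
}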

Greetings
MonkZ

@SR-mkuhn

Looks good.

@twanmeijerink

Hi @monkz,
we seem to have the same problem as you described. However, we do not really understand your solution. Can you describe in more detail where you exactly added the NOOP query?

Thanks in advance, Twan

@monkz

monkz commented Apr 7, 2016

Hi @twanmeijerink,
so far the concept of a NOOP is not implemented, and I'm not a PHP developer.
It seems like, if a child process takes longer, the parent dies due to inactivity.
A NOOP command might be introduced here: https://github.com/piwik/piwik/blob/master/core/CliMulti/Process.php#L113

My solution is just a recommendation/suggestion for @mattab and @tsteur.

@mattab mattab modified the milestones: Long term, Mid term Dec 5, 2016
@geek-merlin

geek-merlin commented Feb 4, 2018

I solved a similar problem recently.
Bottom line:

  • Doing regular "NOP" SQL queries needs complex code changes. The same goes for reconnect logic.
  • Increasing wait_timeout globally, as suggested here, is bad; this should not be done for ordinary web requests (DOS!).
  • Increasing wait_timeout for the current session is imho the way to go, and it is as simple via SQL as (say) SET SESSION wait_timeout = 900 (see the sketch below).
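
A minimal sketch of that session-level approach via PDO (placeholder credentials, not actual Piwik code):

<?php
// Raise wait_timeout only for this connection/session; the server-wide default
// stays untouched for ordinary web requests.
$db = new PDO('mysql:host=127.0.0.1;dbname=piwik', 'user', 'password');
$db->exec('SET SESSION wait_timeout = 900');

// Optional: verify that the value applied to this session.
$row = $db->query("SHOW SESSION VARIABLES LIKE 'wait_timeout'")->fetch(PDO::FETCH_ASSOC);
echo $row['Variable_name'] . ' = ' . $row['Value'] . PHP_EOL;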

@Jon-IB

Jon-IB commented Nov 13, 2018

This is not just a matter of wait_timeout. I'm trying to archive 125 days of traffic on a lot of low-traffic sites, because I'm migrating to a new Piwik instance (long story). This causes MySQL to kill the connection. If the number of days is set to 125, then I usually get a failure within 30 seconds of the start of the archiving process.

I haven't figured out if the error is because of the size of the received packet, or perhaps even the number of rows to insert is too large.

From https://dev.mysql.com/doc/refman/5.7/en/gone-away.html :

An INSERT or REPLACE statement that inserts a great many rows can also cause these sorts of errors.

My wait_timeout appears to be set to 8 hours. I see the failures when the archiver is working on sites with more traffic, but the sites in question only have ~ 3000 visits in the last couple months. That doesn't seem like a lot of data to process at a time.
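
For reference, the server limits usually involved in these errors can be checked from PHP with something like this (a rough sketch with placeholder credentials):

<?php
// Inspect the MySQL limits most often behind "server has gone away" errors.
$db = new PDO('mysql:host=127.0.0.1;dbname=piwik', 'user', 'password');

$sql = "SHOW VARIABLES WHERE Variable_name IN ('wait_timeout', 'max_allowed_packet', 'net_read_timeout', 'net_write_timeout')";
foreach ($db->query($sql)->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['Variable_name'] . ' = ' . $row['Value'] . PHP_EOL;
}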

Regardless of the cause, it would be great if the archiver recovered from this error.

@mattab
Member Author

mattab commented May 4, 2020

Thanks everyone for your feedback. We will very likely not implement this in the next few months so I'm closing this issue as wontfix for now.

@mattab mattab closed this as completed May 4, 2020
@mattab mattab added the wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. label May 4, 2020