@tsteur opened this Pull Request on June 19th 2020

MySQL wait locks can cause some performance issues on high traffic/concurrency loads. For example the following scenario:

  • There's a bulk request #1 updating various entries in a transaction
  • Another tracking request #2, eg for the same visitor, comes in
  • There may be a lock wait for #2 because it's waiting for the transaction in #1 to finish
  • By default, the tracking request may wait 30 to 50 seconds to get the lock (MySQL default)
    • More and more requests might come in for the same visitor, and suddenly you have, say, a few hundred or thousand requests waiting for 30 to 50 seconds and/or failing, which could make an auto scaling system think the servers or instances aren't healthy and consider them all unhealthy

To address this issue, the thought is to configure a shorter lock wait timeout so tracking requests don't wait on locks for too long.
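For illustration, lowering the timeout on the DB side would look roughly like this (a minimal sketch; the exact value and whether it's applied per session or globally isn't decided here):

    -- check the current lock wait timeout (the MySQL default is 50 seconds)
    SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

    -- lower it for the current connection only, e.g. to 10 seconds,
    -- so a blocked tracking request fails fast instead of queueing up for ~50s
    SET SESSION innodb_lock_wait_timeout = 10;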

When an error 1205 Lock wait timeout exceeded; try restarting transaction happens, MySQL rolls back only the current statement. Because there might be multiple inserts or updates within one tracking request, we can't really safely replay the entire tracking request (unless innodb_rollback_on_timeout=1 was enabled, which would roll back the whole transaction; see the sketch after the list below). This means for now the behaviour will be as follows:

  • For a single tracking request we don't do anything and simply fail
  • For a request within a bulk tracking request we skip only the request that hit the lock wait timeout; previously, all tracking requests in the bulk would have failed and everything would have been rolled back
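To check what gets rolled back on a timeout (purely illustrative; these are standard MySQL commands):

    -- OFF (the default) means only the statement that timed out is rolled back,
    -- so earlier changes from the same transaction are still pending
    SHOW VARIABLES LIKE 'innodb_rollback_on_timeout';

    -- the variable can't be changed at runtime; rolling back the whole
    -- transaction on a timeout would require setting it at server startup,
    -- e.g. innodb_rollback_on_timeout=1 in my.cnf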

We need to see later if we maybe need to respond with a success to the tracking request when this happens, but that's to be seen (to avoid having too many error responses making instances look unhealthy). If at all possible we don't want to do this.

A note on deadlocks

Thought I'd also quickly go into deadlocks in case someone thinks that's related; it isn't quite.

This change does not affect deadlocks which may happen as part of bulk requests from the JS tracker. Deadlocks usually don't happen for log analytics or queued tracking because they don't send bulk requests in parallel for the same visitor but process them one after another.

Depending on the MySQL distribution (MySQL, MariaDB, RDS, Aurora, ...) or version, these might be detected immediately without waiting for a few seconds. For some MySQL instances this behaviour depends on the innodb_deadlock_detect variable. FYI: having this variable enabled can slow down performance on a MySQL server under high load.
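For reference, checking and toggling that variable looks like this (a sketch; innodb_deadlock_detect is only available in newer MySQL versions and may not exist in every distribution):

    -- is immediate deadlock detection enabled? (ON by default where supported)
    SHOW VARIABLES LIKE 'innodb_deadlock_detect';

    -- turning it off can help throughput under very high concurrency;
    -- deadlocked sessions then fall back to innodb_lock_wait_timeout
    SET GLOBAL innodb_deadlock_detect = OFF;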

When a deadlock happens, MySQL rolls back the entire transaction. This means the current behaviour is that the tracking request fails and nothing in the DB is changed. It's as if the bulk request didn't happen.

We could eventually catch deadlock errors and try to replay them once. That would probably be a separate PR.

You can also work around this by configuring [Tracker]bulk_requests_use_transaction=0 in Matomo, but then you need to expect more DB CPU usage and slower average response times because more index writes happen etc.
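For reference, that setting would go into Matomo's config/config.ini.php like this (assuming the default config file location):

    [Tracker]
    bulk_requests_use_transaction = 0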

How can a lock wait timeout happen?

  • session 1: start transaction
  • session 1: update site set name = '...' where idsite = 1
  • session 2: update site set name = '...' where idsite = 1

Because session 1 has not committed yet, session 2 needs to wait for session 1 to finish before it can get the row lock to update the site.
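If session 1 keeps its transaction open past the timeout, session 2's statement eventually fails with the error mentioned above (a sketch of what session 2 would see with the default 50 second timeout):

    -- session 2, after waiting innodb_lock_wait_timeout seconds:
    update site set name = '...' where idsite = 1;
    -- ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction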

How does a deadlock happen?

Deadlocks happen when there are two DB sessions and each is waiting for a lock the other one holds. They both want the same locks but acquire them in the opposite order, like this:

  • session 1: start transaction
  • session 2: start transaction
  • session 1: update site set name = '...' where idsite = 1
  • session 2: update site set name = '...' where idsite = 2
  • session 1: update site set name = '...' where idsite = 2
  • session 2: update site set name = '...' where idsite = 1

Session 2 waits for the lock to update site 1, which session 1 holds, while session 1 waits at the same time for the lock to update site 2, which session 2 holds. Neither can proceed, so MySQL has to break the cycle.
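When that happens, MySQL detects the cycle, picks one session as the victim and rolls its transaction back with error 1213. The most recent deadlock can be inspected with a standard MySQL command (shown just for reference):

    -- the victim session sees:
    -- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

    -- shows a "LATEST DETECTED DEADLOCK" section with the statements and locks involved
    SHOW ENGINE INNODB STATUS;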

This is my understanding of how things work. I might be wrong :)

This Pull Request was closed on June 21st 2020