@mgazdzik opened this Issue on March 27th 2015 Contributor

The goal of this ticket is to discuss and hopefully implement a solution allowing a single Piwik instance to query two replicated databases.

Outline of issue:

  • big Piwik instances often have master-slave db replication,
  • currently the only use (in our case) of the slave servers is to perform db backups, and in emergency cases to switch over to master to provide HA

However, slave servers could be put to much better use if we could redirect archiving queries onto the slave db.
Long-running and heavy queries could be triggered against the slave, their results processed in PHP, and the archives written to the master server.
This would significantly relieve the tracker and take away some other db performance problems that arise when a traffic peak meets long-running archiving.

For the initial scope, maybe only SELECT queries triggered by LogAggregator could be launched against the auxiliary db server, and in the future we could add an option for developers to decide which db to use.
Even putting just the basic LogAggregator queries onto the slave would give a significant performance boost.
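For illustration, here is a minimal sketch of the read-from-slave / write-to-master split using plain PDO. The host names, credentials, and the simplified table columns are assumptions, and a real implementation would go through Piwik's own Db layer rather than raw PDO:

```php
<?php
// Sketch only: heavy aggregation runs on the slave, the small result is
// written to the master. Hosts, credentials and columns are hypothetical.
$reader = new PDO('mysql:host=slave-db.example.com;dbname=piwik', 'piwik', 'secret');
$writer = new PDO('mysql:host=master-db.example.com;dbname=piwik', 'piwik', 'secret');
$reader->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$writer->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// LogAggregator-style SELECT against the slave's log tables.
$stmt = $reader->prepare(
    'SELECT COUNT(*) FROM log_visit
     WHERE idsite = ? AND visit_last_action_time >= ? AND visit_last_action_time < ?'
);
$stmt->execute([1, '2015-03-01 00:00:00', '2015-03-02 00:00:00']);
$nbVisits = (int) $stmt->fetchColumn();

// Write the aggregated value to the master's archive table (columns simplified).
$insert = $writer->prepare(
    'INSERT INTO archive_numeric_2015_03
         (idarchive, name, idsite, date1, date2, period, ts_archived, value)
     VALUES (?, ?, ?, ?, ?, ?, NOW(), ?)'
);
$insert->execute([123, 'nb_visits', 1, '2015-03-01', '2015-03-01', 1, $nbVisits]);
```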

What do you think?

@jloh commented on March 28th 2015 Contributor

Yeah, it would be great if you could specify a read-only database to use. I would be hesitant to use it for archiving or anything of the sort though; what if the data is out of sync and you don't know it?

A read-only DB would be great for use in the front end when displaying reports, so as not to take up connections on the master server.

@mattab commented on March 29th 2015 Member

This would significantly relieve the tracker and take away some other db performance problems that arise when a traffic peak meets long-running archiving.

To take away problems with Tracker performance, let's first try using QueuedTracking: https://github.com/piwik/plugin-QueuedTracking

After using this plugin, performance should get much better, as the inserts are wrapped within a transaction. Maybe we will still have tracker IO contention issues after using QueuedTracking, but let's try this first (it will happen in the next days and weeks).
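To illustrate why the queue helps, here is a minimal sketch (not the plugin's actual code) of what "wrapped within a transaction" buys: one commit for a whole batch of queued hits instead of one commit per tracking request. Table and column names are reduced to hypothetical minimums:

```php
<?php
// Sketch only: simplified stand-in for QueuedTracking-style batch insertion.
$db = new PDO('mysql:host=master-db.example.com;dbname=piwik', 'piwik', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$queuedHits = [ /* batch of tracking requests pulled from the queue */ ];

$db->beginTransaction();
try {
    $stmt = $db->prepare(
        'INSERT INTO log_link_visit_action (idsite, idvisit, server_time) VALUES (?, ?, ?)'
    );
    foreach ($queuedHits as $hit) {
        $stmt->execute([$hit['idsite'], $hit['idvisit'], $hit['server_time']]);
    }
    $db->commit(); // one fsync for the whole batch instead of one per hit
} catch (Exception $e) {
    $db->rollBack();
    throw $e;
}
```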

@tsteur commented on April 23rd 2015 Member

As @Mooash mentioned, using the slave for archiving might not work easily / might be risky, since there can be a delay on the slave etc.

Reports in the UI / API shouldn't bother the database too much, as they are more or less simple selects and there shouldn't be too many entries. Starting from Piwik 2.13 the archive blob tables should have far fewer entries and be faster. The archive tables can maybe become a problem when having like thousands of websites and thousands of segments.

It might be worth it for the Live API if one queries it a lot, but then again, if there's a delay it might show outdated data, and that's not always ok.

The problem is rather the log tables. Here, one can purge log entries, e.g. older than 6 months, when those tables become too big. Partitioning might also help; I haven't tried it yet. Other storage engines like TokuDB might help as well.
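As a sketch of that purge idea, assuming MySQL and a hypothetical 6-month retention; note that Piwik's built-in log deletion is the safer route, since related rows in log_link_visit_action / log_conversion need purging too:

```php
<?php
// Sketch only: delete old visits in small batches so the purge never holds
// locks on the (huge) log_visit table for long. Related log_* tables would
// need the same treatment.
$db = new PDO('mysql:host=master-db.example.com;dbname=piwik', 'piwik', 'secret');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

do {
    $deleted = $db->exec(
        'DELETE FROM log_visit
         WHERE visit_last_action_time < NOW() - INTERVAL 6 MONTH
         LIMIT 10000'
    );
} while ($deleted > 0);
```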

As you said, using the slave for archiving would be nice but could have side effects; it would require some investigation.

@mattab If tracker performance is too slow because of long-running archiving all the time, it will make inserting from the Queue to the DB rather slow as well. The Queue does not really bring a huge performance improvement over the Tracker, if any: it was mainly built to handle peaks easily and to avoid having too many connections to the database etc., not for performance reasons. The insertion from Queue to DB can be faster than normal tracking, but that's rather because of having things cached in PHP across requests, and not necessarily related to the DB. Depending on the number of log entries it can be faster because of transactions, but it doesn't have to be.

@mattab commented on April 27th 2015 Member

As you said, using the slave for archiving would be nice but could have side effects; it would require some investigation.

What comes to mind after quickly thinking about this:

  • when we run archiving queries on the slave, how do we make sure the slave is not "too far behind", so we don't run queries on invalid / incomplete log_* data? (see the sketch after this list)
  • to run archiving on the slave, if we query the log_* tables on the slave and insert the archive blobs into the master DB, then I guess the only way to do this is to connect to the 2 DBs at once in the same PHP process and decide where to read from and write to.
  • to do: investigate further what edge cases there are
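One possible safeguard for the first point: check the slave's reported replication lag before archiving, and skip the slave when it is too far behind. A sketch, assuming classic MySQL replication and a connection privileged to run SHOW SLAVE STATUS; the 60-second threshold is an arbitrary example:

```php
<?php
// Sketch only: refuse to run archiving SELECTs on the slave if it lags.
$reader = new PDO('mysql:host=slave-db.example.com;dbname=piwik', 'piwik', 'secret');
$reader->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$status = $reader->query('SHOW SLAVE STATUS')->fetch(PDO::FETCH_ASSOC) ?: [];
$lag = $status['Seconds_Behind_Master'] ?? null; // NULL when replication is broken

if ($lag === null || (int) $lag > 60) {
    // Slave is broken or too far behind: its log_* data is incomplete.
    throw new RuntimeException('Slave unusable for archiving, lag: ' . var_export($lag, true));
}
// ...otherwise it is safe enough to point the archiving reads at $reader.
```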
@tsteur commented on April 27th 2015 Member

Those kinda things came to my mind as well. It'd be worth it if it actually relieves the database / tracking.

Therefore, more importantly, before working on this we need to profile the database on an instance where there are problems, and collect "evidence" that this is actually a problem, that it would make things "faster", by how much, etc.
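As a starting point for that profiling, one could time a representative heavy LogAggregator-style query against both servers; a naive sketch with assumed hosts and an assumed query (real evidence would come from the slow query log, many runs, and realistic concurrency):

```php
<?php
// Sketch only: compare wall-clock time of one heavy query on master vs. slave.
$sql = 'SELECT COUNT(*) FROM log_visit WHERE visit_last_action_time >= ?';
$params = ['2015-04-01 00:00:00'];

$hosts = [
    'master' => 'master-db.example.com',
    'slave'  => 'slave-db.example.com',
];
foreach ($hosts as $name => $host) {
    $db = new PDO("mysql:host=$host;dbname=piwik", 'piwik', 'secret');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $start = microtime(true);
    $stmt = $db->prepare($sql);
    $stmt->execute($params);
    $stmt->fetchAll();
    printf("%s: %.3fs\n", $name, microtime(true) - $start);
}
```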

@quba commented on June 17th 2015 Contributor

There are 2 situations where this approach will definitely help:

  • querying the Live API (or APIs in general), especially the Live API with segments applied,
  • real-time segments: in the worst-case scenario we would have DB replication lag instead of issues with the DB used to track visitors.
@esapozhnikov commented on July 21st 2015

This feature would be very helpful.

Currently, DB deadlocks cause the front end to fail to load.
For people like myself who use a separate server for archiving, it would be very useful to have a front end with a read-only DB, so that the front end always loads reliably.

This Issue was closed on July 22nd 2019