@mattab opened this Issue on February 20th 2019 Member

In order to improve the overall performance of Matomo (both for the Tracker and for report processing), an idea we have is to split the log_* Tracker tables into two sets of tables: hot data vs cold data.

E.g. hot data would be data for the last 48 hours or so, and cold data would be any data older than 48 hours. As a result the "hot data" table could be 100 times smaller than the cold data table (depending on the data retention settings, see https://matomo.org/docs/managing-your-databases-size/ ).

Having tables split in this way means that:

  • Tracking requests are faster (inserting into smaller tables takes less time)
  • Archiving reports is faster (querying the same datasets but over smaller log_ tables will be faster)

A few changes will be needed, e.g. a process to move data from hot to cold, updating the SQL query generators to query the right tables as applicable (and possibly getting data from both tables, e.g. for Visitor Log requests over multiple days, as sketched below), etc.
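For illustration, a Visitor Log request spanning more than the hot window might then have to read from both tables, roughly along these lines (just a sketch; table names, site ID and dates are placeholders, not a final design):

(SELECT * FROM log_visit -- hot: e.g. the last ~48 hours
  WHERE idsite = 1 AND visit_last_action_time >= '2019-02-18 00:00:00')
UNION ALL
(SELECT * FROM log_visit_cold -- cold: everything older
  WHERE idsite = 1 AND visit_last_action_time >= '2019-02-14 00:00:00'
    AND visit_last_action_time < '2019-02-18 00:00:00')
ORDER BY visit_last_action_time DESC LIMIT 100;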

@tsteur commented on February 20th 2019 Member

It'll also be beneficial when there are DB updates on log tables.

@tsteur commented on March 27th 2019 Member

Before working on this, or as part of this issue, it may be good to see if horizontal MySQL partitioning can help, e.g. PARTITION BY HASH( MONTH(visit_last_action_time) ).
I think in most queries where performance matters we already mention visit_last_action_time and/or server_time, but that would need to be checked.

@mattab commented on July 25th 2019 Member

It would be great to maybe spend 2-4 days on this project and see what is possible to achieve. Making this change in the database would greatly increase the performance of most SQL queries executed by Matomo (it also results in less CPU usage overall / more efficiency).

@tsteur commented on August 8th 2019 Member

BTW: MySQL partitions would have been great. E.g. we could have done PARTITION BY HASH(YEAR(time) * 100 + MONTH(time)) PARTITIONS 24;, which would have put the data for each month in a separate partition, e.g. when using a data retention of 24 months. But we cannot use any of the time columns for partitioning since that column would need to be present in all unique indexes, including the primary key.
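For reference, this is roughly what the attempt looks like and why MySQL rejects it (a sketch, assuming log_visit keeps its current primary key on idvisit only):

ALTER TABLE log_visit
  PARTITION BY HASH(MONTH(visit_last_action_time)) PARTITIONS 24;
-- fails with an error along the lines of:
-- ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function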

@tsteur commented on August 8th 2019 Member

I can think of many possible solutions here. Many of them have quite a few downsides or are not easy to develop, so I'm only mentioning some of them; and partitions are not really an option, as we've already seen. Two of them I would like to describe in more detail.

Writing tracking requests into two tables, e.g. log_visit and log_visit_cold (and similar for the other log tables)

  • log_visit holds only the data for the last three days and is used for live queries that affect the last 2 days and for archiving queries for yesterday and today.
  • log_visit_cold holds all entries including the most recent
  • When there is a tracking request, we write each entry into both tables
  • log_action and some other tables would likely not have a cold table, to make sure they all use the same IDs, but that might actually not be an issue as long as we never use a union
  • There's a cronjob deleting data from log_visit once a day for data that is older than, say, 3 days (we need to store data for 3 days and not 2 days because of different timezones etc.); see the sketch below
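For illustration, the daily cleanup could be a simple bounded delete, run in batches so tracking is not blocked for long (a sketch only; the values are placeholders):

-- trim the hot table to the last 3 days, in batches
DELETE FROM log_visit
 WHERE visit_last_action_time < NOW() - INTERVAL 3 DAY
 LIMIT 10000;
-- repeat until no more rows are affected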

Pros

  • We don't need to move data
  • No need to use any unions etc.; we query either the regular log table or the cold table
  • Live queries can leverage the small log table

Cons

  • Needing to write a lot more since we write into 2 tables
  • Problems when there are schema updates and the schemas of the two tables maybe mismatch etc., but that's a problem anyway...
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data still slow
  • Deleting data from hot table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly, as we have a lot of inserts and deletes there, in order to save on storage, but I'm not sure? Probably not... OPTIMIZE TABLE would block the tracking though and cause major issues

Writing tracking requests into only one table, e.g. log_visit, with data moved periodically to a log_visit_cold table (and similar for the other log tables)

  • log_visit would only contain data of the last three days
  • Will very likely only work after opting in through a setting, as it requires browser archiving to be disabled
  • log_action would quite certainly not have a cold table
  • Whenever the archiver starts, we copy all the most recent data from log_visit to log_visit_cold using transactions so we can have fast bulk inserts (see the sketch below)
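A rough sketch of that copy step, assuming idvisit can serve as a bookmark for what has already been copied (names and details are illustrative only):

-- remember the newest visit already present in the cold table
SET @last_copied := (SELECT COALESCE(MAX(idvisit), 0) FROM log_visit_cold);

START TRANSACTION;
INSERT INTO log_visit_cold
  SELECT * FROM log_visit WHERE idvisit > @last_copied;
COMMIT;

-- a separate daily job then trims log_visit down to the last 3 days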

Pros

  • Fast writes during tracking, making tracking quite a bit faster
  • When we have schema updates on log tables, these will run a lot faster on the log table that has only the last 3 days of data, meaning we can apply log table changes a LOT faster without impacting tracking much.
  • Live queries can leverage the small log table

Cons

  • Needing to shift data around
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data still slow
  • Deleting data from cold table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly, as we have a lot of inserts and deletes there, in order to save on storage, but I'm not sure? Probably not... OPTIMIZE TABLE would block the tracking though and cause major issues

Variation of this: we keep the regular log_visit table, into which we eventually move the data from all the daily tables... then we have 3-4 tables, one for each day...

  • Similar to above, before archiving we move all recent data from a hot table into log_visit etc.
  • Then we have one log table for each day of the last 3-4 (or 7) days, like log_visit_2019_08_09
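A sketch of what that could look like (dates purely illustrative):

-- one hot table per day, created ahead of time
CREATE TABLE log_visit_2019_08_09 LIKE log_visit;
-- before archiving, rows are moved from the daily hot tables into log_visit;
-- expiring a day then means dropping its table instead of a slow DELETE:
DROP TABLE IF EXISTS log_visit_2019_08_06;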

Pros

  • This solves the problem of slow deletes because we can simply drop the entire table, and it also means we never need to run OPTIMIZE TABLE on it.
  • Makes archiving even faster
  • Easy to know which data is available in which table... we can use this solution pretty much right away after enabling it (more or less)

Cons

  • We will still need to make sure, with log_action and when moving entries around, that in the end they have the same idVisit they should have... not sure we can solve this easily

Log table for each month, say log_visit_2019_12

Similar to the archive tables. As MySQL permits up to 4 billion tables, we could even do this per day.

Pros

  • Archiving segments for older data is also a bit faster

Cons

  • Lots of unions needed when calculating unique visitors for a year or for date ranges (see the sketch after this list)
  • Might need to split the initial data into all these tables, which could take a while
  • Still 30 days of data in one log table where it could have only e.g. 3 days of data, but that would still be fast-ish
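To illustrate the first con: counting unique visitors over a range spanning two months would need a union over the monthly tables, roughly like this (a sketch; table names, site ID and dates are placeholders):

SELECT COUNT(DISTINCT idvisitor) AS nb_uniq_visitors
  FROM (
    SELECT idvisitor FROM log_visit_2019_11
     WHERE idsite = 1 AND visit_last_action_time >= '2019-11-15 00:00:00'
    UNION ALL
    SELECT idvisitor FROM log_visit_2019_12
     WHERE idsite = 1 AND visit_last_action_time < '2019-12-15 00:00:00'
  ) AS all_visits;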

MySQL views with temporary tables...

  • Would need to create them on demand... they would be deleted when the connection ends and would then need to be recreated...
  • A temp table is only available to the connection that creates it... so this would only be something to improve archiving time
  • Would not improve anything on tracking etc
  • Likely taking too long to create some views
  • Could maybe even partially create views with the segment already applied? Not sure... e.g. if we change the meaning of segments to do an idVisit IN (segment query), then this could actually be a game changer (see the sketch after this list)
  • Heaps of I/O I suppose... and heaps of temporary tables => storage
  • Would also make archiving of dates in the past faster
  • Creating a temp table with 1M rows took about 30s on a server, which is not too bad => however, we actually only need to create a temp table with the idVisit, so this will make it much faster
  • Temp table does not have any indexes... so not sure how that affects things
  • We could use this technology right away without any complications
  • Wouldn't really help with unique visitor queries for week, month, ... though
  • Will help a lot with segments though, as we apply the segment only once
  • Easy to develop likely
  • Fixes the segment issue which would be otherwise hard to fix
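A rough sketch of the "apply the segment only once" idea mentioned above (the segment condition, site ID and dates are placeholders):

-- materialise the visits matching the segment once
CREATE TEMPORARY TABLE tmp_segment_idvisit
  SELECT idvisit
    FROM log_visit
   WHERE idsite = 12
     AND visit_last_action_time >= '2019-08-01 00:00:00'
     AND referer_type = 2; -- the segment condition, applied only once

-- subsequent archiving queries join against the (much smaller) temp table
-- instead of re-evaluating the segment on the full log tables
SELECT COUNT(*)
  FROM log_link_visit_action lva
  JOIN tmp_segment_idvisit t ON t.idvisit = lva.idvisit;

An index on idvisit could also be added to the temp table explicitly, which would address the "no indexes" point above.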

General

  • Need to see that e.g. the GDPR manager also deletes data from all tables (see the sketch after this list)
  • If we didn't have the unique visitors for weeks, months, ... everything would be a lot easier and we could even just use a log table per day or week or ...
  • Fast schema updates without much downtime would be a big win
  • For log_action etc. we need to make sure the IDs match up
  • If we delete data from a hot table, and an archiver has been running for a long time, like > 1 day, we would need to be careful that we don't delete data which is currently being archived, but that seems very much an edge case, and it would be safe when we keep, say, the last 3 days of data... and only use it to archive the last 2 days...
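For example, a GDPR deletion for one visitor would then have to hit every copy of the data (a sketch; the visitor ID is a placeholder):

DELETE FROM log_visit WHERE idsite = 12 AND idvisitor = UNHEX('1234567890abcdef');
DELETE FROM log_visit_cold WHERE idsite = 12 AND idvisitor = UNHEX('1234567890abcdef');
-- ...and the same for log_link_visit_action, log_conversion, etc.
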
@tsteur commented on August 14th 2019 Member

Before implementing the hot / cold tables we want to validate whether it actually makes archiving faster. E.g. the hot table would store data for the last 3 days; the question then is whether archiving today and yesterday becomes faster when using the hot tables or not.

It's clear that a hot table will make tracking faster since there will be fewer indexes and we can have different indexes on the hot vs the cold table. The inserts and updates will be faster for sure.

Another advantage of the hot table is that schema updates on the hot table will be much faster and will barely affect tracking, as there will be, say, only 3 days worth of data instead of 1, 2, 3, 4 or 5 years or more of data. So making schema changes would be a lot faster. Updating the schema on the cold table could run in the background, and we would simply wait with copying data over from hot to cold until that schema update is done.
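For illustration, using the log_visit_hot naming from the test below (the new column is purely hypothetical):

-- fast: only ~3 days of data, so tracking is barely affected
ALTER TABLE log_visit_hot ADD COLUMN example_dimension TINYINT UNSIGNED NULL;
-- slow, but can run in the background; copying from hot to cold is simply
-- paused until it completes
ALTER TABLE log_visit ADD COLUMN example_dimension TINYINT UNSIGNED NULL;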

Tests show it's also really fast to copy data over from hot to cold table.

To test archiving performance we could do things like this:

CREATE TABLE log_link_visit_action_hot LIKE log_link_visit_action;
CREATE TABLE log_visit_hot LIKE log_visit;

INSERT INTO log_visit_hot SELECT * FROM log_visit WHERE idsite IN (...) AND visit_last_action_time >= '2019-08-12 00:00:00';
INSERT INTO log_link_visit_action_hot SELECT * FROM log_link_visit_action WHERE idsite IN (...) AND server_time >= '2019-08-12 00:00:00';

SELECT SQL_NO_CACHE COUNT(*) FROM log_visit_hot LEFT JOIN log_link_visit_action_hot ON log_visit_hot.idvisit = log_link_visit_action_hot.idvisit WHERE log_visit_hot.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

SELECT SQL_NO_CACHE COUNT(*) FROM log_visit LEFT JOIN log_link_visit_action ON log_visit.idvisit = log_link_visit_action.idvisit WHERE log_visit.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

-- and at the end:
DROP TABLE log_visit_hot;
DROP TABLE log_link_visit_action_hot;

@siva538 commented on August 18th 2019 Contributor

+1.

The current challenge is that whenever we want to retrieve detailed transaction data, like the segmented visitor log for a given "event action/category/name" etc., the system has to spin through hundreds of millions of records in the log_link_visit_action table, and it is quite a burden on MySQL to go through all of them (one by one).

Instead, having multiple copies (probably partitioned by visit action date or some other attribute) would make database maintenance much easier as well, and the overall system faster.

Regds,
Sivakumar
