@mattab opened this Issue on February 20th 2019 Member

In order to improve the overall performance of Matomo (both for the Tracker and for report processing), an idea we have is to split the log_* Tracker tables into two sets of tables: hot data vs cold data.

E.g. hot data would be data for the last 48 hours or so, and cold data would be any data older than 48 hours. As a result the "hot data" table could be 100 times smaller than the cold data table (depending on the data retention settings, see https://matomo.org/docs/managing-your-databases-size/ ).

Having tables split in this way means that:

  • Tracking requests are faster (inserting into smaller tables takes less time)
  • Archiving reports is faster (querying the same datasets but over smaller log_ tables will be faster)

A few changes will be needed, e.g. a process to move data from hot to cold, updating the SQL query generators to query the right tables as applicable (and possibly getting data from both tables, e.g. for Visitor Log requests over multiple days, as sketched below), etc.
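For illustration, a Visitor Log request spanning more than the hot window might then have to read from both tables, roughly along these lines (just a sketch; table names, site ID and dates are placeholders, not a final design):

(SELECT * FROM log_visit -- hot: e.g. the last ~48 hours
  WHERE idsite = 1 AND visit_last_action_time >= '2019-02-18 00:00:00')
UNION ALL
(SELECT * FROM log_visit_cold -- cold: everything older
  WHERE idsite = 1 AND visit_last_action_time >= '2019-02-14 00:00:00'
    AND visit_last_action_time < '2019-02-18 00:00:00')
ORDER BY visit_last_action_time DESC LIMIT 100;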

@tsteur commented on February 20th 2019 Member

It'll also be beneficial when there are DB updates on log tables.

@tsteur commented on March 27th 2019 Member

Before working on this, or as part of this issue, it may be good to see if horizontal MySQL partitioning can help, e.g. PARTITION BY HASH( MONTH(visit_last_action_time) ).
I think in most queries where performance matters we already mention visit_last_action_time and/or server_time, but that would need to be checked.

@mattab commented on July 25th 2019 Member

It would be great to maybe spend 2-4 days on this project and see what is possible to achieve. Making this change in the database would greatly increase the performance of most SQL queries executed by Matomo (it also results in less CPU usage overall / more efficiency).

@tsteur commented on August 8th 2019 Member

BTW: MySQL partitions would have been great. E.g. we could have done PARTITION BY HASH(YEAR(time) * 100 + MONTH(time)) PARTITIONS 24;, which would have put the data for each month in a separate partition, e.g. when using a data retention of 24 months. But we cannot use any of the time columns for partitioning since that column would need to be present in all unique indexes, including the primary key.
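For reference, this is roughly what the attempt looks like and why MySQL rejects it (a sketch, assuming log_visit keeps its current primary key on idvisit only):

ALTER TABLE log_visit
  PARTITION BY HASH(MONTH(visit_last_action_time)) PARTITIONS 24;
-- fails with an error along the lines of:
-- ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function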

@tsteur commented on August 8th 2019 Member

I can think of many possible solutions here. Many of them have quite a few downsides or are not easy to develop, so I'm only mentioning some of them; and partitions are not really an option, as we've already seen. Two of them I would like to describe in more detail.

Writing tracking requests into two tables, e.g. log_visit and log_visit_cold (and similar for the other log tables)

  • log_visit holds only the data for the last three days and is used for live queries that affect the last 2 days and for archiving queries for yesterday and today.
  • log_visit_cold holds all entries including the most recent
  • When there is a tracking request, we write each entry into both tables
  • log_action and some other tables would likely not have a cold table, to make sure they all use the same IDs, but that might actually not be an issue as long as we never use a union
  • There's a cronjob deleting data from log_visit once a day for data that is older than, say, 3 days (we need to store data for 3 days and not 2 days because of different timezones etc.); see the sketch below
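For illustration, the daily cleanup could be a simple bounded delete, run in batches so tracking is not blocked for long (a sketch only; the values are placeholders):

-- trim the hot table to the last 3 days, in batches
DELETE FROM log_visit
 WHERE visit_last_action_time < NOW() - INTERVAL 3 DAY
 LIMIT 10000;
-- repeat until no more rows are affected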

Pros

  • We don't need to move data
  • No need to use any unions etc.; we query either the regular log table or the cold table
  • Live queries can leverage the small log table

Cons

  • Needing to write a lot more since we write into 2 tables
  • Problems when there are schema updates and the schemas of the two tables maybe mismatch etc., but that's a problem anyway...
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data still slow
  • Deleting data from hot table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly, as we have a lot of inserts and deletes there, in order to save on storage, but I'm not sure? Probably not... OPTIMIZE TABLE would block the tracking though and cause major issues

Writing tracking requests into only one table, e.g. log_visit, with data moved periodically to a log_visit_cold table (and similar for the other log tables)

  • log_visit would only contain data of the last three days
  • Will very likely only work after opting in through a setting, as it requires browser archiving to be disabled
  • log_action would quite certainly not have a cold table
  • Whenever the archiver starts, we copy all the most recent data from log_visit to log_visit_cold using transactions so we can have fast bulk inserts (see the sketch below)
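A rough sketch of that copy step, assuming idvisit can serve as a bookmark for what has already been copied (names and details are illustrative only):

-- remember the newest visit already present in the cold table
SET @last_copied := (SELECT COALESCE(MAX(idvisit), 0) FROM log_visit_cold);

START TRANSACTION;
INSERT INTO log_visit_cold
  SELECT * FROM log_visit WHERE idvisit > @last_copied;
COMMIT;

-- a separate daily job then trims log_visit down to the last 3 days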

Pros

  • Fast writes during tracking, making tracking quite a bit faster
  • When we have schema updates on log tables, these will run a lot faster on the log table that has only the last 3 days of data, meaning we can apply log table changes a LOT faster without impacting tracking much.
  • Live queries can leverage the small log table

Cons

  • Needing to shift data around
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data still slow
  • Deleting data from cold table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly, as we have a lot of inserts and deletes there, in order to save on storage, but I'm not sure? Probably not... OPTIMIZE TABLE would block the tracking though and cause major issues

Variation of this: we keep the regular log_visit table, into which we eventually move the data from all the daily tables... then we have 3-4 tables, one for each day...

  • Similar to above, before archiving we move all recent data from a hot table into log_visit etc.
  • Then we have one log table for each day of the last 3-4 (or 7) days, like log_visit_2019_08_09
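A sketch of what that could look like (dates purely illustrative):

-- one hot table per day, created ahead of time
CREATE TABLE log_visit_2019_08_09 LIKE log_visit;
-- before archiving, rows are moved from the daily hot tables into log_visit;
-- expiring a day then means dropping its table instead of a slow DELETE:
DROP TABLE IF EXISTS log_visit_2019_08_06;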

Pros

  • This solves the problem of slow deletes because we can simply drop the entire table, and it also means we never need to run OPTIMIZE TABLE on it.
  • Makes archiving even faster
  • Easy to know which data is available in which table... we can use this solution pretty much right away after enabling it (more or less)

Cons

  • We will still need to make sure, with log_action and when moving entries around, that in the end they have the same idVisit they should have... not sure we can solve this easily

Log table for each month, say log_visit_2019_12

Similar to the archive tables. As MySQL permits up to 4 billion tables, we could even do this per day.

Pros

  • Archiving segments for older data is also a bit faster

Cons

  • Lots of unions needed when calculating unique visitors for a year or for date ranges (see the sketch after this list)
  • Might need to split the initial data into all these tables, which could take a while
  • Still 30 days of data in one log table where it could have only e.g. 3 days of data, but that would still be fast-ish
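To illustrate the first con: counting unique visitors over a range spanning two months would need a union over the monthly tables, roughly like this (a sketch; table names, site ID and dates are placeholders):

SELECT COUNT(DISTINCT idvisitor) AS nb_uniq_visitors
  FROM (
    SELECT idvisitor FROM log_visit_2019_11
     WHERE idsite = 1 AND visit_last_action_time >= '2019-11-15 00:00:00'
    UNION ALL
    SELECT idvisitor FROM log_visit_2019_12
     WHERE idsite = 1 AND visit_last_action_time < '2019-12-15 00:00:00'
  ) AS all_visits;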

MySQL views with temporary tables...

  • Would need to create them on demand... they would be deleted when the connection ends and would then need to be recreated...
  • A temp table is only available to the connection that creates it... so this would only be something to improve archiving time
  • Would not improve anything on tracking etc
  • Likely taking too long to create some views
  • Could maybe even partially create views with the segment already applied? Not sure... e.g. if we change the meaning of segments to do an idVisit IN (segment query), then this could actually be a game changer (see the sketch after this list)
  • Heaps of I/O I suppose... and heaps of temporary tables => storage
  • Would also make archiving of dates in the past faster
  • Creating a temp table with 1M rows took about 30s on a server, which is not too bad => however, we actually only need to create a temp table with the idVisit, so this will make it much faster
  • Temp table does not have any indexes... so not sure how that affects things
  • We could use this technology right away without any complications
  • Wouldn't really help with unique visitor queries for week, month, ... though
  • Will help a lot with segments though, as we apply the segment only once
  • Easy to develop likely
  • Fixes the segment issue which would be otherwise hard to fix
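A rough sketch of the "apply the segment only once" idea mentioned above (the segment condition, site ID and dates are placeholders):

-- materialise the visits matching the segment once
CREATE TEMPORARY TABLE tmp_segment_idvisit
  SELECT idvisit
    FROM log_visit
   WHERE idsite = 12
     AND visit_last_action_time >= '2019-08-01 00:00:00'
     AND referer_type = 2; -- the segment condition, applied only once

-- subsequent archiving queries join against the (much smaller) temp table
-- instead of re-evaluating the segment on the full log tables
SELECT COUNT(*)
  FROM log_link_visit_action lva
  JOIN tmp_segment_idvisit t ON t.idvisit = lva.idvisit;

An index on idvisit could also be added to the temp table explicitly, which would address the "no indexes" point above.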

General

  • Need to see that e.g. the GDPR manager also deletes data from all tables (see the sketch after this list)
  • If we didn't have the unique visitors for weeks, months, ... everything would be a lot easier and we could even just use a log table per day or week or ...
  • Fast schema updates without much downtime would be a big win
  • For log_action etc. we need to make sure the IDs match up
  • If we delete data from a hot table, and an archiver has been running for a long time, like > 1 day, we would need to be careful that we don't delete data which is currently being archived, but that seems very much an edge case, and it would be safe when we keep, say, the last 3 days of data... and only use it to archive the last 2 days...
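For example, a GDPR deletion for one visitor would then have to hit every copy of the data (a sketch; the visitor ID is a placeholder):

DELETE FROM log_visit WHERE idsite = 12 AND idvisitor = UNHEX('1234567890abcdef');
DELETE FROM log_visit_cold WHERE idsite = 12 AND idvisitor = UNHEX('1234567890abcdef');
-- ...and the same for log_link_visit_action, log_conversion, etc.
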
@tsteur commented on August 14th 2019 Member

Before implementing the hot / cold tables we want to validate whether it actually makes archiving faster. E.g. the hot table would store data for the last 3 days; the question then is whether archiving today and yesterday becomes faster when using the hot tables or not.

It's clear that a hot table will make tracking faster since there will be fewer indexes and we can have different indexes on the hot vs the cold table. The inserts and updates will be faster for sure.

Another advantage of the hot table is that schema updates on the hot table will be much faster and will barely affect tracking, as there will be, say, only 3 days worth of data instead of 1, 2, 3, 4 or 5 years or more of data. So making schema changes would be a lot faster. Updating the schema on the cold table could run in the background, and we would simply wait with copying data over from hot to cold until that schema update is done.
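For illustration, using the log_visit_hot naming from the test below (the new column is purely hypothetical):

-- fast: only ~3 days of data, so tracking is barely affected
ALTER TABLE log_visit_hot ADD COLUMN example_dimension TINYINT UNSIGNED NULL;
-- slow, but can run in the background; copying from hot to cold is simply
-- paused until it completes
ALTER TABLE log_visit ADD COLUMN example_dimension TINYINT UNSIGNED NULL;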

Tests show it's also really fast to copy data over from hot to cold table.

To test archiving performance we could do things like this:

CREATE TABLE log_link_visit_action_hot LIKE log_link_visit_action;
CREATE TABLE log_visit_hot LIKE log_visit;

INSERT INTO log_visit_hot SELECT * FROM log_visit WHERE idsite IN (...) AND visit_last_action_time >= '2019-08-12 00:00:00';
INSERT INTO log_link_visit_action_hot SELECT * FROM log_link_visit_action WHERE idsite IN (...) AND server_time >= '2019-08-12 00:00:00';

SELECT SQL_NO_CACHE COUNT(*) FROM log_visit_hot LEFT JOIN log_link_visit_action_hot ON log_visit_hot.idvisit = log_link_visit_action_hot.idvisit WHERE log_visit_hot.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

SELECT SQL_NO_CACHE COUNT(*) FROM log_visit LEFT JOIN log_link_visit_action ON log_visit.idvisit = log_link_visit_action.idvisit WHERE log_visit.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

-- and at the end:
DROP TABLE log_visit_hot;
DROP TABLE log_link_visit_action_hot;

@siva538 commented on August 18th 2019 Contributor

+1.

The current challenge is that whenever we want to retrieve detailed transaction data, like the segmented visitor log for a given "event action/category/name" etc., the system has to spin through hundreds of millions of records in the log_link_visit_action table, and it is quite a burden on MySQL to go through all of them (one by one).

Instead, having multiple copies (probably partitioned by visit action date or some other attribute) would make database maintenance much easier as well, and the overall system faster.

Regds,
Sivakumar
