
DB: Split log_* Tracker tables into hot/cold for faster tracking API requests & reports archiving #14119

Closed
mattab opened this issue Feb 20, 2019 · 12 comments
Labels: c: Performance, Major, wontfix
Comments

mattab (Member) commented Feb 20, 2019

In order to improve the overall performance of Matomo (both tracking and report processing), an idea we have is to split the log_* Tracker tables into two sets of tables: hot data vs. cold data.

E.g. hot data would be data from the last 48 hours or so; cold data would be any data older than that. As a result, the "hot data" tables could be 100 times smaller than the cold data tables (depending on data retention settings, see https://matomo.org/docs/managing-your-databases-size/).

Having tables split in this way means that:

  • Tracking requests are faster (inserting into smaller tables takes less time)
  • Archiving reports is faster (querying the same datasets over smaller log_* tables will be faster)

A few changes would be needed, e.g. a process to move data from hot to cold, updating the SQL query generators to query the right tables as applicable (and possibly fetching data from both tables, e.g. for visitor log requests spanning multiple days), etc.

mattab added the Major and c: Performance labels on Feb 20, 2019
tsteur (Member) commented Feb 20, 2019

It'll also be beneficial when there are DB schema updates on the log tables.

tsteur (Member) commented Mar 27, 2019

Before working on this, or as part of this issue, it may be good to check whether horizontal MySQL partitioning can help, e.g. PARTITION BY HASH( MONTH(visit_last_action_time) ).
I think most queries where performance matters already mention visit_last_action_time and/or server_time, but that would need to be checked.
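To illustrate what this would look like, here is a minimal sketch (the table is a stripped-down stand-in, not Matomo's actual log_visit schema):

-- Hypothetical example of the partitioning proposed above. Note that MySQL
-- requires every unique key, including the primary key, to contain all
-- columns used in the partitioning expression (see the later comment).
CREATE TABLE log_visit_partitioned (
    idvisit BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    idsite INT UNSIGNED NOT NULL,
    visit_last_action_time DATETIME NOT NULL,
    PRIMARY KEY (idvisit, visit_last_action_time)
) PARTITION BY HASH ( MONTH(visit_last_action_time) ) PARTITIONS 12;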

mattab (Member, Author) commented Jul 25, 2019

It would be great to maybe spend 2-4 days on this project and see what is possible to achieve. Making this change in the database would greatly increase the performance of most SQL queries executed by Matomo (and also result in less CPU usage overall, i.e. be more efficient).

tsteur (Member) commented Aug 8, 2019

BTW: MySQL partitions would have been great. E.g. we could have done something like PARTITION BY HASH( YEAR(time) * 12 + MONTH(time) ) PARTITIONS 24, which would have put the data for each month in a separate partition, e.g. when using a data retention of 24 months. But we cannot use any of the time columns for partitioning, since that column would need to be present in all unique indexes, including the primary key.
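As a hypothetical demonstration of that constraint: with log_visit's primary key on idvisit alone, MySQL rejects the attempt outright:

-- This fails because visit_last_action_time is not part of the primary key:
ALTER TABLE log_visit
    PARTITION BY HASH ( MONTH(visit_last_action_time) ) PARTITIONS 24;
-- ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the
-- table's partitioning function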

tsteur (Member) commented Aug 8, 2019

I can think of many possible solutions here. Many of them have quite a few downsides or are not easy to develop, so I'm only mentioning some of them; and partitioning is not really an option, as we've already seen. Two of them I would like to describe in more detail.

Writing tracking requests into two tables, e.g. log_visit and log_visit_cold (and similar for the other log tables)

  • log_visit holds only the data for the last three days; it is used for live queries that affect the last 2 days and for archiving queries for yesterday and today.
  • log_visit_cold holds all entries, including the most recent
  • When there is a tracking request, we write each entry into both tables
  • log_action and some other tables would likely not have a cold table, to make sure they all use the same IDs; but that might actually not be an issue as long as we never use a union
  • There's a cron job deleting data older than, say, 3 days from log_visit once a day (we need to store data for 3 days and not 2 days because of different timezones etc.); see the sketch after the lists below

Pros

  • We don't need to move data
  • No need to use any unions etc.; we either query the regular log table or the cold table
  • Live queries can leverage the small log table

Cons

  • Needing to write a lot more, since we write into 2 tables
  • Problems when there are schema updates and the schemas mismatch etc., but that's a problem anyway…
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data is still slow
  • Deleting data from the hot table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly to save on storage, as we have a lot of inserts and deletes there, but I'm not sure? Probably not… the OPTIMIZE TABLE would block tracking though and cause major issues
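A minimal sketch of this dual-write option (column lists shortened; the log_visit_cold table and the 3-day window are as proposed above, not existing Matomo behaviour):

-- On each tracking request, write the visit into the cold table first and
-- reuse its auto-generated idvisit for the hot copy, so IDs stay in sync:
INSERT INTO log_visit_cold (idsite, visit_last_action_time /* , ... */)
    VALUES (1, NOW() /* , ... */);
INSERT INTO log_visit (idvisit, idsite, visit_last_action_time /* , ... */)
    VALUES (LAST_INSERT_ID(), 1, NOW() /* , ... */);

-- Daily cron job: prune the hot table down to the last 3 days:
DELETE FROM log_visit
    WHERE visit_last_action_time < NOW() - INTERVAL 3 DAY;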

Writing tracking requests into only one table, e.g. log_visit, and moving data periodically to a log_visit_cold table (and similar for the other log tables)

  • log_visit would only contain data of the last three days
  • Will very likely only work after opting in through a setting, as it requires browser archiving to be disabled
  • log_action would quite certainly not have a cold table
  • Whenever the archiver starts, we copy the most recent data from log_visit to log_visit_cold, using transactions so we can have fast bulk inserts (see the sketch after the lists below)

Pros

  • Fast writes during tracking, making tracking quite a bit faster
  • When we have schema updates on log tables, these will run a lot faster on the log table that has only the last 3 days of data, meaning we can apply log table changes a LOT faster without much impact on tracking.
  • Live queries can leverage the small log table

Cons

  • Needing to shift data around
  • Especially when we start using it, we need to know exactly which data for which date range is available in which table
  • Archiving segments for older data is still slow
  • Deleting data from the cold table may be a bit slow
  • I suppose we would need to run OPTIMIZE TABLE regularly to save on storage, as we have a lot of inserts and deletes there, but I'm not sure? Probably not… the OPTIMIZE TABLE would block tracking though and cause major issues
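A minimal sketch of the periodic move step (the @last_sync bookmark is hypothetical, and the real cutoffs would need to account for site timezones):

SET @last_sync = '2019-08-13 00:00:00';

START TRANSACTION;

-- Copy rows tracked since the last sync into the cold table:
INSERT INTO log_visit_cold
    SELECT * FROM log_visit WHERE visit_last_action_time >= @last_sync;

-- Prune the hot table down to the last 3 days:
DELETE FROM log_visit
    WHERE visit_last_action_time < NOW() - INTERVAL 3 DAY;

COMMIT;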

Variation of this: we keep the regular log_visit table, into which all data is eventually moved… and additionally we have 3-4 tables, one for each day…

  • Similar to the above, before archiving we move all recent data from a hot table into log_visit etc. (see the sketch after the lists below)
  • Then, for each day of the last 3-4 (or 7) days, we have one log table like log_visit_2019_08_09

Pros

  • This solves the problem of slow deletes, because we can simply drop an entire daily table, and it makes sure we never need to run OPTIMIZE TABLE on it.
  • Makes archiving even faster
  • Easy to know which data is available in which table… we could use this solution pretty much right away after enabling it (more or less)

Cons

  • We will still need to make sure, for log_action and when moving entries around, that in the end the entries have the same idVisit they should have… not sure we can solve this easily
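A minimal sketch of the daily-table lifecycle (the daily table name follows the naming proposed above; everything else is hypothetical):

-- Create a hot table for the day:
CREATE TABLE log_visit_2019_08_09 LIKE log_visit;

-- Before archiving, move a finished day into the main table; SELECT *
-- carries idvisit over, keeping IDs identical to what was tracked:
INSERT INTO log_visit SELECT * FROM log_visit_2019_08_09;

-- Expiring a day is a cheap DROP instead of a slow row-by-row DELETE:
DROP TABLE log_visit_2019_08_09;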

Log table for each month, say log_visit_2019_12

Similar to the archive tables. As MySQL permits up to 4 billion tables, we could even do this per day.

Pros

  • Archiving segments for older data would also be a bit faster

Cons

  • Lots of unions needed when calculating unique visitors for a year or for date ranges (see the sketch below)
  • Might need to split the initial data into all these tables, which could take a while
  • Still 30 days of data in one log table where it could hold e.g. only 3 days of data, but that would still be fast-ish
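For example, a minimal sketch of the kind of union the unique visitor calculation would need across two hypothetical monthly tables:

-- Unique visitors for a range spanning November and December 2019:
SELECT COUNT(DISTINCT idvisitor) FROM (
    SELECT idvisitor FROM log_visit_2019_11 WHERE idsite = 1
    UNION ALL
    SELECT idvisitor FROM log_visit_2019_12 WHERE idsite = 1
) AS all_visits;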

MySQL views with temporary tables...

  • Would need to create them on demand… they would be deleted when the connection ends and would then need to be recreated…
  • A temp table is only available to the user that creates it… so this would only be something to improve archiving time
  • Would not improve anything on tracking etc.
  • Likely taking too long to create some views
  • Could maybe even create views with the segment partially applied already? Not sure… e.g. if we change the meaning of segments to do an idVisit IN (segment query), then this could actually be a game changer (see the sketch after this list)
  • Heaps of I/O I suppose… and heaps of temporary tables => storage
  • Would also make archiving of dates in the past faster
  • Creating a temp table with 1M rows took about 30s on a server, which is not too bad => however, we actually only need to create a temp table with the idVisit, so this will make it much faster
  • A temp table does not have any indexes… so not sure how that affects things
  • We could use this technique right away without any complications
  • Wouldn't really help with unique visitor queries for week, month, … though
  • Will help a lot with segments though, as we apply the segment only once
  • Likely easy to develop
  • Fixes the segment issue, which would otherwise be hard to fix
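A minimal sketch of that idea (the segment condition and the segmented_visits name are made up for illustration): materialise the matching idvisit set once, then let archiving queries join against it instead of re-applying the segment every time.

-- Build the segment's visit set once per archiving run:
CREATE TEMPORARY TABLE segmented_visits
    SELECT idvisit FROM log_visit
    WHERE idsite = 1
        AND visit_last_action_time >= '2019-08-13 00:00:00'
        AND referer_type = 2;  -- hypothetical segment: visits from search engines

-- Archiving queries then join against the small temp table:
SELECT COUNT(*)
FROM segmented_visits s
    JOIN log_link_visit_action a ON a.idvisit = s.idvisit;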

General

  • Need to make sure that e.g. the GDPR manager also deletes data from all tables
  • If we didn't have unique visitors for weeks, months, … everything would be a lot easier, and we could even just use a log table per day or week or …
  • Fast schema updates without much downtime would be a big win
  • log_action etc.: need to make sure the IDs match up
  • If we delete data from a hot table, and an archiver has been running for a long time, like > 1 day, we would need to be careful that we don't delete data which is currently being archived; but that seems very much an edge case, and it would be safe if we keep, say, the last 3 days of data… and only use it to archive the last 2 days…

tsteur (Member) commented Aug 14, 2019

Before implementing the hot/cold tables we want to validate whether it actually makes archiving faster. E.g. if the hot table stores data for the last 3 days, the question is: does it become faster to archive today and yesterday when using the hot tables or not?

It's clear that a hot table will make tracking faster, since there will be fewer indexes and we can have different indexes on the hot vs. cold table. The inserts and updates would be faster for sure.

Another advantage of the hot table is that schema updates on it will be much faster and basically barely affect tracking, as there will be, say, only 3 days' worth of data instead of 1, 2, 3, 4 or 5 years or more. So making schema changes would be a lot faster. Updating the schema on the cold table could run in the background, and we would simply delay copying data over from hot to cold until that schema update is done.

Tests show it's also really fast to copy data over from the hot to the cold table.

To test archiving performance we could do things like this:

CREATE TABLE log_link_visit_action_hot LIKE log_link_visit_action;
CREATE TABLE log_visit_hot LIKE log_visit;

-- Populate the hot tables with only the most recent data:
INSERT INTO log_visit_hot SELECT * FROM log_visit WHERE idsite IN (...) AND visit_last_action_time >= '2019-08-12 00:00:00';
INSERT INTO log_link_visit_action_hot SELECT * FROM log_link_visit_action WHERE idsite IN (...) AND server_time >= '2019-08-12 00:00:00';

-- Run the same archiving-style query against the hot tables ...
SELECT SQL_NO_CACHE COUNT(*) FROM log_visit_hot LEFT JOIN log_link_visit_action_hot ON log_visit_hot.idvisit = log_link_visit_action_hot.idvisit WHERE log_visit_hot.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

-- ... and against the full tables:
SELECT SQL_NO_CACHE COUNT(*) FROM log_visit LEFT JOIN log_link_visit_action ON log_visit.idvisit = log_link_visit_action.idvisit WHERE log_visit.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;

-- And at the end:
DROP TABLE log_visit_hot;
DROP TABLE log_link_visit_action_hot;

siva538 (Contributor) commented Aug 18, 2019

+1.

The current challenge is that whenever we want to retrieve detailed transaction data, like the segmented visitor log for a given "event action/category/name" etc., the system has to scan through hundreds of millions of records in the log_link_visit_action table, which is quite a burden on MySQL (going through them one by one).

Instead, having multiple tables (probably partitioned by visit action date or some other attribute) would make database maintenance much easier as well, and the overall system faster.

Regards,
Sivakumar

mattab (Member, Author) commented Oct 22, 2019

This improvement has a lot of potential to reduce the IO used by Matomo!

> MySQL views with temporary tables...

Now that we implemented the faster archiving in #14761, this solution above is not applicable anymore, I suppose?

> Variation of this: we keep the regular log_visit table, into which all data is eventually moved… and additionally we have 3-4 tables, one for each day…

This solution sounds the most optimal so far! 🚀

> the question is: does it become faster to archive today and yesterday when using the hot tables or not?

It would be interesting to measure for sure 👍 I reckon it will make archiving a bit faster: fewer index lookups, and fewer read IOs because of the smaller tables to join…

> Fast schema updates without much downtime would be a big win

Yes! Also, no longer needing OPTIMIZE TABLE would be a big win, as it has caused issues in the past.

tsteur (Member) commented Oct 22, 2019

> the question is: does it become faster to archive today and yesterday when using the hot tables or not?
>
> It would be interesting to measure for sure 👍 I reckon it will make archiving a bit faster: fewer index lookups, and fewer read IOs because of the smaller tables to join…

I did that, and it wasn't making it much faster AFAIK.

Also, I think the "IO problem" can't really be solved by this, since the IO is mostly related to tracking, and it will quite likely be approximately the same IO no matter whether it tracks into one big table or a smaller table per month or day… Maybe the index update on each tracking request would be a bit smaller, but it wouldn't save that much on IO, I suppose.

tsteur (Member) commented Oct 26, 2019

What makes the hot/cold tables very hard / not worth implementing is the site timezones, btw. They mean we would need to rewrite the archivers to use unions in SQL queries, no matter whether the data is sharded by day, week or month. Even in tracking we would sometimes need to look at multiple log tables for different months/weeks to find a previous visit. For live queries it means executing e.g. one query per month, which can be slowish too. It's not really doable, I reckon.

tsteur (Member) commented Dec 16, 2019

Can we close this issue? See the last comment.

mattab (Member, Author) commented Jan 8, 2020

@tsteur 👍 we can close it and always re-open later if we decide it's worth it again.

tsteur closed this as completed on Jan 8, 2020
mattab added the wontfix label on Jan 16, 2020