DB: Split log_* Tracker tables into hot/cold for faster tracking API requests & reports archiving #14119
Comments
It'll also be beneficial when there are DB updates on the log tables.
Before working on this, or as part of this issue, it may be good to see whether horizontal MySQL partitioning can help.
It would be great to spend maybe 2-4 days on this project and see what is possible to achieve. Making this change in the database would greatly increase the performance of most SQL queries executed by Matomo (and also result in lower CPU usage overall / more efficiency).
BTW: MySQL partitions would have been great. Like we could have done …
I can think of many possible solutions here. Many of them have quite a few downsides or are not easy to develop, so I'm only mentioning some. And partitions are not really an option, as we've already seen. Two of them I would like to mention in more detail.

Writing tracking requests into two tables, eg log_visit and log_visit_cold (and similar for the other log tables)
Pros
Cons
Writing tracking requests into only one table eg log_visit and data is moved periodically to a log_visit_cold table (and similar for other log tables)
Pros
Cons
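The "write to one table, move periodically" option above can be sketched as a background job that copies rows older than a cutoff into the cold table and then deletes them from the hot table in one transaction. This is a hypothetical illustration using SQLite (Matomo itself runs on MySQL); the table and column names follow the comment above, and the job body is an assumption, not Matomo's actual implementation:

```python
import sqlite3

def move_hot_to_cold(conn, cutoff):
    """Copy rows older than `cutoff` into the cold table, then delete them
    from the hot table, inside one transaction so readers never see a gap."""
    with conn:  # sqlite3 context manager wraps this in BEGIN/COMMIT
        conn.execute(
            "INSERT INTO log_visit_cold SELECT * FROM log_visit "
            "WHERE visit_last_action_time < ?", (cutoff,))
        conn.execute(
            "DELETE FROM log_visit WHERE visit_last_action_time < ?", (cutoff,))

# Minimal demo schema (real log_visit has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_visit (idvisit INTEGER, visit_last_action_time TEXT)")
conn.execute("CREATE TABLE log_visit_cold (idvisit INTEGER, visit_last_action_time TEXT)")
conn.executemany("INSERT INTO log_visit VALUES (?, ?)",
                 [(1, "2019-08-10 12:00:00"), (2, "2019-08-14 09:30:00")])

move_hot_to_cold(conn, "2019-08-12 00:00:00")
print(conn.execute("SELECT COUNT(*) FROM log_visit").fetchone()[0])       # rows kept hot
print(conn.execute("SELECT COUNT(*) FROM log_visit_cold").fetchone()[0])  # rows moved cold
```

Running the move inside a single transaction is what makes the "tracker only ever writes to one small table" property safe: a concurrent read sees the row either in hot or in cold, never in neither.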
A variation of this: we keep the regular log_visit table from which we move data out into the other tables... then we have 3-4 tables for each day...
Pro
Cons
A log table for each month, say log_visit_2019_12, similar to the archive tables. As MySQL permits up to 4 billion tables, we could even do this per day.
Pros
Cons
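The per-month naming described above mirrors how the archive tables are named. As a hedged sketch (the helper name is hypothetical, not Matomo code), routing a row or query to its shard is just string formatting on the timestamp:

```python
from datetime import datetime

def log_table_for(base, ts):
    """Return the per-month shard name for a timestamp,
    e.g. log_table_for("log_visit", ...) -> "log_visit_2019_12"."""
    return f"{base}_{ts.year:04d}_{ts.month:02d}"

print(log_table_for("log_visit", datetime(2019, 12, 3)))  # log_visit_2019_12
```

A per-day variant would only need to append `_{ts.day:02d}`; the trade-off discussed in this thread is that every query spanning more than one shard then needs a UNION.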
MySQL views with temporary tables...
General
Before implementing the hot/cold tables we want to validate whether it actually makes archiving faster. Eg if the hot table stores data for the last 3 days, the question is whether it becomes faster to archive today and yesterday when using the hot tables or not. It's clear that a hot table will make tracking faster, since there will be fewer indexes and we can have different indexes on the hot vs the cold table. The inserts and updates will be faster for sure.

Another advantage of the hot table is that schema updates on it will be much faster and will barely affect tracking, as there will be say only 3 days' worth of data instead of 1 to 5 years or more. So schema changes will be a lot faster. Updating the schema on the cold table could run in the background, and we would simply delay copying data over from hot to cold until that schema update is done. Tests show it's also really fast to copy data over from the hot to the cold table.

To test archiving performance we could do things like this:

CREATE TABLE log_link_visit_action_hot LIKE log_link_visit_action;
CREATE TABLE log_visit_hot LIKE log_visit;
INSERT INTO log_visit_hot SELECT * FROM log_visit WHERE idsite IN (...) AND visit_last_action_time >= '2019-08-12 00:00:00';
INSERT INTO log_link_visit_action_hot SELECT * FROM log_link_visit_action WHERE idsite IN (...) AND server_time >= '2019-08-12 00:00:00';
SELECT SQL_NO_CACHE COUNT(*) FROM log_visit_hot LEFT JOIN log_link_visit_action_hot ON log_visit_hot.idvisit = log_link_visit_action_hot.idvisit WHERE log_visit_hot.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;
SELECT SQL_NO_CACHE COUNT(*) FROM log_visit LEFT JOIN log_link_visit_action ON log_visit.idvisit = log_link_visit_action.idvisit WHERE log_visit.idsite = 12 AND visit_last_action_time >= '2019-08-13 00:00:00' AND visit_last_action_time < '2019-08-14 00:00:00' AND time_spent_ref_action > 10;
-- and at the end:
DROP TABLE log_visit_hot;
DROP TABLE log_link_visit_action_hot;
+1. The current challenge is that whenever we want to retrieve detailed transaction data, like a segmented visitor log for a given "event action/category/name" etc, the system has to scan through hundreds of millions of records in the log_link_visit_action table, and it is quite a burden on MySQL to scan through all of them one by one. Instead, having multiple copies (probably partitioned by visit action date or some other attribute) would make database maintenance much easier and the overall system faster. Regds,
This improvement has a lot of potential to reduce IO used by Matomo!
This solution sounds the most optimal so far! 🚀
It'll be interesting to measure for sure 👍 I reckon it will make archiving a bit faster: fewer index lookups and less read IO because of smaller tables to join...
Yes! Also, no longer needing to run OPTIMIZE TABLE would be a big win, as it has caused issues in the past.
I did that and it wasn't making it much faster AFAIK. Also I think the "IO problem" can't really be solved by this, since the IO is mostly related to tracking, and it will quite likely be approximately the same IO whether it tracks into one big table or a smaller table per month or day... Maybe the index update on each tracking request would be a bit smaller, but it wouldn't save that much IO I suppose.
What makes the hot/cold tables very hard / not worth implementing are the site timezones, btw. They mean we would need to rewrite the archivers to use UNIONs in SQL queries, no matter whether the data is sharded by day, week or month... Even in tracking we would sometimes need to look at multiple log tables for different months/weeks to find a previous visit. For live queries it means executing eg one query per month, which can be slow too. It's not really doable I reckon.
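To make the UNION problem concrete: with per-month shards, any query whose date range touches several months must be rewritten to read every shard in that range. This is a hypothetical sketch of such a query builder (names are assumptions, not Matomo's actual query generator):

```python
from datetime import date

def months_between(start, end):
    """Yield (year, month) for each calendar month the date range touches."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield y, m
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def union_query(base, start, end):
    """Build a UNION ALL over every monthly shard the range spans."""
    parts = [f"SELECT * FROM {base}_{y:04d}_{m:02d}"
             for y, m in months_between(start, end)]
    return " UNION ALL ".join(parts)

# A range spanning Nov 2019 - Jan 2020 already needs three shards.
print(union_query("log_visit", date(2019, 11, 20), date(2020, 1, 5)))
```

This also illustrates the live-query cost mentioned above: a "last 12 months" visitor log would fan out into a 12-way UNION, and a tracker looking for a previous visit may have to probe several shards.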
Can we close this issue? See the last comment.
@tsteur 👍 we can close it and always re-open later if we decide it's worth it again. |
In order to improve the overall performance of Matomo (both tracking and report processing), an idea we have is to split the log_* Tracker tables into two sets of tables: hot data vs cold data.
Eg hot data would be the data for the last 48 hours or so; cold data would be anything older than that. As a result the "hot data" table could be 100 times smaller than the cold data table (depending on data retention settings, see https://matomo.org/docs/managing-your-databases-size/ ).
Having tables split in this way means that:
There will be a few changes needed, eg. a process to move data from hot to cold, updating the SQL query generators to query the right tables as applicable (and possibly fetching data from both tables, eg. visitor log requests spanning multiple days), etc.
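The query-generator change described above boils down to routing each report's date range to the hot table, the cold table, or both. A minimal sketch, assuming the 48-hour hot window and table names from this issue (the function itself is hypothetical, not Matomo code):

```python
from datetime import datetime, timedelta

def tables_for_range(start, end, now, hot_window=timedelta(hours=48)):
    """Return which log_visit tables a query over [start, end] must read,
    given that rows newer than `now - hot_window` live in the hot table."""
    cutoff = now - hot_window
    tables = []
    if start < cutoff:
        tables.append("log_visit_cold")  # range reaches into cold data
    if end >= cutoff:
        tables.append("log_visit")       # hot table keeps the original name
    return tables

now = datetime(2019, 8, 14, 12, 0, 0)
print(tables_for_range(datetime(2019, 8, 1), datetime(2019, 8, 5), now))    # cold only
print(tables_for_range(datetime(2019, 8, 13), datetime(2019, 8, 14), now))  # hot only
print(tables_for_range(datetime(2019, 8, 1), datetime(2019, 8, 14), now))   # both
```

The "both tables" case is exactly the visitor-log-over-multiple-days scenario mentioned above, where the generated SQL would need to combine results from the two tables.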