In order to improve the overall performance of Matomo (both at Tracker and Reports processing), an idea we have is to split the log_* Tracker tables into two sets of tables: hot data vs cold data.
For example, hot data would be data for the last 48 hours or so, and cold data would be any data older than 48 hours. As a result, the "hot data" table could be 100 times smaller than the cold data table (depending on data retention settings, see https://matomo.org/docs/managing-your-databases-size/ ).
Having the tables split in this way means that:
- A few changes will be needed, e.g. a process to move data from hot to cold (see the sketch after this list), and the SQL query generators need to be updated to query the right tables as applicable (possibly reading from both tables, e.g. for Visitor Log requests spanning multiple days), etc.
- It will also be beneficial when there are DB schema updates on the log tables.
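As a rough illustration of the hot-to-cold move, a periodic job could copy data older than the hot window into the cold table and then delete it from the hot table. This is only a sketch; the table names `log_visit_hot` / `log_visit_cold` and the 48-hour cut-off are assumptions, not an agreed design:

```sql
-- Sketch only: assumes log_visit_hot / log_visit_cold tables and a 48h hot window.
-- Copy visits whose last action is older than the hot window into the cold table...
INSERT INTO log_visit_cold
SELECT * FROM log_visit_hot
WHERE visit_last_action_time < NOW() - INTERVAL 48 HOUR;

-- ...then remove them from the hot table (in practice this would be batched and
-- guarded so a failure between the two statements cannot lose data).
DELETE FROM log_visit_hot
WHERE visit_last_action_time < NOW() - INTERVAL 48 HOUR;
```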
Before working on this, or as part of this issue, it may be good to check whether horizontal MySQL partitioning can help, e.g.
PARTITION BY HASH( MONTH(visit_last_action_time) ).
I think in most queries where performance matters we already filter on server_time, but that would need to be checked.
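For reference, the typical shape of such a query is a date-bounded aggregation over log_link_visit_action; the exact columns below are illustrative, but the idsite + server_time range filter is the part that would allow the query to target the hot table (or a single partition):

```sql
-- Illustrative only: the idsite + server_time range filter is what matters
-- for routing the query to the hot table or to the right partition.
SELECT COUNT(*)
FROM log_link_visit_action
WHERE idsite = 12
  AND server_time >= '2019-08-13 00:00:00'
  AND server_time <  '2019-08-14 00:00:00';
```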
It would be great to maybe spend 2-4 days on this project and see what is possible to achieve. Making this change in the database would greatly increase the performance of most SQL queries executed by Matomo (and also result in less CPU usage overall / better efficiency).
BTW: MySQL partitions would have been great. For example we could have done
PARTITION BY HASH(YEAR(time) * 12 + MONTH(time)) PARTITIONS 24; which would have put the data for each month in a separate partition, e.g. when using a data retention of 24 months. But we cannot use any of the time columns for partitioning, since such a column would need to be present in all unique indexes, including the primary key.
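To make the limitation concrete: MySQL requires every column used in the partitioning expression to be part of every unique key of the table, and log_visit's primary key is (idvisit) only, so a time-based scheme is rejected. A rough illustration (the exact error text may vary by MySQL version):

```sql
-- log_visit's primary key (idvisit) does not contain the time column,
-- so MySQL rejects a time-based partitioning scheme:
ALTER TABLE log_visit
  PARTITION BY HASH(YEAR(visit_last_action_time) * 12 + MONTH(visit_last_action_time))
  PARTITIONS 24;
-- ERROR 1503 (HY000): A PRIMARY KEY must include all columns
-- in the table's partitioning function
```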
I can think of many possible solutions here. Many of them have quite a few downsides or are not easy to develop, so I will only mention some. Partitioning is not really an option, as we have already seen. Two of them I would like to mention in more detail.
One option is to split the log tables by period, similar to the archive tables. As MySQL permits up to 4 billion tables, we could even do this per day.
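A sketch of what that could look like, following the naming convention of the archive tables (archive_numeric_YYYY_MM / archive_blob_YYYY_MM); the per-month log table names below are hypothetical:

```sql
-- Hypothetical per-month log tables, mirroring how archive tables are split by month:
CREATE TABLE log_visit_2019_08 LIKE log_visit;
CREATE TABLE log_link_visit_action_2019_08 LIKE log_link_visit_action;

-- A query for a given date range would then only touch the table(s) for the
-- months it covers, instead of one table holding years of data.
```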
If queries could use idvisit IN (segment query), then this could actually be a game changer.
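A hedged sketch of that query shape, assuming a segment condition evaluated on log_link_visit_action (the specific columns and values are illustrative):

```sql
-- Illustrative only: the segment is evaluated once as a subquery producing idvisit
-- values, and the outer query then only touches the matching visits.
SELECT *
FROM log_visit
WHERE idsite = 12
  AND idvisit IN (
    SELECT idvisit
    FROM log_link_visit_action
    WHERE idsite = 12
      AND server_time >= '2019-08-13 00:00:00'
      AND server_time <  '2019-08-14 00:00:00'
  );
```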
Before implementing the hot / cold tables, we want to validate whether it actually makes archiving faster. E.g. if the hot table stores data for the last 3 days, the question is: does it become faster to archive today and yesterday when using the hot tables, or not?
It's clear that a hot table will make tracking faster, since there will be fewer indexes and we can have different indexes on the hot vs the cold table. The inserts and updates will be faster for sure.
Another advantage of the hot table is that schema updates on the hot table will be much faster and will barely affect tracking, as there will only be, say, 3 days worth of data instead of 1, 2, 3, 4 or 5 years or more of data. This makes schema changes a lot faster. Updating the schema on the cold table could run in the background, and we would simply pause copying data over from hot to cold until that schema update is done.
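A rough sketch of that schema-update sequence, assuming a log_visit_hot table alongside the existing log_visit as the cold table; the new column name is made up for illustration:

```sql
-- 1. Run the slow schema change on the large cold table in the background;
--    tracking keeps writing into the small hot table in the meantime.
ALTER TABLE log_visit ADD COLUMN example_new_column TINYINT(1) DEFAULT 0;  -- hypothetical column

-- 2. Pause the hot-to-cold copy job until step 1 has finished.

-- 3. Apply the same change to the hot table; with only ~3 days of data this is quick.
ALTER TABLE log_visit_hot ADD COLUMN example_new_column TINYINT(1) DEFAULT 0;

-- 4. Resume copying data from hot to cold.
```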
Tests show it's also really fast to copy data over from the hot to the cold table.
To test archiving performance we could do things like this:
```sql
-- Create hot copies of the log tables with the same schema:
CREATE TABLE log_link_visit_action_hot LIKE log_link_visit_action;
CREATE TABLE log_visit_hot LIKE log_visit;

-- Fill the hot tables with the most recent data only:
INSERT INTO log_visit_hot
SELECT * FROM log_visit
WHERE idsite IN (...) AND visit_last_action_time >= '2019-08-12 00:00:00';

INSERT INTO log_link_visit_action_hot
SELECT * FROM log_link_visit_action
WHERE idsite IN (...) AND server_time >= '2019-08-12 00:00:00';

-- Run the same archiving-style query against the hot tables...
SELECT SQL_NO_CACHE COUNT(*)
FROM log_visit_hot
LEFT JOIN log_link_visit_action_hot ON log_visit_hot.idvisit = log_link_visit_action_hot.idvisit
WHERE log_visit_hot.idsite = 12
  AND visit_last_action_time >= '2019-08-13 00:00:00'
  AND visit_last_action_time <  '2019-08-14 00:00:00'
  AND time_spent_ref_action > 10;

-- ...and against the full tables, and compare the timings:
SELECT SQL_NO_CACHE COUNT(*)
FROM log_visit
LEFT JOIN log_link_visit_action ON log_visit.idvisit = log_link_visit_action.idvisit
WHERE log_visit.idsite = 12
  AND visit_last_action_time >= '2019-08-13 00:00:00'
  AND visit_last_action_time <  '2019-08-14 00:00:00'
  AND time_spent_ref_action > 10;

-- And at the end:
DROP TABLE log_visit_hot;
DROP TABLE log_link_visit_action_hot;
```
The current challenge is that whenever we want to retrieve detailed transaction data, like the segmented Visitor Log for a given event action/category/name etc., the system has to scan through hundreds of millions of records in the log_link_visit_action table, and it is quite a burden on MySQL to go through all of them one by one.
Instead, having multiple smaller tables (probably split by visit action date or some other attribute) would make database maintenance much easier and the overall system faster.
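To illustrate the difference, a segmented Visitor Log lookup for a given event currently has to filter the single large table; with date-scoped tables (hypothetical names below, following the per-month idea above) the same lookup only touches the rows for the requested period:

```sql
-- Today: one large table holding years of data has to be scanned/filtered.
SELECT idvisit
FROM log_link_visit_action
WHERE idsite = 12
  AND idaction_event_action = 123   -- hypothetical event action id
  AND server_time >= '2019-08-13 00:00:00'
  AND server_time <  '2019-08-14 00:00:00';

-- With per-period tables, the same query would run against a much smaller table:
SELECT idvisit
FROM log_link_visit_action_2019_08  -- hypothetical per-month table
WHERE idsite = 12
  AND idaction_event_action = 123
  AND server_time >= '2019-08-13 00:00:00'
  AND server_time <  '2019-08-14 00:00:00';
```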