ClickHouse is the most natural choice for web analytics data.
It is a distributed, open-source analytical DBMS focused on maximum query execution speed and storage efficiency.
It was initially created for Yandex Metrica, the second-largest web analytics service in the world, which processes ~100 billion records per day and stores 133 PiB of data across 120 trillion records in total.
Today ClickHouse is used in countless applications, including web analytics:
ClickHouse is easy to install, starting from a single node and scaling up to thousands of nodes.
It can also easily be embedded into self-hosted products, for example Sentry and PMM.
There are two major issues I see:
The only way I can see a ClickHouse integration being even remotely possible in Matomo at the moment is by adding it as an optional plugin that replaces one specific slow part of Matomo.
We looked into it a while ago, and under some circumstances it works quite differently. AFAIK there were, for example, these two points:
It would mostly mean a rewrite of the tracking and archiving (report generation) part for ClickHouse
ClickHouse insert performance: https://www.percona.com/blog/2020/07/27/clickhouse-and-columnstore-in-the-star-schema-benchmark/
I think you are missing an interesting option: keep using MySQL for EVERYTHING except the SLOW queries...
At the moment I am working on speeding up the following queries, which in my use case are the ones that take the majority of the time:
So the table log_visit can ONLY be updated while visit_last_action_time is not older than SESSION_TIME (default of 30 minutes),
and of course log_action is append-only (PLEASE correct me if I am wrong, but I logged the SQL in MySQL and have never seen it modified in any other case; looking at the indexes, that really seems to be the case).
I just have to replicate those two tables... For log_action it is just a matter of tracking max(id); the other one is more involved. I have already connected ClickHouse to MySQL (https://clickhouse.com/docs/en/sql-reference/table-functions/mysql/), and I am already inserting into MySQL from ClickHouse in another project.
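A minimal sketch of that append-only sync, using ClickHouse's `mysql()` table function (all table, column, host, and credential names here are hypothetical; Matomo's actual schema differs):

```sql
-- Hypothetical names: analytics.log_action is the ClickHouse target,
-- matomo.matomo_log_action the MySQL source, idaction an auto-increment id.
-- Pull only the rows newer than what ClickHouse already holds.
INSERT INTO analytics.log_action
SELECT *
FROM mysql('mysql-host:3306', 'matomo', 'matomo_log_action', 'user', 'password')
WHERE idaction > (SELECT max(idaction) FROM analytics.log_action);
```

Run from a cron job or similar, this keeps the append-only table in sync without touching MySQL beyond a single indexed range scan.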
Of course this is a "pro user" thing only, and only worthwhile if you have a massive amount of traffic...
I will update once I have a few more numbers and examples ready to be analyzed.
@RoyBellingan it would be great to keep us updated on how you go there, and potentially also on how you set it all up, etc. If you manage to make this work I'd be happy to talk to you or exchange some emails.
Just a point to think about:
One of Matomo's competitors (Piwik PRO) argues against Matomo on performance grounds: https://piwikpro.fr/blog/problemes-de-performance-de-matomo-les-causes-et-solutions-possibles/ (sorry, it is in French)
They mention that using ClickHouse instead of MySQL helped them improve performance...
@heurteph-ei I can easily agree with them.
To make all of this happen, we’ve implemented an ETL process that every few minutes retrieves a new batch of data from MySQL, processes it asynchronously in smaller chunks and then loads the results to ClickHouse.
Which is the same as we do in another project...
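One way to sketch such a periodic batch load for the mutable log_visit table, under the 30-minute session window discussed above, is a ReplacingMergeTree target that keeps the newest version of each visit (names and the 35-minute margin are my assumptions, not Matomo's schema):

```sql
-- Hypothetical target table; only two columns shown for brevity.
-- ReplacingMergeTree collapses rows with the same ORDER BY key,
-- keeping the one with the highest "version" column.
CREATE TABLE analytics.log_visit
(
    idvisit UInt64,
    visit_last_action_time DateTime
)
ENGINE = ReplacingMergeTree(visit_last_action_time)
ORDER BY idvisit;

-- Every few minutes: re-pull every visit that may still change,
-- with a small margin over the 30-minute session window. Older
-- duplicates are merged away in the background.
INSERT INTO analytics.log_visit
SELECT idvisit, visit_last_action_time
FROM mysql('mysql-host:3306', 'matomo', 'matomo_log_visit', 'user', 'password')
WHERE visit_last_action_time >= now() - INTERVAL 35 MINUTE;
```

Queries that must see exactly one row per visit before merges have run can use `SELECT ... FINAL`, at some cost in speed.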
@RoyBellingan how does this affect deleted raw data? Maybe this doesn't apply to you. Like if someone uses the log data retention feature for privacy regulations or if someone deletes data for a recorded visit because of GDPR deletion request etc?
@tsteur uhm, well, this will require a more polished synchronization script.
On the ClickHouse side not much changes; perhaps pass not only the primary id (ClickHouse has no real concept of one), but also the date when it happened (the usual clustering key for time series).
On the Piwik side, I think it should just be a matter of executing a `SELECT id, date WHERE xxx` before doing the `DELETE WHERE xxx`.
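For illustration, the ClickHouse side of such a deletion could then be a targeted mutation (the ids and date range below are made-up example values resolved beforehand on the MySQL side):

```sql
-- Delete the visits resolved on the MySQL side. Including the date
-- range lets ClickHouse narrow the mutation to the relevant data
-- instead of rewriting everything.
ALTER TABLE analytics.log_visit
DELETE WHERE idvisit IN (123, 456)
  AND visit_last_action_time >= toDateTime('2021-06-01 00:00:00')
  AND visit_last_action_time <  toDateTime('2021-07-01 00:00:00');
```

Note that `ALTER TABLE ... DELETE` is an asynchronous mutation in ClickHouse, so it is suited to occasional GDPR-style deletions rather than frequent row-level changes.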
I will soon try out https://clickhouse.com/docs/en/engines/database-engines/materialized-mysql/
It is still marked as experimental, but I have seen several blog posts around where it looks like it is working fine.
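For reference, setting up that engine looks roughly like this (names per the ClickHouse docs at the time of writing; since the engine is experimental, the setting and syntax may change):

```sql
-- Must be enabled explicitly while the engine is experimental.
SET allow_experimental_database_materialized_mysql = 1;

-- Replicates the whole MySQL database via the binlog, applying
-- inserts, updates, and deletes automatically.
CREATE DATABASE matomo_replica
ENGINE = MaterializedMySQL('mysql-host:3306', 'matomo', 'user', 'password');
```

This would sidestep the hand-written sync script entirely, including the deletion problem, at the cost of depending on an experimental feature.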