Support for ClickHouse as a storage backend #17697
Hi, there is already a bit of discussion about ClickHouse here: #2592 and #7526. There are two major issues I see:
The only way I can see a ClickHouse integration being even remotely possible at the moment is by adding it as an optional plugin that replaces one specific slow part of Matomo.
We looked into it a while ago, and under some circumstances it works quite differently. AFAIK there were, for example, these two points:
It would mostly mean a rewrite of the tracking and archiving (report generation) parts for ClickHouse.
+1. ClickHouse insert performance: https://www.percona.com/blog/2020/07/27/clickhouse-and-columnstore-in-the-star-schema-benchmark/
From #18318 by @RoyBellingan; see #18318 (comment).
@RoyBellingan it would be great if you kept us updated on how it goes, and potentially also on how you set it all up. If you manage to make this work, I'd be happy to talk to you or exchange some emails.
Just a point to think about:
@heurteph-ei I can easily agree with them. To make all of this happen, we've implemented an ETL process that every few minutes retrieves a new batch of data from MySQL, processes it asynchronously in smaller chunks, and then loads the results into ClickHouse. This is the same approach we use in another project ...
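The batched ETL loop described above could be sketched roughly like this. The `fetch_batch`/`load_batch` callables, the `id` watermark column, and the chunk size are all hypothetical stand-ins for whatever the real pipeline uses; only the chunk-and-load shape of the loop comes from the comment:

```python
from typing import Callable, Iterable, List

def chunked(rows: List[dict], size: int) -> Iterable[List[dict]]:
    """Split one MySQL batch into smaller chunks for asynchronous processing."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def run_etl_cycle(
    fetch_batch: Callable[[int], List[dict]],   # hypothetical: read rows from MySQL newer than the watermark
    load_batch: Callable[[List[dict]], None],   # hypothetical: bulk-insert one chunk into ClickHouse
    watermark: int,
    chunk_size: int = 10_000,
) -> int:
    """One ETL cycle, run every few minutes: fetch rows added since `watermark`,
    load them in chunks, and return the new watermark (highest id seen)."""
    rows = fetch_batch(watermark)
    for chunk in chunked(rows, chunk_size):
        load_batch(chunk)
    return max((r["id"] for r in rows), default=watermark)
```

A scheduler (cron, a systemd timer, etc.) would call `run_etl_cycle` periodically, persisting the returned watermark between runs.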
@RoyBellingan how does this affect deleted raw data? Maybe this doesn't apply to you. For example, if someone uses the log data retention feature for privacy regulations, or if someone deletes data for a recorded visit because of a GDPR deletion request, etc.?
@tsteur uhm, well, this will require a more polished synchronization script. The ClickHouse side does not really change much; maybe don't pass only the primary id (ClickHouse has no concept of that) but also the date when it happened (the normal clustering key for time series). On the Piwik side, I think it should just be a matter of executing a `SELECT id, date WHERE xxx` before doing the `DELETE WHERE xxx`.
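The select-before-delete idea above could be mirrored on the ClickHouse side with a mutation built from the captured (id, date) pairs. The table and column names below are hypothetical (loosely modeled on Matomo's visit log); `ALTER TABLE ... DELETE` is ClickHouse's standard mutation syntax for deleting rows, and including the date lets it prune partitions:

```python
def build_clickhouse_delete(table: str, pairs: list) -> str:
    """Build a ClickHouse mutation deleting the rows that were just removed
    from MySQL, matching on (id, date) captured by the preceding SELECT.
    `table`, `visit_id`, and `visit_date` are illustrative names only."""
    tuples = ", ".join(f"({id_}, '{date}')" for id_, date in pairs)
    return f"ALTER TABLE {table} DELETE WHERE (visit_id, visit_date) IN ({tuples})"
```

The synchronization script would run the MySQL `SELECT`, perform the MySQL `DELETE`, then send the generated statement to ClickHouse; mutations there are asynchronous, which is acceptable for GDPR-style cleanup jobs.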
I will soon try out https://clickhouse.com/docs/en/engines/database-engines/materialized-mysql/ |
@RoyBellingan Really interested in your experience with MaterializedMySQL in ClickHouse.
Just reading https://pingcap.com/case-studies/8x-system-performance-boost-why-we-migrated-from-mysql-to-newsql-database — has anyone tried TiDB with Matomo before, maybe? It might be MySQL-compatible. GitHub: https://github.com/pingcap/tidb. Seeing on https://docs.pingcap.com/tidb/v4.0/mysql-compatibility that not supported are
On paper it seems like this could work without much of a change. Writes may be slower, though (on a single server: https://www.percona.com/blog/2019/01/24/a-quick-look-into-tidb-performance-on-a-single-server/).
@tsteur the idea I got is that TiDB is good at running in parallel but inherently not so efficient, and is also strictly intended for write-once, read-many operation.
@RoyBellingan emailed them. 👍 Also got in touch with Timescale.
Wouldn't Doctrine make it easier? I've never used it before, but I thought it was an abstraction layer, so the same abstracted code could be used with different databases for most of the code base. ClickHouse drivers exist for Doctrine: Or this:
(ATM the previous work is gone, so I can not really experiment on this side a lot.) With the new ClickHouse, since I think October 2021, it is now possible to "spam" queries; before, you were supposed to aggregate them into batches, or else you got a lot of wasted space and bad write speed. So this should help a LOT. The way data needs to be written in ClickHouse is quite different: you must have ALL the relevant info in the row, so many, many columns. It is basically prejoined. Finally, about the deletion of old logs: what if... it is just a simple "if you use the log deletion functionality, sorry, no ClickHouse for you, yet"?
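The "prejoined" point above means denormalizing at write time: instead of ClickHouse joining a visit table to an action table per query, each event is written as one wide row carrying both sets of fields. A minimal sketch, with field names that are purely illustrative (loosely inspired by Matomo's log_visit / log_link_visit_action tables, not their real schemas):

```python
def prejoin_event(visit: dict, action: dict) -> dict:
    """Denormalize one tracked action: merge visit-level and action-level
    fields into a single wide row, prefixed to avoid name clashes, so
    ClickHouse never has to join at query time. Field names are hypothetical."""
    row = {f"visit_{k}": v for k, v in visit.items()}
    row.update({f"action_{k}": v for k, v in action.items()})
    return row
```

The trade-off is classic column-store design: visit-level values are duplicated across every action row, but ClickHouse's per-column compression keeps that cheap, and queries stay single-table scans.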
Summary
ClickHouse is the most natural choice for web analytics data.
It is a distributed, open-source analytic DBMS with a focus on maximum query execution speed and storage efficiency.
It was initially created for Yandex Metrica, the 2nd-largest web analytics service in the world, processing ~100 billion records per day, with 133 PiB of data in 120 trillion records in total.
ClickHouse is now used in countless applications, including web analytics:
https://clickhouse.tech/docs/en/introduction/adopters/
ClickHouse can be easily installed, starting from a single node, and scales up to thousands of nodes.
It can be easily embedded into self-hosted products. Examples: Sentry, PMM, etc.