Support for ClickHouse as a storage backend #17697

alexey-milovidov · 2021-06-20T01:19:50Z

Summary

ClickHouse is the most natural choice for web analytics data.
This is distributed open source analytic DBMS with the focus on maximum query execution speed and storage efficiency.

It was initially created for Yandex Metrica - 2nd web analytics in the world processing ~100 billion records per day, 133 PiB of data in 120 trillion records in total.

Now ClickHouse used in countless applications including web analytics:

CloudFlare Web Analytics;
Plausible Analytics;
Microsoft Clarity;
Appsflyer;
Segment;
OWOX;

https://clickhouse.tech/docs/en/introduction/adopters/

ClickHouse can be easily installed, starting from single node and scale up to thousands of nodes.
It can be easily embedded into self-hosted products. Examples: Sentry, PMM, etc...

Findus23 · 2021-06-20T08:30:09Z

Hi,

There is already a bit of discussion about clickhouse here: #2592 and #7526

There are two major issues I see:

I am pretty sure the majority of Matomo users are not able to install Click house themselves or using Matomo in a setup where clickhouse is available
As clickhouse works fundamentally different to MySQL one would need to rewrite major parts of Matomo to be able to use it (which would be even more work than Provide Postgresql support for Piwik #500, so the arguments from there also remain). Even worse the existing methods would need to be kept the same and also maintained going forward due to above point.

The only way I can see a clickhouse integration even remotely possible in Matomo at the moment is by adding it as an optional plugin in Matomo that replaces one specific slow part of Matomo.

tsteur · 2021-06-20T20:16:41Z

We looked into it a while ago and it does under circumstances work quite differently. AFAIK there were for example these two points:

For these kind of DBs inserts of new records should usually be done in bulk for greatest performance and not each request individually how it happens currently
Updates/deletes are usually a problem meaning slow or not possible. This be mostly relevant for when users want to delete tracked data after a certain amount of time.

It would mostly mean a rewrite of the tracking and archiving (report generation) part for ClickHouse

MartinArens · 2021-08-15T10:48:16Z

+1

Clickhouse Insert Performance https://www.percona.com/blog/2020/07/27/clickhouse-and-columnstore-in-the-star-schema-benchmark/

tsteur · 2021-11-15T20:23:13Z

from #18318 by @RoyBellingan see #18318 (comment)

I think you are missing an interesting option, keep using mysql for ALL except the SLOW query...
I am working at the moment on speeding up the following query, that in my use case are the one that takes the majority of time which are like

https://gist.github.com/RoyBellingan/4391287fa5b88e97f7b79062c1b92274

So the table log_visit can ONLY be updated if visit_last_action_time is not older than SESSION_TIME (default of 30 minutes)
and of course log_action is append only (PLEASE correct me if I am wrong, but I logged the sql in mysql and I have never seen been done in other cases (also looking at the index, really looks like the case))

I just have to replicate those 2... for log_action is just a matter max(id), the other one is more involved, I already connected clickhouse to mysql https://clickhouse.com/docs/en/sql-reference/table-functions/mysql/ and I am already insert into mysql from clickhouse in other project.

Of course this is "pro user" thing only, and only if you are having massive amount of traffic...

I will update once I have a few more number and example ready to be analyzed.

@RoyBellingan be great to keep us updated how you go there and potentially also how you set it all up etc. If you manage to make this work I'd be happy to talk to you or exchange some mails etc.

heurteph-ei · 2021-11-16T13:51:56Z

Just a point to think about:
One of Matomo competitors (piwik pro) argument against Matomo is performance consideration: https://piwikpro.fr/blog/problemes-de-performance-de-matomo-les-causes-et-solutions-possibles/ (sorry it is in French)
They mentioned that using CH instead of MySQL helped them in enhance perf...

RoyBellingan · 2021-11-16T19:48:44Z

@heurteph-ei I can easily agree with them.
https://piwik.pro/blog/piwik-pro-clickhouse-faster-efficient-reports/

To make all of this happen, we’ve implemented an ETL process that every few minutes retrieves a new batch of data from MySQL, processes it asynchronously in smaller chunks and then loads the results to ClickHouse.

Which is the same as we do in other project ...

tsteur · 2021-11-16T19:51:03Z

@RoyBellingan how does this affect deleted raw data? Maybe this doesn't apply to you. Like if someone uses the log data retention feature for privacy regulations or if someone deletes data for a recorded visit because of GDPR deletion request etc?

RoyBellingan · 2021-11-16T20:24:15Z

@tsteur uhm, well this will require a more polished syncronization script.

Ch side does not really change much, maybe do not pass only the primary id (Ch has no concept of that), but also the date when happened (normal clustering key for timeseries).

Piwik I think should be a matter to execute a select id, date where xxx before doing delete where xxx

RoyBellingan · 2021-11-25T11:57:39Z

I will soon try out https://clickhouse.com/docs/en/engines/database-engines/materialized-mysql/
Is still marked as experimental, but I saw several blog post around where looks like is working fine.

carlgrundberg · 2021-12-14T10:53:42Z

@RoyBellingan Really interested in your experience with materialized mysql in CH.
We use CH for our own data in combination with Matomo, and right now we have some custom scripts moving the data from mysql to CH, but would be much nicer with a generic solution and to speed up the Matomo interface.

tsteur · 2021-12-17T06:36:26Z

Just reading https://pingcap.com/case-studies/8x-system-performance-boost-why-we-migrated-from-mysql-to-newsql-database

Anyone tried TiDB with Matomo before maybe? Might be MySQL compatible

GitHub: https://github.com/pingcap/tidb

Seeing on https://docs.pingcap.com/tidb/v4.0/mysql-compatibility not supported are

temporary tables (can be worked around using the enable_segments_cache=0 setting)
CREATE TABLE tblName AS SELECT stmt not sure if we use it, but if so it can be also disabled with enable_segments_cache=0 setting AFAIK
GET_LOCK and RELEASE_LOCK functions we made use of them again more recently but could easily replace it (actually there's already a branch for this https://github.com/matomo-org/matomo/compare/lockalt?expand=1 )
Few other challenges possibly though re AutoIncrementId if multiple servers are used.
Some minor select differences, not sure they impact us

Seems like this on paper could work without much of a change.

Seems writes may be slower though (on a single server https://www.percona.com/blog/2019/01/24/a-quick-look-into-tidb-performance-on-a-single-server/ )

RoyBellingan · 2021-12-18T01:40:00Z

@tsteur the Idea I got is that TiDb is good at running in parallel but inherently not so efficent, and also is strictly intended to be used for write once, read many times operation.
I would say try concat them ? Maybe Pingcap can do some test as is totally in their interest to expand their product use base ?

tsteur · 2021-12-20T21:16:17Z

@RoyBellingan emailed them. 👍 also went in touch with Timescale

weismannweb · 2023-01-15T04:05:25Z

Wouldn't doctrine make it easier? Totally not used it before but I thought it was an abstraction layer and could use the same abstracted code for differents databases for most of the code base.

Clickhouse Drivers Exist for Doctrine:
https://github.com/search?q=doctrine+clickhouse

Or this:
https://github.com/hisune/log2ck

RoyBellingan · 2023-01-15T14:27:18Z

(ATM the previous work is gone, so I can not really experiment in this side a lot.)

With the new clickhouse, since I think october 2021 is now possible to "spam" query, before you where supposed to aggregate them in batch, else lot of wasted space and bad write speed. So this should help a LOT

The way data need to be written in clickhouse is quite different, you must have ALL the relevant info in the row so many many column. Is basically prejoined.

Finally about the deletition of old log, what if... is just a simple "if you use the deleted log functionality, sorry no click house for you, yet"...

alexey-milovidov added the Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc. label Jun 20, 2021

RoyBellingan mentioned this issue Nov 15, 2021

Clickhouse V2 #18318

Closed

innocraft-automation added this to the For Prioritization milestone Jan 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for ClickHouse as a storage backend #17697

Support for ClickHouse as a storage backend #17697

alexey-milovidov commented Jun 20, 2021

Findus23 commented Jun 20, 2021 •

edited

tsteur commented Jun 20, 2021

MartinArens commented Aug 15, 2021

tsteur commented Nov 15, 2021

heurteph-ei commented Nov 16, 2021

RoyBellingan commented Nov 16, 2021

tsteur commented Nov 16, 2021

RoyBellingan commented Nov 16, 2021

RoyBellingan commented Nov 25, 2021

carlgrundberg commented Dec 14, 2021

tsteur commented Dec 17, 2021 •

edited

RoyBellingan commented Dec 18, 2021

tsteur commented Dec 20, 2021 •

edited

weismannweb commented Jan 15, 2023 •

edited

RoyBellingan commented Jan 15, 2023

Support for ClickHouse as a storage backend #17697

Support for ClickHouse as a storage backend #17697

Comments

alexey-milovidov commented Jun 20, 2021

Summary

Findus23 commented Jun 20, 2021 • edited

tsteur commented Jun 20, 2021

MartinArens commented Aug 15, 2021

tsteur commented Nov 15, 2021

heurteph-ei commented Nov 16, 2021

RoyBellingan commented Nov 16, 2021

tsteur commented Nov 16, 2021

RoyBellingan commented Nov 16, 2021

RoyBellingan commented Nov 25, 2021

carlgrundberg commented Dec 14, 2021

tsteur commented Dec 17, 2021 • edited

RoyBellingan commented Dec 18, 2021

tsteur commented Dec 20, 2021 • edited

weismannweb commented Jan 15, 2023 • edited

RoyBellingan commented Jan 15, 2023

Findus23 commented Jun 20, 2021 •

edited

tsteur commented Dec 17, 2021 •

edited

tsteur commented Dec 20, 2021 •

edited

weismannweb commented Jan 15, 2023 •

edited