@alexey-milovidov opened this Issue on June 20th 2021

Summary

ClickHouse is the most natural choice for web analytics data.
It is a distributed, open-source analytic DBMS with a focus on maximum query execution speed and storage efficiency.

It was initially created for Yandex Metrica, the 2nd-largest web analytics service in the world, processing ~100 billion records per day and storing 133 PiB of data in 120 trillion records in total.

ClickHouse is now used in countless applications, including web analytics:

  • CloudFlare Web Analytics;
  • Plausible Analytics;
  • Microsoft Clarity;
  • Appsflyer;
  • Segment;
  • OWOX.

https://clickhouse.tech/docs/en/introduction/adopters/

ClickHouse can be easily installed, starting from a single node and scaling up to thousands of nodes.
It can also be easily embedded into self-hosted products. Examples: Sentry, PMM, etc.

@Findus23 commented on June 20th 2021 Member

Hi,

There is already a bit of discussion about clickhouse here: https://github.com/matomo-org/matomo/issues/2592 and https://github.com/matomo-org/matomo/issues/7526

There are two major issues I see:

  • I am pretty sure the majority of Matomo users are not able to install ClickHouse themselves or to use Matomo in a setup where ClickHouse is available
  • As ClickHouse works fundamentally differently from MySQL, one would need to rewrite major parts of Matomo to be able to use it (which would be even more work than https://github.com/matomo-org/matomo/issues/500, so the arguments from there also remain). Even worse, the existing methods would need to be kept the same and also maintained going forward, due to the above point.

The only way I can see a ClickHouse integration being even remotely possible in Matomo at the moment is by adding it as an optional plugin that replaces one specific slow part of Matomo.

@tsteur commented on June 20th 2021 Member

We looked into it a while ago, and it does, under some circumstances, work quite differently. AFAIK there were, for example, these two points:

  • For this kind of DB, inserts of new records should usually be done in bulk for greatest performance, not individually per request as happens currently
  • Updates/deletes are usually a problem, meaning slow or not possible. This would be mostly relevant when users want to delete tracked data after a certain amount of time.

It would mostly mean a rewrite of the tracking and archiving (report generation) parts for ClickHouse.
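The bulk-insert point above can be sketched as a small write buffer: tracking requests are appended one at a time, but only flushed to the database once a row-count or time threshold is reached. This is a minimal illustration of the pattern, not Matomo or ClickHouse code; `flush_fn` is a hypothetical callback that would perform the actual bulk INSERT, and the thresholds are made-up defaults.

```python
import time

class BatchWriter:
    """Buffers individual rows and flushes them in bulk.

    Columnar stores like ClickHouse perform best with large batch
    inserts, so instead of issuing one INSERT per tracking request,
    rows are collected and written together. `flush_fn` is a
    hypothetical callback standing in for the real bulk INSERT.
    """

    def __init__(self, flush_fn, max_rows=10000, max_seconds=5.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        # Flush when the buffer is full or has been sitting too long.
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.last_flush >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)  # e.g. one bulk INSERT of all rows
            self.rows = []
        self.last_flush = time.monotonic()
```

For the deletion point, ClickHouse's usual answer is a table-level TTL (rows expire after a retention period) rather than row-by-row DELETEs, which maps reasonably well onto "delete tracked data after a certain amount of time" but not onto arbitrary per-visitor deletes.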
