Big Data Cardinality: fast unique visitor counting for large instances using HyperLogLog #6212

diosmosis · 2014-09-14T18:42:28Z

HyperLogLog is an algorithm that provides roughly 97% accuracy on cardinality counts (i.e., unique visitors) while using a very small memory footprint. We could add support for this algorithm in Piwik, both to provide fast cardinality estimates for large numbers of visits and to provide a unique visitor count for any period. HyperLogLog results can themselves be aggregated, so they can be stored in archive tables as blobs.
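To make the idea concrete, here is a minimal Python sketch of a HyperLogLog counter. It is illustrative only and is not Piwik code: the `HyperLogLog` class, the `p` parameter, and the choice of `blake2b` as the hash are assumptions made for this example. It assumes `p >= 7`, since the bias constant used below is the one for 128 or more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p: int = 9):
        self.p = p
        self.m = 1 << p                      # p=9 gives 512 one-byte registers
        self.registers = bytearray(self.m)

    def add(self, value: str) -> None:
        # 64-bit hash of the value; blake2b is an arbitrary choice for this sketch.
        digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)                 # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64-p bits
        # 1-based position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self) -> int:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting on empty registers).
            return round(m * math.log(m / zeros))
        return round(raw)


if __name__ == "__main__":
    hll = HyperLogLog(p=9)
    for i in range(100_000):
        hll.add(f"visitor-{i}")
    print(hll.count())  # an estimate of 100000, not the exact value
```

The memory footprint is fixed by `p` (512 bytes here) no matter how many visitors are added, and adding the same visitor twice never changes the result.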

More info:

mattab · 2014-09-17T09:08:25Z

I also heard of Bloom filters doing a good job: https://en.wikipedia.org/wiki/Bloom_filter

diosmosis · 2014-09-17T15:10:16Z

According to the second article, Bloom filters take up more space (though they provide more accuracy), and the results cannot be merged, so the process cannot be run in parallel across several machines.

mattab · 2015-03-24T21:34:12Z

btw there is now a HyperLogLog data structure in Redis: http://antirez.com/news/75

update 2021

simple explanation of how it works https://stackoverflow.com/a/12734343/3759928

also from https://segment.com/blog/scaling-up-reporting-on-high-cardinality-metrics/

MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora, or RDS.
PostgreSQL, on the other hand, has an extension called postgresql-hll, which is available on PostgreSQL RDS.

diosmosis · 2015-09-25T07:20:45Z

Apparently possible to implement hyperloglog in pure SQL: https://www.periscope.io/blog/hyperloglog-in-pure-sql.html

diosmosis · 2021-05-23T15:18:53Z

We recently ran some tests with HyperLogLog in Matomo as a replacement for COUNT(DISTINCT idvisitor) and found that it was far more efficient in every case.

Details about the test:

The test ran HyperLogLog in pure SQL against COUNT(DISTINCT idvisitor) on periods ranging from 700 to 50,000,000 distinct visitors. Tests were run on day periods and month periods only.
HyperLogLog is a probabilistic counter, so it is not as accurate, but it lets you control the error rate. We used a 5% error rate for the test; lower error rates would require more memory/storage. (For a 5% error rate, we need 512 rows in memory, or, if we store intermediate data in the archive tables in the database, a datatable with 512 rows in a blob row.)
We checked how accurate the results were along with how much faster they were.
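The 512-row figure is consistent with the usual HyperLogLog rule of thumb that the standard error is about 1.04 / sqrt(m) for m registers. The short Python sketch below applies that rule; the function names are hypothetical and exist only for this example, and it assumes register counts are powers of two.

```python
import math


def hll_relative_error(registers: int) -> float:
    """Approximate standard error of a HyperLogLog estimate: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(registers)


def registers_for_error(target: float) -> int:
    """Smallest power-of-two register count whose standard error is <= target."""
    m = 16
    while hll_relative_error(m) > target:
        m *= 2
    return m


if __name__ == "__main__":
    for target in (0.05, 0.02, 0.01):
        m = registers_for_error(target)
        print(f"{target:.0%} target: {m} registers, error {hll_relative_error(m):.2%}")
```

Under this approximation a 5% target needs 512 registers, a 2% target needs 4096, and a 1% target needs 16384. The standard error describes a typical deviation rather than a hard bound, so an occasional result outside the target is expected.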

Test results:

For smaller sets of distinct idvisitors (ranging from 700 to 500,000), HyperLogLog was always faster, by 50% to 90%.
For larger sets of distinct idvisitors (ranging from 14 million to 50 million), HyperLogLog was faster, but only by around 20%. This may be related to the overall number of rows that had to be traversed.
The error rate was usually below 1%; in the remaining tests it ranged from 1% to 6.5%. Only one of our tests had an error rate above 5%.

Conclusions:

DISTINCT queries are used in many archiving queries. Since HyperLogLog shows performance boosts for every period, including day periods, it's conceivable that using hyperloglog would have a dramatic effect on archiving performance.
HyperLogLog intermediate results can be saved. We could store them in the blob table for day periods and simply aggregate them for higher periods. Higher periods would then not need to query the log tables at all, which would make querying for distinct visitors immediate.
Having a 2% error rate means using 4096 rows in memory or in a datatable. For instances where hyperloglog would be useful, this is likely very doable. 1% means 16384 rows, which might also be doable depending on the user.
For some visit counts/action counts, we don't need to do hyperloglog and could enable it conditionally.
When HyperLogLog is used, we would have to change 'Unique Visitors' or 'Unique X' to 'Estimated Unique X', with metric documentation that explains what it is. We might also want a setting to choose whether to enable it only for higher periods or also for day periods.
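The idea of saving day-period results and aggregating them for higher periods can be sketched in Python as follows. This is not how Matomo stores archives: the blob layout (one byte per register, 512 registers), the function names, and the `blake2b` hash are all assumptions made for illustration. The key property is that merging is an element-wise maximum over registers, so no log data is needed to combine days.

```python
import hashlib
import math

P = 9            # 2**9 = 512 registers, the size quoted for a ~5% error rate
M = 1 << P


def day_blob(visitor_ids) -> bytes:
    """Build one day's register array and return it as a 512-byte blob."""
    regs = bytearray(M)
    for vid in visitor_ids:
        digest = hashlib.blake2b(vid.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - P)                        # first P bits select a register
        rest = h & ((1 << (64 - P)) - 1)           # remaining bits
        rank = (64 - P) - rest.bit_length() + 1    # position of leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return bytes(regs)


def merge_blobs(blobs) -> bytes:
    """Union of several sketches: element-wise maximum of their registers."""
    return bytes(max(column) for column in zip(*blobs))


def estimate(blob: bytes) -> int:
    """Estimated number of distinct values represented by a register blob."""
    m = len(blob)
    raw = (0.7213 / (1 + 1.079 / m)) * m * m / sum(2.0 ** -r for r in blob)
    zeros = blob.count(0)
    if raw <= 2.5 * m and zeros:
        return round(m * math.log(m / zeros))      # small-range correction
    return round(raw)


if __name__ == "__main__":
    day1 = day_blob(f"visitor-{i}" for i in range(0, 6000))
    day2 = day_blob(f"visitor-{i}" for i in range(3000, 9000))
    month = merge_blobs([day1, day2])
    print(estimate(day1), estimate(day2), estimate(month))
```

Visitors who appear on both days are counted once in the merged result, which is why the merged blob is identical to a blob built directly from the union of both days. This is also why summing daily unique visitor counts would not match the merged estimate.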

All of this depends on whether large users would want exact numbers or a close enough estimate, but it might make Matomo usable for very high traffic websites.

tsteur · 2021-05-23T20:06:36Z

Great findings @diosmosis 💯

From a UI perspective, the biggest challenge when enabling this will be making it clear to users when exact numbers are shown and when estimates are shown. If someone were to enable these estimates, we'd need to change the metric/column name to "Estimated Unique Visitors" and update the description to mention the estimate and its error rate, so that users don't get confused or report it as an error that aggregating the unique visitors from lower periods doesn't give the same number.

mattab added this to the Long term milestone Sep 17, 2014

mattab added the Task and c: Performance labels Sep 17, 2014

mattab modified the milestones: Long term, Mid term Dec 23, 2015

mattab added Lower priority and removed Lower priority labels Dec 5, 2016

mattab modified the milestones: Long term, Mid term Dec 5, 2016

mattab removed the Lower priority label Sep 13, 2018

sgiehl mentioned this issue Feb 9, 2024 in "[Bug] Archiving monthly data very slow at the end of the month" #21906 (closed)
