@diosmosis opened this Issue on September 14th 2014 Member

HyperLogLog is an algorithm that estimates cardinality counts (i.e., unique visitors) with roughly 97% accuracy while using a very small, fixed memory footprint. We could add support for this algorithm within Piwik, not only to provide fast cardinality estimates for large numbers of visits, but also to provide a unique visitor count for any period. HyperLogLog results can themselves be aggregated, so they can be stored in archive tables as blobs and merged later.
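To make the "results can be aggregated" point concrete, here is a minimal Python sketch of the idea. It is not Piwik code: the class name `HyperLogLog`, the parameter `p`, and the choice of SHA-1 as the hash are assumptions made for illustration, and the estimator only applies the standard small-range correction. Each sketch is a fixed array of registers, and merging two sketches is just a register-wise maximum, which is why per-day blobs could in principle be combined into a per-week or per-month unique count.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical helper, not a Piwik API)."""

    def __init__(self, p=12):
        self.p = p                      # number of hash bits used as register index
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = bytearray(self.m)

    def add(self, item):
        # Take a 64-bit hash of the item (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches: keep the larger value of each register pair.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def count(self):
        # Raw estimate from the harmonic mean of 2^register values.
        alpha = 0.7213 / (1 + 1.079 / self.m)       # constant valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            raw = self.m * math.log(self.m / zeros)
        return int(round(raw))


# Two "daily" sketches with overlapping visitors, merged into one period.
day1, day2 = HyperLogLog(), HyperLogLog()
for i in range(10000):
    day1.add(f"visitor-{i}")
for i in range(5000, 15000):
    day2.add(f"visitor-{i}")
both_days = day1.merge(day2)
print(both_days.count())  # an estimate of the 15000 distinct visitors
```

With `p=12` the registers take 4096 bytes regardless of how many visitors are added, and the expected relative error is about 1.04 / sqrt(4096), or roughly 1.6%. A smaller `p` trades accuracy for space.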

More info:

@mattab commented on September 17th 2014 Member

I also heard of Bloom filters doing a good job: https://en.wikipedia.org/wiki/Bloom_filter

@diosmosis commented on September 17th 2014 Member

According to the second article, Bloom filters will take up more space (though they provide more accuracy), and the results cannot be merged, so the process cannot be done in parallel across several machines.

@mattab commented on March 24th 2015 Member

btw there is now a HyperLogLog data structure in Redis: http://antirez.com/news/75

update 2021

simple explanation of how it works https://stackoverflow.com/a/12734343/3759928

also from https://segment.com/blog/scaling-up-reporting-on-high-cardinality-metrics/

MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora, or RDS.
PostgreSQL, on the other hand, has an extension called postgresql-hll, which is available on PostgreSQL RDS.

@diosmosis commented on September 25th 2015 Member

Apparently it is possible to implement HyperLogLog in pure SQL: https://www.periscope.io/blog/hyperloglog-in-pure-sql.html
