@MichaelHeerklotz opened this Issue on September 10th 2019 Contributor

While doing some research on the log_action table, I discovered that the chance of a collision of a crc32 hash is pretty high:

https://preshing.com/20110504/hash-collision-probabilities/ (see at the bottom "Small Collision Probabilities")

In my setup we have about 182 million rows in log_action which means the chance of a collision is already way higher than 1 out of 2 if I am not mistaken.

So it seems like it is important that sql queries always match on hash and name.
Maybe this information should be added to the documentation somewhere (https://developer.matomo.org/guides/persistence-and-the-mysql-backend)

Maybe it also makes sense to add some info about always querying for type, hash and name at the same time so that the index can be used.

@tsteur commented on September 10th 2019 Member

I think it's more showing that there's eg a 50% chance that you have a collision in general after 80K rows being in there. Which is fine though in this case. It is definitely expected to have a lot of collisions in the table. It should still be quite fast. You can do a group by hash with a count() order by count() desc to get a feel for it how often an individual hash is used max.

This Issue was closed on September 10th 2019
Powered by GitHub Issue Mirror