@tsteur opened this Issue on March 3rd 2021 Member

refs https://github.com/matomo-org/matomo/issues/17050 which we're currently still discussing and looking into.

In any case I'm thinking to fix the segment hash problem hopefully once and for all I think we should introduce a new column to the segment DB table that stores the segment hash that will be used when fetching archives etc. This will also make it future proof if we ever change the segment definition and/or if we change the calculation for hashes etc without needing to update any archives etc.

We would basically need an update script that sets the correct hashes for any existing segment. And then adjust https://github.com/matomo-org/matomo/blob/4.2.1/core/Segment.php#L464 to try and find a matching hash from the segment DB table. We might need to encode / decode a segment definition to see if we find a match in the table.

@diosmosis do you think that would be useful? If so I reckon we could start working on this already as it's independent of a decision for #17050 which will probably still need a while (@mattab and I already quickly talked about it but need more time)

@diosmosis commented on March 4th 2021 Member

We would still need to do the urlencode($definition), just only when adding the column value, I guess? Although segments can also be specified in INI config and I don't know if we ever specified how they should be encoded.

@tsteur commented on March 4th 2021 Member

I don't know if there's any issue with that but would assume these segments should be encoded the same way maybe as usual but because it's not sent along a URL the outer URL encode would not be needed. Generally, if it's a real time segment or a segment from the config then I suppose we'd need to fallback to the regular hash calculation since there be no other choice pretty much.

We would still need to do the urlencode($definition), just only when adding the column value, I guess?

You mean during the update/ migration? I assume we'd need to do the urlencode when adding this column from what I understand to make sure we end up having the correct segment hash value. We could write a little test script that checks if this works for some production DBs. Like we could calculate the hash for each segment in memory, and then check if there are any archive tables with that "done" flag etc to make sure we calculate the hash correctly. Not sure if that would help.

Then I was hoping something like https://github.com/matomo-org/matomo/pull/17029#issuecomment-761913174 would work

@diosmosis commented on March 5th 2021 Member

Generally, if it's a real time segment

If it's a custom segment it won't be in the table. If it's real time, and it's in the table then we'd use the hash (to my understanding). The config ones are weird since no one specified exactly how to specify them. I guess we can check if there's an operator in it that isn't encoded (like, =) and if so, encode, otherwise don't.

You mean during the update/ migration?

Yes and when adding new segments to the table.

This solution seems a bit like taking on tech debt, but it's also very low risk, so it's probably the best short term solution :+1:

@tsteur commented on March 5th 2021 Member

This solution seems a bit like taking on tech debt, but it's also very low risk, so it's probably the best short term solution 👍

I'm thinking it's actually the best long term solution since you want to be able to change the algorithm for how to calculate the hash at any time (eg different segment string or do it based on parsed segment expression). There is of course still the urlencode/urldecode thing but that's a different problem which we eventually will hopefully tackle with #17050 .

@diosmosis commented on March 5th 2021 Member

Well changing the algorithm still means you have to deal w/ old hashes and new hashes, so I'm not sure it will really be aided by a hash table column, where we can put only one value.

@tsteur commented on March 10th 2021 Member

another good benefit is that when the hash is stored in the DB, and we see a query in the processlist, then we can map it more easily to a specific segment https://user-images.githubusercontent.com/273120/110702749-cee7de80-8257-11eb-9ec9-11a7d096fa06.png

@jorgeuos commented on March 15th 2021

Just to make sure, there is no way today to edit a segment definition and keep historical data in the same report?

Gonna follow this issue, because it's a highly requested feature from our users.

Br, Jorge

@tsteur commented on March 22nd 2021 Member

also another great benefit is that it would allow us to join archive_invalidations table and segment table for example (or archive and segment table) in MySQL and directly see which segment for example a scheduled invalidation is, when it was created etc.

@diosmosis commented on March 22nd 2021 Member

That would still require a like query, correct? Or some set of function calls since the name is in the done%.Plugin format. Or you mean going backwards from hash to definition? Yes that's definitely super useful :+1:

@tsteur commented on March 22nd 2021 Member

I haven't thought this through but was hoping that something like join... on concat('done', segment.hash) = name works. For plugin specific archives would maybe also need to add the specific plugin name 👍 never tested a like actually (be great if that worked)

@tsteur commented on March 29th 2021 Member

ideally when this PR is finished we can remove all of the urlencode in segment constructor here: https://github.com/matomo-org/matomo/pull/17029/files

@tsteur commented on April 19th 2021 Member

@flamisz was about to create a new issue for below comment but then realised it may be better to do it as part of this issue as it helps us validate/check if we fully achieved the wanted outcome. We could still close the issue later and create a separate issue for it if there's trouble.

Now that we moved the segment hash into the DB table in https://github.com/matomo-org/matomo/pull/17408 / https://github.com/matomo-org/matomo/issues/17301 it would be great to try and remove all the URL encoded segment constructor calls for the definition that were added in


There are multiple new Segment(urlencode($segmentDefinition)) calls in both PRs and we would expect that we can change this now everywhere to the regular `new Segment($segmentDefinition). This way we can kind of double check the fix for #17301 worked. The goal is for developers to not needing to know whether it needs to be urlencoded or not the segment definition since it's impossible to get this right. Instead we'd want the system (the segment:getHash method) to find the right hash. We would basically expect that after removing these urlencode that the tests still pass.

This Issue was closed on April 21st 2021
Powered by GitHub Issue Mirror