Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store the segment hash in the DB table #17301

Closed
tsteur opened this issue Mar 3, 2021 · 12 comments · Fixed by #17408 or #17475
Closed

Store the segment hash in the DB table #17301

tsteur opened this issue Mar 3, 2021 · 12 comments · Fixed by #17408 or #17475
Assignees
Labels
Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Milestone

Comments

@tsteur
Copy link
Member

tsteur commented Mar 3, 2021

refs #17050 which we're currently still discussing and looking into.

In any case I'm thinking to fix the segment hash problem hopefully once and for all I think we should introduce a new column to the segment DB table that stores the segment hash that will be used when fetching archives etc. This will also make it future proof if we ever change the segment definition and/or if we change the calculation for hashes etc without needing to update any archives etc.

We would basically need an update script that sets the correct hashes for any existing segment. And then adjust https://github.com/matomo-org/matomo/blob/4.2.1/core/Segment.php#L464 to try and find a matching hash from the segment DB table. We might need to encode / decode a segment definition to see if we find a match in the table.

@diosmosis do you think that would be useful? If so I reckon we could start working on this already as it's independent of a decision for #17050 which will probably still need a while (@mattab and I already quickly talked about it but need more time)

@tsteur tsteur added Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. RFC Indicates the issue is a request for comments where the author is looking for feedback. labels Mar 3, 2021
@tsteur tsteur added this to the 4.3.0 milestone Mar 3, 2021
@diosmosis
Copy link
Member

We would still need to do the urlencode($definition), just only when adding the column value, I guess? Although segments can also be specified in INI config and I don't know if we ever specified how they should be encoded.

@tsteur
Copy link
Member Author

tsteur commented Mar 4, 2021

I don't know if there's any issue with that but would assume these segments should be encoded the same way maybe as usual but because it's not sent along a URL the outer URL encode would not be needed. Generally, if it's a real time segment or a segment from the config then I suppose we'd need to fallback to the regular hash calculation since there be no other choice pretty much.

We would still need to do the urlencode($definition), just only when adding the column value, I guess?

You mean during the update/ migration? I assume we'd need to do the urlencode when adding this column from what I understand to make sure we end up having the correct segment hash value. We could write a little test script that checks if this works for some production DBs. Like we could calculate the hash for each segment in memory, and then check if there are any archive tables with that "done" flag etc to make sure we calculate the hash correctly. Not sure if that would help.

Then I was hoping something like #17029 (comment) would work

@diosmosis
Copy link
Member

Generally, if it's a real time segment

If it's a custom segment it won't be in the table. If it's real time, and it's in the table then we'd use the hash (to my understanding). The config ones are weird since no one specified exactly how to specify them. I guess we can check if there's an operator in it that isn't encoded (like, =) and if so, encode, otherwise don't.

You mean during the update/ migration?

Yes and when adding new segments to the table.

This solution seems a bit like taking on tech debt, but it's also very low risk, so it's probably the best short term solution 👍

@tsteur
Copy link
Member Author

tsteur commented Mar 5, 2021

This solution seems a bit like taking on tech debt, but it's also very low risk, so it's probably the best short term solution 👍

I'm thinking it's actually the best long term solution since you want to be able to change the algorithm for how to calculate the hash at any time (eg different segment string or do it based on parsed segment expression). There is of course still the urlencode/urldecode thing but that's a different problem which we eventually will hopefully tackle with #17050 .

@diosmosis
Copy link
Member

Well changing the algorithm still means you have to deal w/ old hashes and new hashes, so I'm not sure it will really be aided by a hash table column, where we can put only one value.

@tsteur
Copy link
Member Author

tsteur commented Mar 10, 2021

another good benefit is that when the hash is stored in the DB, and we see a query in the processlist, then we can map it more easily to a specific segment https://user-images.githubusercontent.com/273120/110702749-cee7de80-8257-11eb-9ec9-11a7d096fa06.png

@jorgeuos
Copy link

Just to make sure, there is no way today to edit a segment definition and keep historical data in the same report?

Gonna follow this issue, because it's a highly requested feature from our users.

Br, Jorge

@tsteur tsteur removed the RFC Indicates the issue is a request for comments where the author is looking for feedback. label Mar 18, 2021
@tsteur
Copy link
Member Author

tsteur commented Mar 22, 2021

also another great benefit is that it would allow us to join archive_invalidations table and segment table for example (or archive and segment table) in MySQL and directly see which segment for example a scheduled invalidation is, when it was created etc.

@diosmosis
Copy link
Member

That would still require a like query, correct? Or some set of function calls since the name is in the done%.Plugin format. Or you mean going backwards from hash to definition? Yes that's definitely super useful 👍

@tsteur
Copy link
Member Author

tsteur commented Mar 22, 2021

I haven't thought this through but was hoping that something like join... on concat('done', segment.hash) = name works. For plugin specific archives would maybe also need to add the specific plugin name 👍 never tested a like actually (be great if that worked)

@tsteur
Copy link
Member Author

tsteur commented Mar 29, 2021

ideally when this PR is finished we can remove all of the urlencode in segment constructor here: https://github.com/matomo-org/matomo/pull/17029/files

@flamisz flamisz self-assigned this Mar 30, 2021
@flamisz flamisz mentioned this issue Mar 30, 2021
10 tasks
@tsteur
Copy link
Member Author

tsteur commented Apr 19, 2021

@flamisz was about to create a new issue for below comment but then realised it may be better to do it as part of this issue as it helps us validate/check if we fully achieved the wanted outcome. We could still close the issue later and create a separate issue for it if there's trouble.

Now that we moved the segment hash into the DB table in #17408 / #17301 it would be great to try and remove all the URL encoded segment constructor calls for the definition that were added in

https://github.com/matomo-org/matomo/pull/17029/files#diff-f41c275e883996b8b69cb765b1818e12e82cbe034e648bc3f7d03610cc7abad1R888
https://github.com/matomo-org/matomo/pull/17408/files#diff-2f24f416893e2a39a02896c786a42a45d0fe00cc038d5e4563b162809d10a030R825
https://github.com/matomo-org/matomo/pull/17408/files#diff-62d637ff62761f4da5fb4456c6dd5da10c471c827be5baf64c60664c0ad3c720R578

There are multiple new Segment(urlencode($segmentDefinition)) calls in both PRs and we would expect that we can change this now everywhere to the regular ``new Segment($segmentDefinition). This way we can kind of double check the fix for #17301 worked. The goal is for developers to not needing to know whether it needs to be urlencoded or not the segment definition since it's impossible to get this right. Instead we'd want the system (the segment:getHash method) to find the right hash. We would basically expect that after removing these urlencode` that the tests still pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Projects
None yet
4 participants