New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise segment encoding strategy #17050
Comments
Refs #17029 |
@diosmosis what be your time estimate for making this work? I suppose we can't migrate existing segments in the DB because we don't know how often they were encoded? As we still have to support old triple urlencoded segments, what would be the plan to get rid of this so we can benefit from a simpler solution? AKA would we basically simply say in 3-4 years time we no longer support the old way of such definitions? Because people delete their raw data but keep historical data we might lose reports eventually. |
I'd say 1-2 days including automated tests. It changes should only be in a couple places.
I think the biggest issue w/ archive table segment hashes is that we don't always know what the segments are. Not every segment used will be in the
Only thing I can think of would be to migrate on demand until matomo 5 then say if it's not migrated already it won't be. Or... we could have an update that does the migration for segments in the What do you think? |
@diosmosis wondering: I understand the change would be mostly to fix the hash problem? If that's the case, could we in future maybe somehow store the hash in the segment table and then detect the correct segment definition from the Just wondering if this would be a solution cause if we introduced a new way, then at least for 2-3 years we'd still need to support the old way of segment definitions until we introduce Matomo 5 and can break API. From that point we'd probably still want to support previously generated segment definitions for 2-3 years so we probably have to always support the old way for at least 5 years as from what I see we maybe can't migrate existing segment definitions. I assume there's still though the problem that users might not encode the value in a condition etc. So I don't know if it's possible but eventually say from Matomo 5 could maybe check if we can enforce correct encoding from that moment. I'm not too much into it though so just checking what is possible and what the root problems are etc. |
I think I understand what you're talking about and I think it might work, but it seems like adding more complexity to an already too complicated solution.
Well "support the old way" for me is just "redirect to new segment encoding URL if old encoding is found" and "include the old hash when selecting archives", so I don't think it's that much extra work. Assuming I'm not forgetting anything. |
@diosmosis can you show how some examples would look like eg for page URLs with various special characters vs what they currently need to look like? They would still need to be encoded right but of course less often maybe. I suppose explaining developers to replace eg % with $ might also not be trivial.
Not sure how much complexity this adds (to me this sounds bit more complicated but hard to say). I suppose sometimes even just adding a redirect could technically break some APIs and cause breaking things so we would need to be careful there and avoid redirects but convert it internally. |
readable segment (just for illustration): segment w/ new encoding outside of URL: Note: doing it this way means we don't have to worry about how often we To detect for old encoding, we check if there is a
Don't have to redirect, could just change the value in |
@diosmosis can you also quickly show what this would look like in PHP to create such a segment definition and what the documentation for users would say on how such a definition would need to be created? That's often a good indicator how easy it'll be for them |
@tsteur for this specific encoding would be like:
I guess the str_replace could be seen as more complicated. If there's a builtin encode function that's better we could use that as well. We could also allow quotes or something like that, eg:
Not sure if this would be better or worse. |
So it be like echo urlencode('pageUrl=@' . str_replace('%', '$', urlencode('http://serenity.org/docs/javascript-tracking/?x=w$5%6l435k343&+foo')) . ',' . 'userId=@' . str_replace('%', '$', urlencode('foobarbayz"w$5%6l435k343&+foo')));
// or maybe better `rawurlencode`. It would look like this when we receive the request:
And to know whether someone actually encoded the value we would look if there is a dollar character? Like In the documentation we would need to explain it like this I suppose
I'm not sure yet this is really a simpler strategy and simpler code? Also can't really see how it's more readable compared to just: echo urlencode('pageUrl=@' . rawurlencode('http://serenity.org/docs/javascript-tracking/?x=w$5%6l435k343&+foo') . ',' . 'userId=@' . rawurlencode('foobarbayz"w$5%6l435k343&+foo')); without the
See above that might be still the case when they use the I might be seeing it wrong but this might not be the solution yet. If we change it, maybe might need to think about using some JSON or so but this can result again maybe in different errors not sure. Random thought could be also in the future to have some kind of versioning in there or so. Like "this is segment definition version v2" or something. Not sure it would help (maybe not) but maybe it could for the future to abstract some things this way and be able to change it more easily. Might not be needed... I suppose what would make the code base a lot simpler is the segment hash issue and avoids having random issues. This we could maybe solve also differently. |
Not quite. We only check for the old encoding to convert. So it'd be something like:
The value actually has to be encoded twice, and then the entire segment encoded once. Encoding the value once, then encoding a different way for the URL should be simpler. The URL value will still look complicated, but we don't care about that in the code (ie, we no longer do Request::getRawUrlSegmentValue() or whatever).
If a user makes a mistake in manually constructing a segment, There could be other ways of encoding the values. And we could also only require it for special operator characters (ie, characters used in other parts of segment condition, like
This is what GA does and it would work (it's basically sending the segment parsed), but it's rather different from our normal segment definitions. Wasn't sure if that would be acceptable. If thinking about alternative solutions, I can think of the following changes:
This still means we have to support the old triple encoded values though, not sure if there's more chance of regressions here. If we only do the first and second it might work too. EDIT: Though even w/ this it won't prevent segment encoding issues, as I remember there was also an issue in the UI for segments where the segment value had |
@diosmosis I'll need to read your comment properly tomorrow or in 2 days. Without thinking too much I wouldn't see an issue with changing the structure completely so things make a lot more sense instead of maybe ending up with similar issues and it would clearly let us know which variation / version is used etc. It's probably unrelated but in future we might also support filters ( #15438 ) and maybe even sequences (where a visitor needs to match a sequence of actions instead of just one action) although for performance reasons we might not be able to. Anyway. A different format might also help with other features in the future although this shouldn't influence things too much because we might not work on them and who knows what things will look like then. Just thought I mention it anyway. Re GA is this the right doc? https://developers.google.com/analytics/devguides/reporting/core/v3/segments |
@tsteur i think https://developers.google.com/analytics/devguides/reporting/core/v4/migration#v4_8 is it (see the v4 version of API request payloads) |
@diosmosis random thought... what if at least the UI and the archiving was to send the segmentId instead of the segment definition? Like Generally also noticing https://developer.matomo.org/api-reference/reporting-api-segmentation#segment-values-must-be-url-encoded barely mentions anything re encoding so not surprised there are overall issues. Of course the current problems are mostly also internal. We could maybe use some JSON definition like encodeURIComponent(JSON.stringify([
[{dimension: "pageurl", operator: "equals", match: "test1"},
{dimension: "pageurl", operator: "equals", match: "OR (because within the same array we use OR) test2"}],
[{dimension: "pageurl", operator: "equals", match: "AND (because new array we use and here) test3"},
{dimension: "pageurl", operator: "equals", match: "OR test4"}]
])) The downside be though that it be a bit more complicated to explain in the developer docs maybe and the URL gets longer if used in a GET (and could get too long for some webserver/firewall configs). We would need to always POST the segment. Alternatively, maybe we could build the hash based on the parsed expression if that would solve the issue and if we could this way keep the more simple version of defining a segment URL parameter. The biggest problem be though rewriting all archiving names. And where browser archiving is used and they didn't create a segment, we wouldn't be able to rename their existing archives. That's because they use simply any segment string and use it in the URL and data is archived on demand when viewing the reports. This could make reports become unavailable. The only real alternative I can think of be to store the segment hash in the segment table. This would also be future proof and would allow us more easily change the segment definitions etc without needing to worry re any archives etc. (meaning some logic like this to see if we know a segment definition and if so use the hash from the DB table #17029 (comment) ). Having the hash in the table would also work nicely with using segmentIds in segment query parameter like mentioned initially with I'm thinking the I wonder how to best make progress on this as it's quite complicated and it's easy to make things even more complicated depending what we do. Maybe we need to have eventually a quick team call with @sgiehl and @mattab to discuss possibilities. If it's quickly done we could eventually also do a few proof of concept implementation for 2 or 3 different versions but of course it's always in the detail and all the edge cases (eg when using the |
Note: this also doesn't solve every problem, because the wrong encoding can still end up changing the segment value itself. For example, if the segment is So it's not just and issue w/ the hash. |
I'm thinking the most urgent fix to solve is the hash problem as it causes bugs and results in confusing code. And I'm thinking this problem is independent of the encoding problems that we might read it differently etc and how a segment be best defined. I reckon the hash problem could be solved maybe as a first step more easily eg by storing a hash in the The Of course if we were to support a different segment definition like using JSON, then we could maybe build the hash based on the definition again. But not sure if we would be able to fully fix things because we'd still had to support the old way of defining segments not sure. |
So for this, we'd have to in core:archive look for segment by hash stored in the table. But we could just urlencode the definition and use it to calculate the hash as is done in this PR w/ the API method? I'm not sure I see the need for an extra column. The definition will always be urldecoded once from the true value, it'll never be some sort of random encoding.
I'm not sure this is true. The main issue here (for both |
We've recently had again segment encoding issues in #19617 that caused us issues and for example a segment defined in a JSON structure would have prevented this. |
Currently segments are basically triple url encoded:
$_GET/$_POST
(so the value is triple encoded at this point)All three urldecodes happen in PHP when processing a Segment. The system works, but is overly complicated and in some cases, can introduce subtle bugs. For example, based on whether the correct amount of encoding is used or not in a segment string, the hash will change, which means it's possible to use equivalent segments but get no data from the API.
We could fix this by using a different encoding scheme to encode segment condition values, and then only urlencoding the segment in the URL. For example, we could urlencode values then replace
%
characters w/$
as is done in datatable row action parameters. Then in PHP we would use the segment value found in$_GET/$_POST
w/o issue.Backwards Compatibility
To do this w/o breaking BC, we'd have to:
The text was updated successfully, but these errors were encountered: