Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise segment encoding strategy #17050

Open
diosmosis opened this issue Jan 5, 2021 · 18 comments
Open

Revise segment encoding strategy #17050

diosmosis opened this issue Jan 5, 2021 · 18 comments

Comments

@diosmosis
Copy link
Member

diosmosis commented Jan 5, 2021

Currently segments are basically triple url encoded:

  • each value in a condition is encoded once to allow segment operator characters
  • each value is encoded again, because the code server side urldecodes the segment condition (so the value is double encoded at this point)
  • and the entire segment is encoded in the URL and matomo uses this value, not the urldecoded one in $_GET/$_POST (so the value is triple encoded at this point)

All three urldecodes happen in PHP when processing a Segment. The system works, but is overly complicated and in some cases, can introduce subtle bugs. For example, based on whether the correct amount of encoding is used or not in a segment string, the hash will change, which means it's possible to use equivalent segments but get no data from the API.

We could fix this by using a different encoding scheme to encode segment condition values, and then only urlencoding the segment in the URL. For example, we could urlencode values then replace % characters w/ $ as is done in datatable row action parameters. Then in PHP we would use the segment value found in $_GET/$_POST w/o issue.

Backwards Compatibility

To do this w/o breaking BC, we'd have to:

  • still support old triple urlencoded segment values in the URL (preferrably by converting these segment parameters to the proper encoding)
  • and still support looking for triple urlencoded segment archives (where the hash will be different than w/ the new encoding) so we don't have to re-archive everything in the past
@diosmosis
Copy link
Member Author

Refs #17029

@tsteur
Copy link
Member

tsteur commented Jan 10, 2021

@diosmosis what be your time estimate for making this work? I suppose we can't migrate existing segments in the DB because we don't know how often they were encoded? As we still have to support old triple urlencoded segments, what would be the plan to get rid of this so we can benefit from a simpler solution? AKA would we basically simply say in 3-4 years time we no longer support the old way of such definitions? Because people delete their raw data but keep historical data we might lose reports eventually.

@diosmosis
Copy link
Member Author

@tsteur

@diosmosis what be your time estimate for making this work?

I'd say 1-2 days including automated tests. It changes should only be in a couple places.

I suppose we can't migrate existing segments in the DB because we don't know how often they were encoded?

I think the biggest issue w/ archive table segment hashes is that we don't always know what the segments are. Not every segment used will be in the segment table if browser archiving of segments is enabled.

As we still have to support old triple urlencoded segments, what would be the plan to get rid of this so we can benefit from a simpler solution?

Only thing I can think of would be to migrate on demand until matomo 5 then say if it's not migrated already it won't be. Or... we could have an update that does the migration for segments in the segment table, then provide a command to do it for custom segments, where the segment would be specified via a CLI parameter. Then we can have a notification that pops up if using a triple encoded segment + there's no data saying they might need to migrate.

What do you think?

@tsteur
Copy link
Member

tsteur commented Jan 22, 2021

@diosmosis wondering: I understand the change would be mostly to fix the hash problem? If that's the case, could we in future maybe somehow store the hash in the segment table and then detect the correct segment definition from the segment table like this maybe? #17029 (comment) . We could maybe generate the hash on "segment" creation and if no segment was found in the table (eg when using real time archiving) then we'd simply generate the hash based on the definition like it used to as it then mostly shouldn't matter maybe?

Just wondering if this would be a solution cause if we introduced a new way, then at least for 2-3 years we'd still need to support the old way of segment definitions until we introduce Matomo 5 and can break API. From that point we'd probably still want to support previously generated segment definitions for 2-3 years so we probably have to always support the old way for at least 5 years as from what I see we maybe can't migrate existing segment definitions.

I assume there's still though the problem that users might not encode the value in a condition etc. So I don't know if it's possible but eventually say from Matomo 5 could maybe check if we can enforce correct encoding from that moment.

I'm not too much into it though so just checking what is possible and what the root problems are etc.

@diosmosis
Copy link
Member Author

@diosmosis wondering: I understand the change would be mostly to fix the hash problem?

  • to fix the hash problem (different # of encodings on equivalent segments having different hashes)
  • to have a simpler strategy and eventually simpler code
  • to have more readable segments
  • to never have to deal w/ weird segment encoding issues where a segment is encoded once but needs to be encoded three times or whatever

We could maybe generate the hash on "segment" creation and if no segment was found in the table (eg when using real time archiving) then we'd simply generate the hash based on the definition like it used to as it then mostly shouldn't matter maybe?

I think I understand what you're talking about and I think it might work, but it seems like adding more complexity to an already too complicated solution.

From that point we'd probably still want to support previously generated segment definitions for 2-3 years so we probably have to always support the old way for at least 5 years as from what I see we maybe can't migrate existing segment definitions.

Well "support the old way" for me is just "redirect to new segment encoding URL if old encoding is found" and "include the old hash when selecting archives", so I don't think it's that much extra work. Assuming I'm not forgetting anything.

@tsteur
Copy link
Member

tsteur commented Jan 22, 2021

@diosmosis can you show how some examples would look like eg for page URLs with various special characters vs what they currently need to look like? They would still need to be encoded right but of course less often maybe. I suppose explaining developers to replace eg % with $ might also not be trivial.

Well "support the old way" for me is just "redirect to new segment encoding URL if old encoding is found" and "include the old hash when selecting archives", so I don't think it's that much extra work. Assuming I'm not forgetting anything.

Not sure how much complexity this adds (to me this sounds bit more complicated but hard to say). I suppose sometimes even just adding a redirect could technically break some APIs and cause breaking things so we would need to be careful there and avoid redirects but convert it internally.

@diosmosis
Copy link
Member Author

diosmosis commented Jan 22, 2021

@diosmosis can you show how some examples would look like eg for page URLs with various special characters vs what they currently need to look like? They would still need to be encoded right but of course less often maybe. I suppose explaining developers to replace eg % with $ might also not be trivial.

readable segment (just for illustration): pageUrl =@ http://serenity.org/docs/javascript-tracking/
segment in URL (current way): segment=pageUrl%3D%40http%25253A%25252F%25252Fserenity.org%25252Fdocs%25252Fjavascript-tracking%25252F

segment w/ new encoding outside of URL: pageUrl=@http$3A$2F$2Fserenity.org$2Fdocs$2Fjavascript-tracking$2F
segment in URL: segment=pageUrl%3D%40http%243A%242F%242Fserenity.org%242Fdocs%242Fjavascript-tracking%242F

Note: doing it this way means we don't have to worry about how often we urldecode() since the value encoding isn't the same format, and we can just get the value from $_GET/$_POST instead of having to go through `Request::getRawSegment... etc.)

To detect for old encoding, we check if there is a % in the value in $_GET/$_POST.
To convert from old encoding, we split on operators, decode segment values appropriate number of times (twice), then re-encode properly, then re-combine.

I suppose sometimes even just adding a redirect could technically break some APIs and cause breaking things so we would need to be careful there and avoid redirects but convert it internally.

Don't have to redirect, could just change the value in $_GET/$_POST. Maybe REQUEST_URI too.

@tsteur
Copy link
Member

tsteur commented Jan 22, 2021

@diosmosis can you also quickly show what this would look like in PHP to create such a segment definition and what the documentation for users would say on how such a definition would need to be created? That's often a good indicator how easy it'll be for them

@diosmosis
Copy link
Member Author

@tsteur for this specific encoding would be like:

$segment = 'pageUrl=@' . str_replace('%', '$', urlencode('http://serenity.org/docs/javascript-tracking/'));
$url = 'segment=' . urlencode($segment); // or just use http_build_query or something

I guess the str_replace could be seen as more complicated. If there's a builtin encode function that's better we could use that as well.

We could also allow quotes or something like that, eg:

$segment = 'pageUrl=@"http://serenity.org/docs/javascript-tracking/"'

Not sure if this would be better or worse.

@tsteur
Copy link
Member

tsteur commented Jan 24, 2021

So it be like

echo urlencode('pageUrl=@' . str_replace('%', '$', urlencode('http://serenity.org/docs/javascript-tracking/?x=w$5%6l435k343&+foo')) . ',' . 'userId=@' . str_replace('%', '$', urlencode('foobarbayz"w$5%6l435k343&+foo')));
// or maybe better `rawurlencode`.

It would look like this when we receive the request:

pageUrl=@http$3A$2F$2Fserenity.org$2Fdocs$2Fjavascript-tracking$2F$3Fx$3Dw$245$256l435k343$26$2Bfoo,userId=@foobarbayz$22w$245$256l435k343$26$2Bfoo

And to know whether someone actually encoded the value we would look if there is a dollar character? Like userId=$59595 then the value might be url encoded and replaced with a dollar or not? Or we assume any time there is a $ that it is urlencoded? Eg when you change above then in PHP to userId=%59595 and urldecode it, it results in userId=Y595. So if users aren't using it correctly it will still result in errors (that are hard to debug/troubleshoot) but this might be unavoidable anyway.

In the documentation we would need to explain it like this I suppose

The value within each segment conditions needs to be url encoded according to RFC 3986 and then all characters need to be replaced with a dollar sign ($). The entire segment conditions then needs to be URL encoded again before sending the HTTP request. Make sure the entire segment parameter will be only url encoded once. If your Network or HTTP class also encodes all URL parameters, then you must not encode the entire segment parameter. Here is an example in PHP:...

I'm not sure yet this is really a simpler strategy and simpler code? Also can't really see how it's more readable compared to just:

echo urlencode('pageUrl=@' . rawurlencode('http://serenity.org/docs/javascript-tracking/?x=w$5%6l435k343&+foo') . ',' . 'userId=@' . rawurlencode('foobarbayz"w$5%6l435k343&+foo'));

without the str_replace. Technically, I don't think it would need three URL encode maybe and two would be enough?

to never have to deal w/ weird segment encoding issues where a segment is encoded once but needs to be encoded three times or whatever

See above that might be still the case when they use the $ sign as part of a value but don't encode it? Or we wouldn't know if they used the old or new way of defininig a segment?

I might be seeing it wrong but this might not be the solution yet. If we change it, maybe might need to think about using some JSON or so but this can result again maybe in different errors not sure. Random thought could be also in the future to have some kind of versioning in there or so. Like "this is segment definition version v2" or something. Not sure it would help (maybe not) but maybe it could for the future to abstract some things this way and be able to change it more easily. Might not be needed...

I suppose what would make the code base a lot simpler is the segment hash issue and avoids having random issues. This we could maybe solve also differently.

@diosmosis
Copy link
Member Author

diosmosis commented Jan 25, 2021

And to know whether someone actually encoded the value we would look if there is a dollar character? Like userId=$59595 then the value might be url encoded and replaced with a dollar or not? Or we assume any time there is a $ that it is urlencoded?

Not quite. We only check for the old encoding to convert. So it'd be something like:

  • check if '%' exists in value from $_GET/$_POST. if it does, it's encoded the old way, so convert to new way.
  • if not (or after converting), we assume it's encoded with the new way, so $ will always mean an encoded character

I'm not sure yet this is really a simpler strategy and simpler code? Also can't really see how it's more readable compared to just:

The value actually has to be encoded twice, and then the entire segment encoded once. Encoding the value once, then encoding a different way for the URL should be simpler. The URL value will still look complicated, but we don't care about that in the code (ie, we no longer do Request::getRawUrlSegmentValue() or whatever).

See above that might be still the case when they use the $ sign as part of a value but don't encode it? Or we wouldn't know if they used the old or new way of defininig a segment?

If a user makes a mistake in manually constructing a segment, $ would come out wrong... but we could detect there is, eg, letters after it and not numbers. That does seem more complicated

There could be other ways of encoding the values. And we could also only require it for special operator characters (ie, characters used in other parts of segment condition, like =, !, @, etc.).

I might be seeing it wrong but this might not be the solution yet. If we change it, maybe might need to think about using some JSON or so but this can result again maybe in different errors not sure.

This is what GA does and it would work (it's basically sending the segment parsed), but it's rather different from our normal segment definitions. Wasn't sure if that would be acceptable.

If thinking about alternative solutions, I can think of the following changes:

  • we can construct the hash based on the parsed segment tree, rather than the definition string
  • stop using the raw URL encoded value and just not decode the entire string
  • don't urldecode each segment condition so we only have to encode the value once (then the whole query parameter once)

This still means we have to support the old triple encoded values though, not sure if there's more chance of regressions here. If we only do the first and second it might work too.

EDIT: Though even w/ this it won't prevent segment encoding issues, as I remember there was also an issue in the UI for segments where the segment value had %25 for example (as the raw value). We were only encoding the segment value once, so it was decoded as + in the php server and didn't match any visits (because the actual URLs had %25 in them). This is all due to sharing URL encoding w/ segment value encoding.

@tsteur
Copy link
Member

tsteur commented Jan 25, 2021

@diosmosis I'll need to read your comment properly tomorrow or in 2 days. Without thinking too much I wouldn't see an issue with changing the structure completely so things make a lot more sense instead of maybe ending up with similar issues and it would clearly let us know which variation / version is used etc.

It's probably unrelated but in future we might also support filters ( #15438 ) and maybe even sequences (where a visitor needs to match a sequence of actions instead of just one action) although for performance reasons we might not be able to. Anyway. A different format might also help with other features in the future although this shouldn't influence things too much because we might not work on them and who knows what things will look like then. Just thought I mention it anyway.

Re GA is this the right doc? https://developers.google.com/analytics/devguides/reporting/core/v3/segments

@diosmosis
Copy link
Member Author

@tsteur i think https://developers.google.com/analytics/devguides/reporting/core/v4/migration#v4_8 is it (see the v4 version of API request payloads)

@tsteur
Copy link
Member

tsteur commented Jan 29, 2021

@diosmosis random thought... what if at least the UI and the archiving was to send the segmentId instead of the segment definition? Like &segment=11. If the value is numeric we'd assume it's the ID etc. I understand though this would AFAIK still not fix the issue with the hash and would only work if we also stored the hash in the segment table. As far as I understand the segment would still need to be urlencoded sometimes from the segment table etc and it wouldn't solve anything. So probably won't help.

Generally also noticing https://developer.matomo.org/api-reference/reporting-api-segmentation#segment-values-must-be-url-encoded barely mentions anything re encoding so not surprised there are overall issues. Of course the current problems are mostly also internal.

We could maybe use some JSON definition like

encodeURIComponent(JSON.stringify([
[{dimension: "pageurl", operator: "equals", match: "test1"}, 
 {dimension: "pageurl", operator: "equals", match: "OR (because within the same array we use OR) test2"}],
[{dimension: "pageurl", operator: "equals", match: "AND (because new array we use and here) test3"}, 
 {dimension: "pageurl", operator: "equals", match: "OR test4"}]
]))

The downside be though that it be a bit more complicated to explain in the developer docs maybe and the URL gets longer if used in a GET (and could get too long for some webserver/firewall configs). We would need to always POST the segment.

Alternatively, maybe we could build the hash based on the parsed expression if that would solve the issue and if we could this way keep the more simple version of defining a segment URL parameter. The biggest problem be though rewriting all archiving names. And where browser archiving is used and they didn't create a segment, we wouldn't be able to rename their existing archives. That's because they use simply any segment string and use it in the URL and data is archived on demand when viewing the reports. This could make reports become unavailable.

The only real alternative I can think of be to store the segment hash in the segment table. This would also be future proof and would allow us more easily change the segment definitions etc without needing to worry re any archives etc. (meaning some logic like this to see if we know a segment definition and if so use the hash from the DB table #17029 (comment) ). Having the hash in the table would also work nicely with using segmentIds in segment query parameter like mentioned initially with &segment=11.

I'm thinking the str_replace $ % solution will make things only more complicated because it doesn't really change all that much but adds another variation to everything.

I wonder how to best make progress on this as it's quite complicated and it's easy to make things even more complicated depending what we do. Maybe we need to have eventually a quick team call with @sgiehl and @mattab to discuss possibilities. If it's quickly done we could eventually also do a few proof of concept implementation for 2 or 3 different versions but of course it's always in the detail and all the edge cases (eg when using the str_replace $ %) would maybe only appear after time.

@diosmosis
Copy link
Member Author

Alternatively, maybe we could build the hash based on the parsed expression if that would solve the issue and if we could this way keep the more simple version of defining a segment URL parameter.

Note: this also doesn't solve every problem, because the wrong encoding can still end up changing the segment value itself. For example, if the segment is pageUrl == http://abc.com/def+ghi, and the value is only double encoded in the URL, then matomo will think the segment is pageUrl == http://abc.com/def ghi, which will not match the correct visits.

So it's not just and issue w/ the hash.

@tsteur
Copy link
Member

tsteur commented Jan 29, 2021

I'm thinking the most urgent fix to solve is the hash problem as it causes bugs and results in confusing code. And I'm thinking this problem is independent of the encoding problems that we might read it differently etc and how a segment be best defined.

I reckon the hash problem could be solved maybe as a first step more easily eg by storing a hash in the segment table. This way we could likely get rid of various code like the urlencode we recently added to Segment::__construct and likely fix most segment related archiving issues. On core update we could populate the correct hash for each configured segment. There could be problems though with this solution although I currently can't think of any (besides maybe performance of having to query segments and race conditions between segment creation and updating etc and cache invalidation). Of course there might be also other solutions although various encodings seem to always be possible to cause issues.

The + is indeed a problem eg see https://3v4l.org/pNkTT and we'll need to specify in the future what kind of encoding we expect to make things work properly independent of what we'll do.

Of course if we were to support a different segment definition like using JSON, then we could maybe build the hash based on the definition again. But not sure if we would be able to fully fix things because we'd still had to support the old way of defining segments not sure.

@diosmosis
Copy link
Member Author

I reckon the hash problem could be solved maybe as a first step more easily eg by storing a hash in the segment table.

So for this, we'd have to in core:archive look for segment by hash stored in the table. But we could just urlencode the definition and use it to calculate the hash as is done in this PR w/ the API method? I'm not sure I see the need for an extra column. The definition will always be urldecoded once from the true value, it'll never be some sort of random encoding.

Of course there might be also other solutions although various encodings seem to always be possible to cause issues.

I'm not sure this is true. The main issue here (for both + + the hash) is that we're confusing URL encoding w/ encoding segment values. Since they use the same method, it's possible for decoding one thing to affect the other. If the decoding method for the url parameter didn't affect the segment value, then we wouldn't have the problem of having equivalent segments encoded differently.

@tsteur
Copy link
Member

tsteur commented Aug 17, 2022

We've recently had again segment encoding issues in #19617 that caused us issues and for example a segment defined in a JSON structure would have prevented this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants