New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
properly encode segment definitions from table so the hash will be the same as from query params #17029
Conversation
…inition from the segment table, make sure to urlencode so the hash is the same as when we send the segment as a query parameter
2345a12
to
53a54a8
Compare
@diosmosis but isn't the segment object able to handle encoded and decoded segment definitions here: Lines 143 to 165 in 1b2769d
Would it make more sense to have isSegmentEncoded accessible and use that to get the proper definition where it needs to be urlencoded?
|
@sgiehl That is not what this issue is about, this is about the |
@diosmosis Adjusting the segment hash methods doesn't work in this case? Maybe something like public function getHash()
{
if (empty($this->string)) {
return '';
}
return self::getSegmentHash($this->string, $this->isSegmentEncoded);
}
public static function getSegmentHash($definition, $isEncoded=true)
{
return $isEncoded ? md5(urldecode($definition)) : md5($definition);
} |
@sgiehl no, because segments like |
Do we have an issue for that? If not maybe we should create one. |
I don't think there's an agreed upon solution, but I can make an issue. |
Ok. I guess as kind of a quick fix the changes here are good to merge. But we really should improve the encoding strategy in one of the next releases if possible. I'm sure we will run into other issues with this technical dept otherwise. ping @tsteur |
@tsteur can you take a look at this PR? |
@diosmosis will this break BC in any way? I've looked at the changes but it's hard to tell what it does and things seem to get quite complicated with all the encodings. It's quite hard to understand this all. Is there maybe a way we can achieve the same fix within the segment class? I wonder if we could do like in the DB a query whether we know a certain segment definition and if not found, we urlencode the definition and check again if there's an existing definition and if not found we may try urldecode() and if then not found we use the original hash? That be quite time consuming maybe and maybe we'd need to keep a cache of all known segment definitions. Just a random thought and might make things not better. It's a bit hard to understand what this issue fixes exactly and how etc. Is it otherwise maybe also possible to add a test? or the removed parts in the test in this PR are basically the proof this works? |
…s is just a test fix? if I undo the changes the new test changes still pass)
@tsteur I couldn't make a system test like ArchiveCronTest fail to demonstrate the bug, the best I could do is create a new test that demonstrates how the hash is different. In a new system test for the SegmentEditor plugin, we create a segment via an HTTP request, then get an API method that returns the hash and add a hash to the SegmentEditor.getAll return. They should be the same, but if we don't urlencode the segment definition in SegmentEditor.getAll, they won't be. Note: there are also some changes to ArchiveCronTest. I noticed it wasn't 100% correct. The new expected test file changes are unrelated to this issue (it has to do w/ saving the config changes re enforce no browser archiving; the "no auto archive" segment should still be auto archived). |
@diosmosis Did you have a chance to check if a fix in the segment constructor would be maybe a solution where we try to find a matching segment definition from the list of defined segments? I reckon this might work more reliably then maybe. BTW was this already an issue on Matomo 3 or how come it's a problem now? Do we know why this changed? Just checking because getting confused by all the URL encodes etc. |
How would this fix the issue? That sounds unrelated.
Usually we just append the definition to a URL (eg, climulti, scheduled reports), and in so doing we have to urlencode. This problem has to do w/ the hash, so constructing a Segment w/ the urldecoded definition is the problem. We do this a lot more now in core:archive since we check, eg, if an archive exists already + other things. |
I was thinking something like
This way we don't need to know when to pass it urlencoded and when not. It would always prefer the variation that is known from the |
The point of this issue is we ALWAYS pass it urlencoded if getting from a segment definition. Checking for random variations doesn't work, because the definition in the DB is not the correct version. |
I suppose the same would then apply when someone passes it from the UI? Just checking if this could maybe fix the issue for many cases or whether it would still be an issue depending on how a developer sends the segment string in an API request etc |
This isn't about segments passed from the UI, those are always "correct". The problem is (as described above), is that these values are used "raw" in API methods, as in w/o the urldecoding that happens to To understand, run the test I wrote. Then in SegmentEditor.getAll, remove the
developers sending segments in the API should match how the UI sends them, otherwise the hash will be different and there will be problems. And currently, if they send what is in the definition w/o urlencoding it properly, the hash will be different than what is archived and it will not select any data. But this is kind of impossible since if developers use the segment table definition in a URL, they will have to |
I guess the biggest problem I'm seeing is that we have like close to 180 instance creations of the segment class and it's quite hard to understand when a urlencode is needed and when not. Maybe we could somehow at least document when it's needed if it can be put in a "rule"? Also developers might often not set the segment based on the segment table definition but they construct it themselves. Although it seems to be able to handle a lot of different ways now and it's likely fine (still wondering though if there are now some formats that won't work anymore but probably not). I was hoping we could somehow take some complexity out by not needing to know when to pass it urlencoded and when not and instead detect it automatically. If someone was using I see though generally what you're meaning and I adjusted the tests you wrote many times and can see it seems to work and handle various things. Just finding it unclear when to use the urlencode and when not and wonder if there eg many other uses where we're using it wrong or not etc. It's of course a general problem with how segment definitions were implemented in the beginning. It's also now Can we maybe document when we need to use |
Yes, this is a challenge. This PR is definitely a quick fix, ideally we would put in a more normalized encoding strategy (in the linked issue). The "rule" would be, if the segment is sourced from the
No, this doesn't change the output of API methods to keep BC (and since several features seem to need it to stay that way).
Not sure about this? If I look at the PHP code getAll just returns the definition in the table. If you mean the code I added, that's JUST for calculating the hash used.
We can document (though I'm not sure where). It's use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for mentioning the "rule". I created https://github.com/matomo-org/developer-documentation/pull/416/files for now to document it there and also added the overall encoding strategy. Can you have a look? If more comes to your mind feel free to add more documentation.
Not sure about this? If I look at the PHP code getAll just returns the definition in the table. If you mean the code I added, that's JUST for calculating the hash used.
👍 I realise this only now
Also left one other comment. Ideally we would have calculated the segment hash based on how it's defined in the segment table but this would have needed to be implemented from the beginning.
|
||
public function getSegmentHash($idSite, $segment) | ||
{ | ||
$segment = new Segment($segment, [(int) $idSite]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi do we maybe need to check for permissions here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the exampleplugin, so i guess it's not important, but it might be better to have a good demonstration, so I added it
also be great to know what a segment would look like with a |
@tsteur ok to merge this? |
ping @tsteur |
@diosmosis yes 👍 |
Description:
This PR fixes an inconsistency in segment hashing. When sending the segment as a query parameter (ie,
segment=
), we always read it from$_SERVER['QUERY_STRING']
, using the raw value which is urlencoded. When saving the segment in the segment table, however, we send a different query param,definition=
. This parameter is not treated the same way, we don't get the raw value for it, but a urldecoded value. Which means every definition value in thesegment
table is not the same value that is sent by the segment selector and used in API requests. So, if we use the segment definition as is, the hash can be different then the hash with the UI query parameter (if there is an encoded value within a value).Currently this is causing problems in archiving and invalidation, since we construct
Segment
instances using the stored segment definition directly. The mismatch means data is either not archived or is archived with an improper hash, depending on the segment.This is fixed here by always doing a
urlencode
on the$segment['definition']
property when creating aSegment
instance or getting a segment hash. We can't just urldecode the definition when querying the segment, since it breaks other features that depend on it being decoded (and, eg, encoding it before putting it into a URL). It could also be seen as a BC break.Review