Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

properly encode segment definitions from table so the hash will be the same as from query params #17029

Merged
merged 15 commits into from Feb 17, 2021

Conversation

diosmosis
Copy link
Member

Description:

This PR fixes an inconsistency in segment hashing. When sending the segment as a query parameter (ie, segment=), we always read it from $_SERVER['QUERY_STRING'], using the raw value which is urlencoded. When saving the segment in the segment table, however, we send a different query param, definition=. This parameter is not treated the same way, we don't get the raw value for it, but a urldecoded value. Which means every definition value in the segment table is not the same value that is sent by the segment selector and used in API requests. So, if we use the segment definition as is, the hash can be different then the hash with the UI query parameter (if there is an encoded value within a value).

Currently this is causing problems in archiving and invalidation, since we construct Segment instances using the stored segment definition directly. The mismatch means data is either not archived or is archived with an improper hash, depending on the segment.

This is fixed here by always doing a urlencode on the $segment['definition'] property when creating a Segment instance or getting a segment hash. We can't just urldecode the definition when querying the segment, since it breaks other features that depend on it being decoded (and, eg, encoding it before putting it into a URL). It could also be seen as a BC break.

Review

  • Functional review done
  • Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • Security review done see checklist
  • Code review done
  • Tests were added if useful/possible
  • Reviewed for breaking changes
  • Developer changelog updated if needed
  • Documentation added if needed
  • Existing documentation updated if needed

@diosmosis diosmosis added the Needs Review PRs that need a code review label Dec 30, 2020
@diosmosis diosmosis added this to the 4.2.0 milestone Dec 30, 2020
@diosmosis diosmosis added the Regression Indicates a feature used to work in a certain way but it no longer does even though it should. label Dec 30, 2020
…inition from the segment table, make sure to urlencode so the hash is the same as when we send the segment as a query parameter
@sgiehl
Copy link
Member

sgiehl commented Jan 4, 2021

@diosmosis but isn't the segment object able to handle encoded and decoded segment definitions here:

matomo/core/Segment.php

Lines 143 to 165 in 1b2769d

$subexpressionsDecoded = 0;
try {
$this->initializeSegment(urldecode($segmentCondition), $idSites);
$subexpressionsDecoded = $this->segmentExpression->getSubExpressionCount();
} catch (Exception $e) {
// ignore
}
$subexpressionsRaw = 0;
try {
$this->initializeSegment($segmentCondition, $idSites);
$subexpressionsRaw = $this->segmentExpression->getSubExpressionCount();
} catch (Exception $e) {
// ignore
}
if ($subexpressionsRaw > $subexpressionsDecoded) {
$this->initializeSegment($segmentCondition, $idSites);
$this->isSegmentEncoded = false;
} else {
$this->initializeSegment(urldecode($segmentCondition), $idSites);
$this->isSegmentEncoded = true;
}

Would it make more sense to have isSegmentEncoded accessible and use that to get the proper definition where it needs to be urlencoded?

@diosmosis
Copy link
Member Author

diosmosis commented Jan 4, 2021

but isn't the segment object able to handle encoded and decoded segment definitions here

@sgiehl That is not what this issue is about, this is about the getHash() function, it changes based on the content of the segment, and because of matomo's inexplicably complicated way to url encode a segment, the definition stored in the database is not the same as what is sent in the URL in the UI. You can read more in the issue description.

@sgiehl
Copy link
Member

sgiehl commented Jan 5, 2021

@diosmosis Adjusting the segment hash methods doesn't work in this case? Maybe something like

    public function getHash()
    {
        if (empty($this->string)) {
            return '';
        }
        return self::getSegmentHash($this->string, $this->isSegmentEncoded);
    }

    public static function getSegmentHash($definition, $isEncoded=true)
    {
        return $isEncoded ? md5(urldecode($definition)) : md5($definition);
    }

@diosmosis
Copy link
Member Author

@sgiehl no, because segments like pageUrl%3D@http%25253A%25252F%25252Fserenity.org%25252Fdocs%25252Fjavascript-tracking%25252F and
pageUrl=@http%253A%252F%252Fserenity.org%252Fdocs%252Fjavascript-tracking%252F are the same and are both "encoded", but have different hashes when used in Segment. Ideally, we'd just fix segment encoding so it wasn't so complicated, but that's not on the milestone.

@sgiehl
Copy link
Member

sgiehl commented Jan 5, 2021

Ideally, we'd just fix segment encoding so it wasn't so complicated, but that's not on the milestone.

Do we have an issue for that? If not maybe we should create one.

@diosmosis
Copy link
Member Author

Do we have an issue for that? If not maybe we should create one.|

I don't think there's an agreed upon solution, but I can make an issue.

@sgiehl
Copy link
Member

sgiehl commented Jan 7, 2021

Ok. I guess as kind of a quick fix the changes here are good to merge. But we really should improve the encoding strategy in one of the next releases if possible. I'm sure we will run into other issues with this technical dept otherwise. ping @tsteur

@diosmosis
Copy link
Member Author

@tsteur can you take a look at this PR?

@tsteur
Copy link
Member

tsteur commented Jan 15, 2021

@diosmosis will this break BC in any way?

I've looked at the changes but it's hard to tell what it does and things seem to get quite complicated with all the encodings. It's quite hard to understand this all.

Is there maybe a way we can achieve the same fix within the segment class? I wonder if we could do like in the DB a query whether we know a certain segment definition and if not found, we urlencode the definition and check again if there's an existing definition and if not found we may try urldecode() and if then not found we use the original hash? That be quite time consuming maybe and maybe we'd need to keep a cache of all known segment definitions. Just a random thought and might make things not better.

It's a bit hard to understand what this issue fixes exactly and how etc. Is it otherwise maybe also possible to add a test? or the removed parts in the test in this PR are basically the proof this works?

@diosmosis
Copy link
Member Author

@tsteur I couldn't make a system test like ArchiveCronTest fail to demonstrate the bug, the best I could do is create a new test that demonstrates how the hash is different. In a new system test for the SegmentEditor plugin, we create a segment via an HTTP request, then get an API method that returns the hash and add a hash to the SegmentEditor.getAll return. They should be the same, but if we don't urlencode the segment definition in SegmentEditor.getAll, they won't be.

Note: there are also some changes to ArchiveCronTest. I noticed it wasn't 100% correct. The new expected test file changes are unrelated to this issue (it has to do w/ saving the config changes re enforce no browser archiving; the "no auto archive" segment should still be auto archived).

@tsteur
Copy link
Member

tsteur commented Jan 17, 2021

@diosmosis Did you have a chance to check if a fix in the segment constructor would be maybe a solution where we try to find a matching segment definition from the list of defined segments? I reckon this might work more reliably then maybe.

BTW was this already an issue on Matomo 3 or how come it's a problem now? Do we know why this changed? Just checking because getting confused by all the URL encodes etc.

@diosmosis
Copy link
Member Author

@tsteur

Did you have a chance to check if a fix in the segment constructor would be maybe a solution where we try to find a matching segment definition from the list of defined segments? I reckon this might work more reliably then maybe.

How would this fix the issue? That sounds unrelated.

BTW was this already an issue on Matomo 3 or how come it's a problem now? Do we know why this changed? Just checking because getting confused by all the URL encodes etc.

Usually we just append the definition to a URL (eg, climulti, scheduled reports), and in so doing we have to urlencode. This problem has to do w/ the hash, so constructing a Segment w/ the urldecoded definition is the problem. We do this a lot more now in core:archive since we check, eg, if an archive exists already + other things.

@tsteur
Copy link
Member

tsteur commented Jan 18, 2021

I was thinking something like

diff --git a/core/Segment.php b/core/Segment.php
index 20761ffa17..58cf0b0305 100644
--- a/core/Segment.php
+++ b/core/Segment.php
@@ -118,6 +118,22 @@ class Segment
      */
     public function __construct($segmentCondition, $idSites, Date $startDate = null, Date $endDate = null)
     {
+        // would need to see how to solve it for `getSegmentHash`... maybe that method shouldn't be static...
+        $existingDefinitions = $this->getAllSegmentDefinitionsForSites($idSites);
+        $variations = [
+            $segmentCondition,
+            urlencode($segmentCondition),
+            urldecode($segmentCondition), // maybe not that one
+        ];
+        foreach ($variations as $segmentVariation) {
+            foreach ($existingDefinitions as $existingDefinition) {
+                if ($existingDefinition === $segmentCondition) {
+                    $segmentCondition = $segmentVariation;
+                    break 2;
+                }
+            }
+        }
+
         $this->segmentQueryBuilder = StaticContainer::get('Piwik\DataAccess\LogQueryBuilder');
 

This way we don't need to know when to pass it urlencoded and when not. It would always prefer the variation that is known from the segment table (if any matches).

@diosmosis
Copy link
Member Author

This way we don't need to know when to pass it urlencoded and when not.

The point of this issue is we ALWAYS pass it urlencoded if getting from a segment definition. Checking for random variations doesn't work, because the definition in the DB is not the correct version.

@tsteur
Copy link
Member

tsteur commented Jan 18, 2021

Checking for random variations doesn't work, because the definition in the DB is not the correct version.

I suppose the same would then apply when someone passes it from the UI? Just checking if this could maybe fix the issue for many cases or whether it would still be an issue depending on how a developer sends the segment string in an API request etc

@diosmosis
Copy link
Member Author

I suppose the same would then apply when someone passes it from the UI?

This isn't about segments passed from the UI, those are always "correct". The problem is (as described above), is that these values are used "raw" in API methods, as in w/o the urldecoding that happens to $_GET entries. Whereas, this is not true for definition parameter used to add segments, so the definition is urldecoded in the segment table. They are functionally equivalent, but the hash changes which causes subtle problems.

To understand, run the test I wrote. Then in SegmentEditor.getAll, remove the urldecode in new Segment(urldecode($segment['definition'])). Run it again, it will fail. If you understand why it fails, you'll understand the issue.

depending on how a developer sends the segment string in an API request

developers sending segments in the API should match how the UI sends them, otherwise the hash will be different and there will be problems. And currently, if they send what is in the definition w/o urlencoding it properly, the hash will be different than what is archived and it will not select any data.

But this is kind of impossible since if developers use the segment table definition in a URL, they will have to urlencode($segment['definition']), which is what the UI sends.

@tsteur
Copy link
Member

tsteur commented Jan 21, 2021

I guess the biggest problem I'm seeing is that we have like close to 180 instance creations of the segment class and it's quite hard to understand when a urlencode is needed and when not. Maybe we could somehow at least document when it's needed if it can be put in a "rule"? Also developers might often not set the segment based on the segment table definition but they construct it themselves. Although it seems to be able to handle a lot of different ways now and it's likely fine (still wondering though if there are now some formats that won't work anymore but probably not). I was hoping we could somehow take some complexity out by not needing to know when to pass it urlencoded and when not and instead detect it automatically.

image

If someone was using SegmentEditor.getAll and did an URL encode on their side, would it cause any new issues after merging this PR? We're doing this for example in Matomo Mobile. Would "Segment.get" API also need to urlencode the definition? I see eg the API now returns browserCode==%3C%27%3E%2F%22%23%24I*%25 and URL encoding it again would result in browserCode%3D%3D%253C%2527%253E%252F%2522%2523%2524I*%2525. It still seems to work because we're supporting it triple encoded. I adjusted the test you wrote and tested different encodings and I think it looks good.

I see though generally what you're meaning and I adjusted the tests you wrote many times and can see it seems to work and handle various things. Just finding it unclear when to use the urlencode and when not and wonder if there eg many other uses where we're using it wrong or not etc. It's of course a general problem with how segment definitions were implemented in the beginning.

It's also now SegmentEditor.getAll does the urlencode while some other segment model methods like getSegmentsToAutoArchive don't do the urlencode (which is probably why we're needing it in cronarchive) and we're basically needing to pay attention whether some method previously did the url encode or not. Of course we can't easily change it everywhere as there could be again too many regressions I suppose.

Can we maybe document when we need to use new Segment($definition, ...) and when to use new Segment(urlencode*($definition), ...)?

@diosmosis
Copy link
Member Author

I guess the biggest problem I'm seeing is that we have like close to 180 instance creations of the segment class and it's quite hard to understand when a urlencode is needed and when not.

Yes, this is a challenge. This PR is definitely a quick fix, ideally we would put in a more normalized encoding strategy (in the linked issue). The "rule" would be, if the segment is sourced from the segment table or from SegmentEditor.getAll, then it should be encoded before being used in the Segment class. If there is a lot of passing the string around before creating a Segment, then it would be hard to figure out what to use.

If someone was using SegmentEditor.getAll and did an URL encode on their side, would it cause any new issues after merging this PR?

No, this doesn't change the output of API methods to keep BC (and since several features seem to need it to stay that way).

It's also now SegmentEditor.getAll does the urlencode while some other segment model methods like getSegmentsToAutoArchive don't do the urlencode (which is probably why we're needing it in cronarchive) and we're basically needing to pay attention whether some method previously did the url encode or not.

Not sure about this? If I look at the PHP code getAll just returns the definition in the table. If you mean the code I added, that's JUST for calculating the hash used.

Can we maybe document when we need to use new Segment($definition, ...) and when to use new Segment(urlencode*($definition), ...)?

We can document (though I'm not sure where). It's use urlencode() when the segment comes from the segment table's definition column, and don't use it if getting the segment from Request::getRawSegmentFromRequest(). The reason being Request::getRawSegmentFromRequest() get's it from REQUEST_URI, whereas when saving a segment, SegmentEditor.add gets the definition from $_GET/$_POST.

Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for mentioning the "rule". I created https://github.com/matomo-org/developer-documentation/pull/416/files for now to document it there and also added the overall encoding strategy. Can you have a look? If more comes to your mind feel free to add more documentation.

Not sure about this? If I look at the PHP code getAll just returns the definition in the table. If you mean the code I added, that's JUST for calculating the hash used.

👍 I realise this only now

Also left one other comment. Ideally we would have calculated the segment hash based on how it's defined in the segment table but this would have needed to be implemented from the beginning.


public function getSegmentHash($idSite, $segment)
{
$segment = new Segment($segment, [(int) $idSite]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi do we maybe need to check for permissions here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the exampleplugin, so i guess it's not important, but it might be better to have a good demonstration, so I added it

@tsteur
Copy link
Member

tsteur commented Jan 22, 2021

also be great to know what a segment would look like with a $ in the URL and how this would be constructed in PHP the segment if someone was to create a segment definition using PHP.

@diosmosis
Copy link
Member Author

@tsteur ok to merge this?

@diosmosis
Copy link
Member Author

ping @tsteur

@tsteur
Copy link
Member

tsteur commented Feb 9, 2021

@diosmosis yes 👍

@diosmosis diosmosis merged commit 4196412 into 4.x-dev Feb 17, 2021
@diosmosis diosmosis deleted the segment-definition-fix branch February 17, 2021 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review PRs that need a code review Regression Indicates a feature used to work in a certain way but it no longer does even though it should.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants