@diosmosis opened this Pull Request on December 30th 2020 Member

Description:

This PR fixes an inconsistency in segment hashing. When sending the segment as a query parameter (ie, segment=), we always read it from $_SERVER['QUERY_STRING'], using the raw value which is urlencoded. When saving the segment in the segment table, however, we send a different query param, definition=. This parameter is not treated the same way, we don't get the raw value for it, but a urldecoded value. Which means every definition value in the segment table is not the same value that is sent by the segment selector and used in API requests. So, if we use the segment definition as is, the hash can be different then the hash with the UI query parameter (if there is an encoded value within a value).

Currently this is causing problems in archiving and invalidation, since we construct Segment instances using the stored segment definition directly. The mismatch means data is either not archived or is archived with an improper hash, depending on the segment.

This is fixed here by always doing a urlencode on the $segment['definition'] property when creating a Segment instance or getting a segment hash. We can't just urldecode the definition when querying the segment, since it breaks other features that depend on it being decoded (and, eg, encoding it before putting it into a URL). It could also be seen as a BC break.

Review

  • [ ] Functional review done
  • [ ] Usability review done (is anything maybe unclear or think about anything that would cause people to reach out to support)
  • [ ] Security review done see checklist
  • [ ] Code review done
  • [ ] Tests were added if useful/possible
  • [ ] Reviewed for breaking changes
  • [ ] Developer changelog updated if needed
  • [ ] Documentation added if needed
  • [ ] Existing documentation updated if needed
@sgiehl commented on January 4th 2021 Member

@diosmosis but isn't the segment object able to handle encoded and decoded segment definitions here: https://github.com/matomo-org/matomo/blob/1b2769d0de5288906abaa3711e9afc372cfe479a/core/Segment.php#L143-L165
Would it make more sense to have isSegmentEncoded accessible and use that to get the proper definition where it needs to be urlencoded?

@diosmosis commented on January 4th 2021 Member

but isn't the segment object able to handle encoded and decoded segment definitions here

@sgiehl That is not what this issue is about, this is about the getHash() function, it changes based on the content of the segment, and because of matomo's inexplicably complicated way to url encode a segment, the definition stored in the database is not the same as what is sent in the URL in the UI. You can read more in the issue description.

@sgiehl commented on January 5th 2021 Member

@diosmosis Adjusting the segment hash methods doesn't work in this case? Maybe something like

    public function getHash()
    {
        if (empty($this->string)) {
            return '';
        }
        return self::getSegmentHash($this->string, $this->isSegmentEncoded);
    }

    public static function getSegmentHash($definition, $isEncoded=true)
    {
        return $isEncoded ? md5(urldecode($definition)) : md5($definition);
    }
@diosmosis commented on January 5th 2021 Member

@sgiehl no, because segments like pageUrl%3D<a class='mention' href='https://github.com/http'>@http</a>%25253A%25252F%25252Fserenity.org%25252Fdocs%25252Fjavascript-tracking%25252F and
pageUrl=<a class='mention' href='https://github.com/http'>@http</a>%253A%252F%252Fserenity.org%252Fdocs%252Fjavascript-tracking%252F are the same and are both "encoded", but have different hashes when used in Segment. Ideally, we'd just fix segment encoding so it wasn't so complicated, but that's not on the milestone.

@sgiehl commented on January 5th 2021 Member

Ideally, we'd just fix segment encoding so it wasn't so complicated, but that's not on the milestone.

Do we have an issue for that? If not maybe we should create one.

@diosmosis commented on January 5th 2021 Member

Do we have an issue for that? If not maybe we should create one.|

I don't think there's an agreed upon solution, but I can make an issue.

@sgiehl commented on January 7th 2021 Member

Ok. I guess as kind of a quick fix the changes here are good to merge. But we really should improve the encoding strategy in one of the next releases if possible. I'm sure we will run into other issues with this technical dept otherwise. ping @tsteur

@diosmosis commented on January 14th 2021 Member

@tsteur can you take a look at this PR?

@tsteur commented on January 15th 2021 Member

@diosmosis will this break BC in any way?

I've looked at the changes but it's hard to tell what it does and things seem to get quite complicated with all the encodings. It's quite hard to understand this all.

Is there maybe a way we can achieve the same fix within the segment class? I wonder if we could do like in the DB a query whether we know a certain segment definition and if not found, we urlencode the definition and check again if there's an existing definition and if not found we may try urldecode() and if then not found we use the original hash? That be quite time consuming maybe and maybe we'd need to keep a cache of all known segment definitions. Just a random thought and might make things not better.

It's a bit hard to understand what this issue fixes exactly and how etc. Is it otherwise maybe also possible to add a test? or the removed parts in the test in this PR are basically the proof this works?

@diosmosis commented on January 17th 2021 Member

@tsteur I couldn't make a system test like ArchiveCronTest fail to demonstrate the bug, the best I could do is create a new test that demonstrates how the hash is different. In a new system test for the SegmentEditor plugin, we create a segment via an HTTP request, then get an API method that returns the hash and add a hash to the SegmentEditor.getAll return. They should be the same, but if we don't urlencode the segment definition in SegmentEditor.getAll, they won't be.

Note: there are also some changes to ArchiveCronTest. I noticed it wasn't 100% correct. The new expected test file changes are unrelated to this issue (it has to do w/ saving the config changes re enforce no browser archiving; the "no auto archive" segment should still be auto archived).

@tsteur commented on January 17th 2021 Member

@diosmosis Did you have a chance to check if a fix in the segment constructor would be maybe a solution where we try to find a matching segment definition from the list of defined segments? I reckon this might work more reliably then maybe.

BTW was this already an issue on Matomo 3 or how come it's a problem now? Do we know why this changed? Just checking because getting confused by all the URL encodes etc.

@diosmosis commented on January 17th 2021 Member

@tsteur

Did you have a chance to check if a fix in the segment constructor would be maybe a solution where we try to find a matching segment definition from the list of defined segments? I reckon this might work more reliably then maybe.

How would this fix the issue? That sounds unrelated.

BTW was this already an issue on Matomo 3 or how come it's a problem now? Do we know why this changed? Just checking because getting confused by all the URL encodes etc.

Usually we just append the definition to a URL (eg, climulti, scheduled reports), and in so doing we have to urlencode. This problem has to do w/ the hash, so constructing a Segment w/ the urldecoded definition is the problem. We do this a lot more now in core:archive since we check, eg, if an archive exists already + other things.

@tsteur commented on January 18th 2021 Member

I was thinking something like

diff --git a/core/Segment.php b/core/Segment.php
index 20761ffa17..58cf0b0305 100644
--- a/core/Segment.php
+++ b/core/Segment.php
@@ -118,6 +118,22 @@ class Segment
      */
     public function __construct($segmentCondition, $idSites, Date $startDate = null, Date $endDate = null)
     {
+        // would need to see how to solve it for `getSegmentHash`... maybe that method shouldn't be static...
+        $existingDefinitions = $this->getAllSegmentDefinitionsForSites($idSites);
+        $variations = [
+            $segmentCondition,
+            urlencode($segmentCondition),
+            urldecode($segmentCondition), // maybe not that one
+        ];
+        foreach ($variations as $segmentVariation) {
+            foreach ($existingDefinitions as $existingDefinition) {
+                if ($existingDefinition === $segmentCondition) {
+                    $segmentCondition = $segmentVariation;
+                    break 2;
+                }
+            }
+        }
+
         $this->segmentQueryBuilder = StaticContainer::get('Piwik\DataAccess\LogQueryBuilder');

This way we don't need to know when to pass it urlencoded and when not. It would always prefer the variation that is known from the segment table (if any matches).

@diosmosis commented on January 18th 2021 Member

This way we don't need to know when to pass it urlencoded and when not.

The point of this issue is we ALWAYS pass it urlencoded if getting from a segment definition. Checking for random variations doesn't work, because the definition in the DB is not the correct version.

@tsteur commented on January 18th 2021 Member

Checking for random variations doesn't work, because the definition in the DB is not the correct version.

I suppose the same would then apply when someone passes it from the UI? Just checking if this could maybe fix the issue for many cases or whether it would still be an issue depending on how a developer sends the segment string in an API request etc

@diosmosis commented on January 19th 2021 Member

I suppose the same would then apply when someone passes it from the UI?

This isn't about segments passed from the UI, those are always "correct". The problem is (as described above), is that these values are used "raw" in API methods, as in w/o the urldecoding that happens to $_GET entries. Whereas, this is not true for definition parameter used to add segments, so the definition is urldecoded in the segment table. They are functionally equivalent, but the hash changes which causes subtle problems.

To understand, run the test I wrote. Then in SegmentEditor.getAll, remove the urldecode in new Segment(urldecode($segment['definition'])). Run it again, it will fail. If you understand why it fails, you'll understand the issue.

depending on how a developer sends the segment string in an API request

developers sending segments in the API should match how the UI sends them, otherwise the hash will be different and there will be problems. And currently, if they send what is in the definition w/o urlencoding it properly, the hash will be different than what is archived and it will not select any data.

But this is kind of impossible since if developers use the segment table definition in a URL, they will have to urlencode($segment['definition']), which is what the UI sends.

@tsteur commented on January 21st 2021 Member

I guess the biggest problem I'm seeing is that we have like close to 180 instance creations of the segment class and it's quite hard to understand when a urlencode is needed and when not. Maybe we could somehow at least document when it's needed if it can be put in a "rule"? Also developers might often not set the segment based on the segment table definition but they construct it themselves. Although it seems to be able to handle a lot of different ways now and it's likely fine (still wondering though if there are now some formats that won't work anymore but probably not). I was hoping we could somehow take some complexity out by not needing to know when to pass it urlencoded and when not and instead detect it automatically.

image

If someone was using SegmentEditor.getAll and did an URL encode on their side, would it cause any new issues after merging this PR? We're doing this for example in Matomo Mobile. Would "Segment.get" API also need to urlencode the definition? I see eg the API now returns browserCode==%3C%27%3E%2F%22%23%24I*%25 and URL encoding it again would result in browserCode%3D%3D%253C%2527%253E%252F%2522%2523%2524I*%2525. It still seems to work because we're supporting it triple encoded. I adjusted the test you wrote and tested different encodings and I think it looks good.

I see though generally what you're meaning and I adjusted the tests you wrote many times and can see it seems to work and handle various things. Just finding it unclear when to use the urlencode and when not and wonder if there eg many other uses where we're using it wrong or not etc. It's of course a general problem with how segment definitions were implemented in the beginning.

It's also now SegmentEditor.getAll does the urlencode while some other segment model methods like getSegmentsToAutoArchive don't do the urlencode (which is probably why we're needing it in cronarchive) and we're basically needing to pay attention whether some method previously did the url encode or not. Of course we can't easily change it everywhere as there could be again too many regressions I suppose.

Can we maybe document when we need to use new Segment($definition, ...) and when to use new Segment(urlencode*($definition), ...)?

@diosmosis commented on January 21st 2021 Member

I guess the biggest problem I'm seeing is that we have like close to 180 instance creations of the segment class and it's quite hard to understand when a urlencode is needed and when not.

Yes, this is a challenge. This PR is definitely a quick fix, ideally we would put in a more normalized encoding strategy (in the linked issue). The "rule" would be, if the segment is sourced from the segment table or from SegmentEditor.getAll, then it should be encoded before being used in the Segment class. If there is a lot of passing the string around before creating a Segment, then it would be hard to figure out what to use.

If someone was using SegmentEditor.getAll and did an URL encode on their side, would it cause any new issues after merging this PR?

No, this doesn't change the output of API methods to keep BC (and since several features seem to need it to stay that way).

It's also now SegmentEditor.getAll does the urlencode while some other segment model methods like getSegmentsToAutoArchive don't do the urlencode (which is probably why we're needing it in cronarchive) and we're basically needing to pay attention whether some method previously did the url encode or not.

Not sure about this? If I look at the PHP code getAll just returns the definition in the table. If you mean the code I added, that's JUST for calculating the hash used.

Can we maybe document when we need to use new Segment($definition, ...) and when to use new Segment(urlencode*($definition), ...)?

We can document (though I'm not sure where). It's use urlencode() when the segment comes from the segment table's definition column, and don't use it if getting the segment from Request::getRawSegmentFromRequest(). The reason being Request::getRawSegmentFromRequest() get's it from REQUEST_URI, whereas when saving a segment, SegmentEditor.add gets the definition from $_GET/$_POST.

Powered by GitHub Issue Mirror