Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure current url is not decoded too often #16628

Open
wants to merge 2 commits into
base: 4.x-dev
Choose a base branch
from
Open

Ensure current url is not decoded too often #16628

wants to merge 2 commits into from

Conversation

sgiehl
Copy link
Member

@sgiehl sgiehl commented Oct 29, 2020

Let's assume we have the search url on the marketplace containing a & in search query (see #14715):
https://plugins.matomo.org/search?query=Heatmap+%26+session+recording&post_type=product

Currently Matomo tracker js would run a safeDecodeWrapper on it. So the resulting url, that would be used for tracking would be
https://plugins.matomo.org/search?query=Heatmap+&+session+recording&post_type=product, which is actually not the same

fixes #14715

@sgiehl sgiehl added not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. Needs Review PRs that need a code review labels Oct 29, 2020
@sgiehl sgiehl added this to the 4.1.0 milestone Oct 29, 2020
@sgiehl sgiehl changed the title Ensure current url is not decoded to often Ensure current url is not decoded too often Oct 29, 2020
@tsteur
Copy link
Member

tsteur commented Oct 29, 2020

We might need to check if there's any other way to fix this as this was added because of #8413 in #8510

Anyway, let's for now focus on Matomo 4 release etc.

@sgiehl
Copy link
Member Author

sgiehl commented Dec 21, 2020

build js

@sgiehl
Copy link
Member Author

sgiehl commented Dec 21, 2020

@tsteur This might actually fix some more issues that are related to encoding and page charset.
It shouldn't make any problems like #8413 or #8510, as before decodeURIComponant was used, which throwed the error. Now the url would not be decoded. I'm actually still wondering why that should be needed at all. Are there any cases where the location object might hold any values that are double encoded?

@tsteur
Copy link
Member

tsteur commented Dec 21, 2020

@sgiehl I think the original urldecode was added because of #7220 and maybe also #7218 . If we now removed that method, we'd have a problem that URL's would be likely stored URL encoded in the actions meaning it will create a new ID action for every URL and it can cause a lot of problems eg if some reports don't group by URL etc. Might need to test this but there's a chance we can't merge this.

@mattab mattab modified the milestones: 4.1.0, 4.2.0 Dec 21, 2020
@sgiehl
Copy link
Member Author

sgiehl commented Dec 22, 2020

I guess since #7220 a lot has happened in browser development. Seems the url and referrer are no longer url encoded in modern browsers

@tsteur
Copy link
Member

tsteur commented Dec 22, 2020

@sgiehl do you have maybe some reference to that where browsers changed that? I just tested with some URLs and it seems to be still URL encoded. I would assume it could break a lot of things if browsers simply changed that but it's possible they did.

eg viewing /test.html?x=34L#$3L#4L1>124:56:7&test then some characters were encoded
image

@sgiehl
Copy link
Member Author

sgiehl commented Dec 23, 2020

Ok. Now I understand what you are referring to. That's still normal that some special characters are encoded. But it's actually not the full url. So we shouldn't manipulate the complete url using any decodeWrapper.
Let's assume you request something like test.html?q=test%26q%3Dp, which is a valid request with one query parameter.
After decoding it you get test.html?q=test&q=p, which is obviously something else.

@mattab mattab modified the milestones: 4.2.0, 4.3.0 Feb 22, 2021
@mattab mattab modified the milestones: 4.3.0, 4.4.0 May 26, 2021
@diosmosis
Copy link
Member

@sgiehl I tested this PR against 4.x-dev using matomo-org/test-examples#9 to track encoded urls and noticed there are changes in what's tracked now.

For ISO-8859-1 encoded pages:

encoding1

The first action is tracked via 4.x-dev, and the second via this branch.

For other URL encoding:

(4.x-dev encoding)
4 x-dev-urlencoding-test

(this branch encoding)
fixurl-encoding-test

In one case it fixes the bug referenced in the PR, but for others it changes what's tracked. I'm not sure exactly what's ok to change (cc @tsteur), but it might break some segments if we just merge this.

I didn't test referrer URL tracking, but that would be affected too.

@sgiehl
Copy link
Member Author

sgiehl commented May 27, 2021

I'm actually not sure regarding this. But a special character like ö in an URI is actually not correct and even a browser would convert it to %C3%B6 in requests, so the tracked url would actually be correct. But (like most browsers) we should maybe show the readable value in UI.
Another possibility would be to adjust the JS so only characters not representing a url reserverd character will be decoded. That would mean we would need to find a way to not decode %26 to & and maybe also %2B to +. But not sure if that would fix all potential problems with the url manipulation we are currently doing

@diosmosis
Copy link
Member

I think for a bug fix it would be better to maintain existing behavior if possible. We can't know exactly how much might be affected downstream, it could be more than just what is displayed to the user (eg, segments no longer matching).

I'm also noticing the first test doesn't save the file in test-examples in ISO-8859-1. Not sure exactly how to do that.

@tsteur
Copy link
Member

tsteur commented Jul 22, 2021

@sgiehl as in #15060 the user mentions it's only a problem for site search. Is there a chance to only change this behaviour for site search but not the entire URL? (not needed to look into it for long but maybe you know )

@sgiehl
Copy link
Member Author

sgiehl commented Jul 23, 2021

@tsteur don't think that is possible. We parse the page url to check if some parameters for site search are contained, so if we don't change the tracked page url, site search will stay the same buggy way.

@tsteur tsteur modified the milestones: 4.4.0, 4.7.0 Jul 25, 2021
@tsteur
Copy link
Member

tsteur commented Jul 25, 2021

👍 thanks. moving the issue to 4.7 for now as this current fix would likely introduce quite a few regressions (I remember when we added them and there were various regressions re encoding). We'll look into it again when it's a higher priority.

@justinvelluppillai
Copy link
Contributor

@tsteur milestone 4.7.0 has arrived, are you wanting this in this milestone now or still deferring for later?

@justinvelluppillai
Copy link
Contributor

hmm looks like maybe it is in 4.8.0 since the milestone got renamed or something, so still deferred 👍

@justinvelluppillai justinvelluppillai removed the Needs Review PRs that need a code review label Dec 1, 2021
@tsteur
Copy link
Member

tsteur commented Jan 13, 2022

@sgiehl be great to explore some other options if there are other ways to fix this in 4.8 milestone without possible regressions

@Anisatum
Copy link

@sgiehl be great to explore some other options if there are other ways to fix this in 4.8 milestone without possible regressions

As far as I can tell there's no way to fix it without affecting the logic that was in effect until now. Any custom logic to try and recreate the original encoding would be pretty complex, and likely to introduce new bugs in some unforeseen special case.

The only way out that comes to mind is adding a setting so each user can decide whether they want this fix applied (off by default for existing instances of Matomo, on by default when a new one is created).

@Anisatum
Copy link

Anisatum commented Feb 14, 2024

So in the end I agree with @sgiehl that decoding by default is wrong, as it will necessarily lead to the loss of information, if a URL has encoded components, but I also agree that a bug that remains unfixed for long enough becomes expected behavior, and a bug fix shouldn't change it too much. I figured the best way to do this would be to create a tracker plugin, this way if someone wants to keep the old behavior in it's entirety, they can just not install it / keep it disabled. A proof of concept is available in the feature/disable_url_decoding branch of my fork.

The decode-disabling can be triggered either from Javascript with a _paq.push(['DisableUrlDecoding.doNotDecode']); or globally with a SystemSetting. The Javascript tracker overrides the SystemSetting, so there is also a _paq.push(['DisableUrlDecoding.doDecode']); function for the case that decoding is disabled in SystemSettings, but the user wants re-enable it in specific conditions.

Right now all there is in the settings, is a simple on/off switch, but I was thinking of providing extra options, like:

  • Only disable for URLs matching a specific Regex
  • Only disable the decoding of specific query parameters

But I wanted to get some feedback before I go ahead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Data Integrity & Accuracy not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. Technical debt Issues the will help to reduce technical debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Internal site search decoding bug Site search is truncating the strings which is placed after ampersand (&)
6 participants