Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix url garbled issue #7866

Closed

Conversation

saleemkce
Copy link
Contributor

Fix for issue #5014

@mattab I have already fixed this issue in one of my previous commits here (then closed): #7786 but unfortunately 3 or more others files got incorrectly merged into my commit branch.

So, this time I have overwritten those 3 files taken directly from remote master. So they are all now gone. But still when you look this file libs/pChart2.1.3/class/pImage.class.php in this commit, even though I checked this file from remote master, the first line looks like this

+<?php

That is, it tells that as though I have modified first line of libs/pChart2.1.3/class/pImage.class.php file which in fact I did not. Apart from this, this is a successful big fix (already described about the pain and time it took to shoot out the bug) and a clean commit. What could we do for this?

@saleemkce
Copy link
Contributor Author

Sorry, I forgot to mention how data is correctly saved in mysql tables. Here is a snapshot showing characters from Chinese, Arabic and English in database tables.

dataintable

@tsteur
Copy link
Member

tsteur commented May 11, 2015

I tested it with the example code in #5014 (comment) and this change did not fix it for me in Chrome 42 + Mac OSX.

What did fix the issue for me was to replace urldecode (which is actually unescape) with decodeURI here: https://github.com/piwik/piwik/blob/2.13.1/js/piwik.js#L3861

I'm not sure if it has any side effects though and I'm not even sure why urldecode is there at all.

For a difference of those methods see http://xkr.us/articles/javascript/encode-compare/ . Eg Chrome and Mozilla seem to struggle with Unicode characters when using unesacpe. They also treat the + character a bit different I think.

We should also be aware that unescape was deprecated see eg https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/unescape so it might be actually broken in some browsers.

If we change to decodeURI we should maybe make sure it produces more or less the same value for most urls as we otherwise start generating new actions in piwik_log_action table.

@saleemkce
Copy link
Contributor Author

@tsteur okay, let me check it today. In the meantime, could you check if this fix help resolve the issue in the path dashboard->actions->outlinks-> URL lists? (http://issues.piwik.org/attachments/5014/Piwik.png)

@saleemkce
Copy link
Contributor Author

@tsteur oh great. I got time today to make a detailed analysis about what happens in the background. Finally, I have the details. Actually, my fix solves the bug.

You and @mattab had used the following snippet to reproduce the bug, right?

<html><head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 </head>
 <body>
 <a href="http://example.org/2013/06/%D8%A5%D9%86%D8%B4%D8%A7%D8%A1-%D9%86%D8%B3%D8%AE%D8%A9-%D9%85%D8%AE%D8%B5%D8%B5%D8%A9-%D9%86%D8%B8%D8%A7%D9%85-%D8%AA%D8%B4%D8%BA%D9%8A%D9%84-%D8%A3%D9%88%D8%A8%D9%86%D8%AA%D9%88-%D8%A8/">it works (not)</a>

 <!-- Piwik -->
<script type="text/javascript">
  var _paq = _paq || [];
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u=(("https:" == document.location.protocol) ? "https" : "http") + "://localhost/piwik-master/";
    _paq.push(['setTrackerUrl', u+'piwik.php']);
    _paq.push(['setSiteId', 1]);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.type='text/javascript';
    g.defer=true; g.async=true; g.src=u+'piwik.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<noscript><p><img src="http://localhost/piwik-master/piwik.php?idsite=1" style="border:0;" alt="" /></p></noscript>
<!-- End Piwik Code -->

If you look at this snippet, you had all used the link in raw garbled form like

<a href="http://example.org/2013/06/%D8%A5%D9%86%D8%B4%D8%A7%D8%A1-%D9%86%D8%B3%D8%AE%D8%A9-%D9%85%D8%AE%D8%B5%D8%B5%D8%A9-%D9%86%D8%B8%D8%A7%D9%85-%D8%AA%D8%B4%D8%BA%D9%8A%D9%84-%D8%A3%D9%88%D8%A8%D9%86%D8%AA%D9%88-%D8%A8/">it works (not)</a>

Yes, this is a huge collection of garbage representing the arabic words => إنشاء-نسخة-مخصصة-نظام-تشغيل-أوبنتو-ب

The problem is, we all earlier reproduced the bug with this crappy anchor tag which is incorrect. The correct way would have been like this,

<a target="_blank" href="http://example.org/2013/06/إنشاء-نسخة-مخصصة-نظام-تشغيل-أوبنتو-ب">it works (not)</a>

The important thing, the js tracker code reads the text in its raw form. I.e if we use anchor tag like the example two above, the js tracker code reads the data in its raw form (no crap, no garbled form ) and sends it to database purely as it is. This is what we all want. Now, the screenshots and proofs.

  • When I used the anchor tag (example 2) above, I got the following data in browser console. Works great in all major browsers,

correct_way_to_save_url_data

  • In mysql database, everything looks fine,

correct_way_to_save_url_data_in_db

In this screenshot, please note id 81 and 82.
Id 81 - The data is crappy and incorrect when we use anchor tag 1 above.
Id 82 - The data is correct and no garbling takes places because it gets data from anchor tag 2.
Verified at following breadcrumbs (looks correct elsewhere too),

  • ->places: actions->outlinks
  • ->visitiors->referrs->websites
  • ->dashboard->detailed visitor logs

I have been playing my hands dirt with URLs in PHP/JS for more than 3 years. So, it doesn't amaze me. Hope, everything is clear now.

One quick request: please take time to work on this and make sure to move it to core sooner rather than later.

@tsteur
Copy link
Member

tsteur commented May 12, 2015

I tested with the following link:

<a target="_blank" href="http://example.org/2013/06/إنشاء-نسخة-مخصصة-نظام-تشغيل-أوبنتو-ب">it works (not)</a>

and this PR fixes this case indeed. Possibly we should also replace sourceElement.href with href here: https://github.com/piwik/piwik/blob/2.13.1/js/piwik.js#L3312 . What do you think @saleemkce ? Probably not needed but might be good to replace as well.

To finally resolve #5014 and to handle links like this:

 <a href="http://example.org/2013/06/%D8%A5%D9%86%D8%B4%D8%A7%D8%A1-%D9%86%D8%B3%D8%AE%D8%A9-%D9%85%D8%AE%D8%B5%D8%B5%D8%A9-%D9%86%D8%B8%D8%A7%D9%85-%D8%AA%D8%B4%D8%BA%D9%8A%D9%84-%D8%A3%D9%88%D8%A8%D9%86%D8%AA%D9%88-%D8%A8/">it works (not)</a>

we need to replace urldecode with decodeURI. We can do this in another pull request.

@saleemkce
Copy link
Contributor Author

I didn't find any issue with this line number (piwik.js#L3312 ) when I was digging and analyzing line by line. I think, we probably don't need a change here. But, if you are sure that the change will be helpful & impacting, we can do this as you suggest.

Scenario 1:

To finally resolve #5014 and to handle links like this:

<a href="http://example.org/2013/06/%D8%A5%D9%86%D8%B4%D8%A7%D8%A1-%D9%86%D8%B3%D8%AE%D8%A9-%D9%85%D8%AE%D8%B5%D8%B5%D8%A9-%D9%86%D8%B8%D8%A7%D9%85-%D8%AA%D8%B4%D8%BA%D9%8A%D9%84-%D8%A3%D9%88%D8%A8%D9%86%D8%AA%D9%88-%D8%A8/">it works (not)</a>

we need to replace urldecode with decodeURI. We can do this in another pull request.

For this scenario 1, I don't have any fire-sure idea. Because URL should be captured in raw(not encoded) form. Even if you are going to capture this kind of URLs, it would be good to convert them to raw form before saving it in db. If we are going to address scenario 1, we should try out all possible forms of encoded URL in HTML page. We could address it in separate PR as you suggest after 365 degree analysis.

@tsteur
Copy link
Member

tsteur commented May 13, 2015

Yeah we would address the second case in a separate issue. We'd need to replace the unescape anyway as it is a deprecated method and it could suddenly stop working in some browsers.

Good to merge for me but looks like it needs a rebase.

@mattab
Copy link
Member

mattab commented May 13, 2015

Hi @saleemkce - here is a suggestion: do you mind squashing your 6 commits into one? see for example this blog post that explains how to do it: http://eli.thegreenplace.net/2014/02/19/squashing-github-pull-requests-into-a-single-commit/ - the advantage of squashing commits is that it will keep the git commit history clean and readable.

Thanks for the pull request!

@saleemkce saleemkce force-pushed the bugfix-url-garbled-issue branch 2 times, most recently from 2393aa9 to 5497fa1 Compare May 14, 2015 16:42
removed unwanted files from url-garbage-issue branch
@saleemkce
Copy link
Contributor Author

@mattab @tsteur Thanks a lot. I followed the link (http://eli.thegreenplace.net/2014/02/19/squashing-github-pull-requests-into-a-single-commit/) and did rebase correctly (even though I messed a lot in my first 6 or 7 attempts). Additionally I used this link http://makandracards.com/makandra/527-squash-several-git-commits-into-a-single-commit to spot out the difference between between 'sqash' & 'pick' to finally get the interactive rebase done successfully & correctly(I hope). I don't know clearly why it still shows 4 commits. Maybe this happened due to my earlier mistakes 10 days ago. Please don't suggest me to rebase one more time because I have already done several times. Forgive me this time, let me do this better in upcoming commits. Hopefully, I think you could merge now...

Edit, if there are any unpardonable mistakes I have currently done in this rebase, guide me. I will start rectifying immediately.

@mattab
Copy link
Member

mattab commented May 14, 2015

Hi @saleemkce - no worries. at this stage though, I recommend you simply create a new Pull request. it is much easier for you and ourselves we do not mind at all when PR are created and then closed. It's good practise to close the old PR, then reference this old Pull request #xyz when creating the new PR with the clean commit. it's great to see you're learning and making such contributions to Piwik!

@saleemkce
Copy link
Contributor Author

@mattab @tsteur New PR created here #7917 but I am not sure why commits is shown as 10 in new PR. Accept this or the new one whichever looks good for you. Also note that if I use git log --oneline origin/master..bugfix-updated-url-garbled* in my bugfix branch, there is only one commit SHA like d7978d6 so that there is no way to do rebase from new PR too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants