Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On databases that have a default charset that is not utf-8, data can get corrupted #6497

Closed
ataraxie opened this issue Oct 22, 2014 · 25 comments
Assignees
Labels
Bug For errors / faults / flaws / inconsistencies etc. c: Performance For when we could improve the performance / speed of Matomo. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Milestone

Comments

@ataraxie
Copy link

Matt suggested me to create a bug report for this problem which was discussed in detail in this forum thread:
http://forum.piwik.org/read.php?2,120225,120539

Scenario

As Piwik admin user, navigate to Settings => Websites. I always get "You currently have access to 0 website". This happens when I am superuser as well as when I have dedicated "admin" permission to websites.

Problem source

Environment

  • PHP 5.5.10
  • Internet Information Services 7.5 (IIS)
  • MySQL 5.5.27
  • Windows Server 2008 R2

Since Piwik allows umlauts for site name, I consider this a Piwik bug. We have an enterprise customer installation of Piwik running with a database of ~ 40 GB. Our customer is very upset with this at the moment, since, with regard to the time I needed to figure out the problem source (see forum thread), this issue has kept us from updating from 1.12 to 2.x for several months. So I kindly ask you to consider this a major issue. Also, as the forum thread shows, I'm not the only one affected.

@mattab mattab added the Bug For errors / faults / flaws / inconsistencies etc. label Oct 22, 2014
@mattab mattab modified the milestones: Piwik 2.9.0, Short term Oct 22, 2014
@mattab
Copy link
Member

mattab commented Oct 22, 2014

@ataraxie thanks for the report. Because it's likely that this is a bug in PHP, could you please try upgrade php to latest 5.5.18 ? if you still experience the problem with this version we would further investigate. cheers

@tsteur
Copy link
Member

tsteur commented Oct 24, 2014

This is not a bug in PHP I think. I had once access to another Piwik instance with the exact same issue. In the end the values were stored "broken" in the database or so. It was not UTF8. I couldn't find the actual problem but also didn't invest so much time.

@diosmosis
Copy link
Member

If the problem was w/ the DB, then format=XML would fail, no?

@tsteur
Copy link
Member

tsteur commented Oct 24, 2014

json_encode couldn't handle those values and failed

@ataraxie
Copy link
Author

That's right. I tried with PHP 5.5.18 and the problem persists. Also: when I add use umlauts in the Piwik admin interface for a website name, it's ok. My sites were added through the API. When I check in the database, then the encoding of the sites added through the admin interface is different.

So I add a Piwik site in the admin interface with the name "äää". I do a

SELECT name FROM piwik_site WHERE idsite = 1;

The result is:

äää

When I add the site with an SQL statement or through the Piwik API, the result of the above statement will be:

äää

The second case causes the error and I assume (as @tsteur stated above) json_encode fails (I'm wondering why, if so, there's nothing written in the PHP error log).

So, I guess this is clearly an database encoding issue. I created the Piwik database on my server with the default CREATE DATABASE statement as adviced in the Piwik docs. However, if I run this:

SELECT * FROM information_schema.SCHEMATA

The result is that my Piwik database has latin1 as DEFAULT_CHARACTER_SET_NAME and latin1_swedish_ci as DEFAULT_COLLATION_NAME.

Hmmmmmmm.....

@ataraxie
Copy link
Author

Ok, I finally got something in the log files:

ERROR API[2014-10-26 18:04:45] C:\inetpub\wwwroot\piwik\core\DataTable\Renderer\Json.php(74): Warning - json_encode(): Invalid UTF-8 sequence in argument
  #0  Piwik\Error::errorHandler(...) called at [:]
  #1  json_encode(...) called at [C:\inetpub\wwwroot\piwik\core\DataTable\Renderer\Json.php:74]
  #2  Piwik\DataTable\Renderer\Json->renderTable(...) called at [C:\inetpub\wwwroot\piwik\core\DataTable\Renderer\Json.php:29]
  #3  Piwik\DataTable\Renderer\Json->render(...) called at [C:\inetpub\wwwroot\piwik\plugins\API\Renderer\Json.php:55]
  #4  Piwik\Plugins\API\Renderer\Json->renderArray(...) called at [C:\inetpub\wwwroot\piwik\plugins\API\Renderer\Json2.php:21]
  #5  Piwik\Plugins\API\Renderer\Json2->renderArray(...) called at [C:\inetpub\wwwroot\piwik\core\API\ResponseBuilder.php:260]
  #6  Piwik\API\ResponseBuilder->handleArray(...) called at [C:\inetpub\wwwroot\piwik\core\API\ResponseBuilder.php:106]
  #7  Piwik\API\ResponseBuilder->getResponse(...) called at [C:\inetpub\wwwroot\piwik\core\API\Request.php:218]
  #8  Piwik\API\Request->process(...) called at [C:\inetpub\wwwroot\piwik\plugins\API\Controller.php:33]
  #9  Piwik\Plugins\API\Controller->index(...) called at [:]
  #10  call_user_func_array(...) called at [C:\inetpub\wwwroot\piwik\core\FrontController.php:587]
  #11  Piwik\FrontController->doDispatch(...) called at [C:\inetpub\wwwroot\piwik\core\FrontController.php:95]
  #12  Piwik\FrontController->dispatch(...) called at [C:\inetpub\wwwroot\piwik\core\dispatch.php:34]
  #13  require_once(...) called at [C:\inetpub\wwwroot\piwik\index.php:46]

@mattab
Copy link
Member

mattab commented Oct 28, 2014

My sites were added through the API

Can you please paste here exact API URL used to created the website that cause the bug? I will try to re-create the issue locally. @ataraxie

@tsteur
Copy link
Member

tsteur commented Oct 28, 2014

I am pretty sure you won't be able to reproduce it. In the other instance where I experienced this bug it was also not possible to reproduce it on my instance. And it even failed to store the correct characters when adding a new site via SiteManager UI

@ataraxie
Copy link
Author

Thank your for further investigation. In my script, I'm calling this URL:

http://mypiwik/index.php?module=API&token_auth=[auth_token]&format=JSON&method=SitesManager.addSite&siteName=[siteName]&urls=[urls]

In addition to the administrator panel, the issue also occurs when I go to "All Websites" and then apply a date range.

I'm also going to do some further research in what data causes the json_encode call to fail. I'll let you know.

I don't think the issue is so hard to reproduce:

  • Create a database with DEFAULT_CHARACTER_SET_NAME latin1 (you can verify with SELECT * FROM information_schema.SCHEMATA)
  • Using MySQL workbench, change the name of a site to something with umlauts (e.g. UPDATE piwik_site SET name = 'testäää' WHERE idsite = 1;)

@ataraxie
Copy link
Author

Ok, some more information after experimenting in the source...

The interesting line where everything happens is piwik\core\DataTable\Renderer\Json.php line 74. Suppose I set the sitename of site 1 using MySQL workbench as explained above to "äää". If I do a few log/var_dump statements in the PHP file (suppose I have a putLog function), this is what happens:

  • putLog($array[1]['name'])) => "äää"
  • json_encode($array) => jsonbool(false)
  • json_encode($array[1]['name']) => jsonbool(false)
  • json_encode($array[1]['name']) => jsonbool(false)

So if the json_encode function returns jsonbool(false) on the array, there will be no data in the UI. The interesting thing is that if I do a utf8_encode on the problematic string, it will work fine (the result is a correct JSON string):

json_encode(utf8_encode($array[1]['name']))

I think the general question is: is my database encoding simply incorrect (latin1) and the requirement for Piwik to work is a UTF-8 encoded database? Note that MySQL distributions have usually default-character-set=latin1 in their my.ini. Otherwise, you'll probably need to take care of the encoding when sites are saved because you probably won't want to call a utf8_encode on all site names before calling json_encode. Another problem with that, however, will be that as soon as someone changes a site name using a tool like MySQL workbench, the problem will occur again. On the other hand, you're walking over the array already before calling json_encode: array_walk_recursive($array, $callback);. So in that case, you might want to consider calling utf8_encode on the site names in that callback function? Then in general, you'd probably want to think about what else besides the site names could be "broken".

I don't really know a good solution here at the moment. In our case, we'll probably need to patch the Json.php file temporarily with a utf8_encode on all sitenames in the callback function. Or something like that...

EDIT
I temporarily applied the following patch that solves the problem in line 69 of Json.php:

  • before
$value =html_entity_decode($value, ENT_QUOTES, "UTF-8");
  • after
$value = utf8_encode(html_entity_decode($value, ENT_QUOTES, "UTF-8"));

I don't really know what side-effects this has, especially on performance. The request really really long. But I had this problem without the encoding issues as well. I guess this is the same as the problem here:
http://forum.piwik.org/read.php?2,109904

@mattab
Copy link
Member

mattab commented Oct 28, 2014

@ataraxie thanks for the details. I'd like to ask you to test something. In your [database] section in config.ini.php can you add the following:

charset = utf8

Then try again to add a new website with utf characters. Are the names recorded valid this time? is the bug still the same for data added to your piwik after changing the config file?

@ataraxie
Copy link
Author

ataraxie commented Nov 3, 2014

@mattab I cannot believe it - this is it! What I did is:

  • removed my patch with utf8_encode
  • reloaded page => no results shown
  • added your line in the config
  • reloaded page => results are shown!

I also added a new website with name "äää" like I did before using the API and it works too! So I really think this solves the issue. Thank you!!

Besides that, I now still have the problem that the load process for the websites takes really long (~ 1 min) and I get "long running script" errors in browsers. Maybe you could also give me some directions how to proceed here? My archive script is running every 40 minutes without errors.

@mattab
Copy link
Member

mattab commented Nov 4, 2014

Besides that, I now still have the problem that the load process for the websites takes really long

please setup: http://piwik.org/docs/setup-auto-archiving/

regarding the initial problem, I'm wondering if we could detect that the Mysql does not run under proper unicode and automatically set the setting when it's the case?

@mattab mattab added Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change. and removed Bug For errors / faults / flaws / inconsistencies etc. labels Nov 4, 2014
@mattab mattab modified the milestones: Mid term, Short term Nov 4, 2014
@ghost
Copy link

ghost commented Nov 5, 2014

I found your solution here in GitHub for this kind of Problem: adding the line: charset = utf8 in the database section of config.ini.php.

I did this as well and the websites all reappeared in the settings.

But when I now look at my data the page titels are not displayed correct and look like this:

Broschüren --> should be: Broschüren
Spaßbäder --> should be: Spaßbäder

The database is utf-8 and the pages are all utf-8 as well.

What can I do?

Best regards

Christoph

@ghost
Copy link

ghost commented Nov 6, 2014

No its gone. Maybe it was a caching thing.

@mattab
Copy link
Member

mattab commented Feb 23, 2015

Again today on the forums someone reported that adding charset=utf8 solved a problem for them. Maybe we should investigate this and really understand the problem. Or maybe we should simply add charset=utf8 to the config by default, to prevent such issue all together?

at http://forum.piwik.org/read.php?2,124264

@mattab mattab added Bug For errors / faults / flaws / inconsistencies etc. and removed Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change. labels Feb 23, 2015
@mattab mattab modified the milestones: Short term, Mid term Feb 23, 2015
@mattab mattab changed the title Empty JSON response if Piwik sites have umlauts On databases that have a default charset that is not utf-8, data can get corrupted Apr 9, 2015
@mattab mattab added the c: Performance For when we could improve the performance / speed of Matomo. label Apr 9, 2015
@mattab
Copy link
Member

mattab commented Apr 9, 2015

The goal of this issue is to estimate whether there is any performance overhead when we add charset = utf8 in the global config file by default. This will result in running SET NAMES query on each request - is there is no overhead doing this, let's add it for sure as it make Piwik work well across more DB configs!

(because we told users of the workaround charset=utf8 many times in the forums in the past, this little change would help much. added Performance tag and renamed issue)

@mattab mattab added the Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. label Apr 10, 2015
@tsteur
Copy link
Member

tsteur commented Apr 21, 2015

It takes on my instance about 60 microsecs with Mysqli and 100 microsecs with PDO so should not cause any problems. Not sure how it behaves under load but don't think that it causes any problems.

Maybe the Tracker\DB could be improved as it does a separate set names query https://github.com/piwik/piwik/blob/2.13.0-b3/core/Tracker/Db/Pdo/Mysql.php#L92-L95 whereas in normal DB it is handled directly in DSN and driver option https://github.com/piwik/piwik/blob/2.13.0-b3/libs/Zend/Db/Adapter/Pdo/Mysql.php#L85-L107

I presume second version (dsn + init command) should be even faster

@mattab
Copy link
Member

mattab commented Apr 21, 2015

Maybe the Tracker\DB could be improved as it does a separate set names query whereas in normal DB [...]

btw reading your comment i remembered we wanted to consider: Unify Tracker/Db and Db into one? #7727

@saleemkce
Copy link
Contributor

Yes, I have had bad experiences in the past in OSIpage app where users would post lots of URLs from different languages. We had some key tables where URLs were stored, were using default latin1 charset which seemed to garble URLs of foreign languages.

The fix had been quite simple. We need to make sure that,

  • when creating database, we should use database collation as either "utf8_general_ci" or "utf8_unicode_ci"
    -> utf8_general_ci is a little bit faster than utf8_unicode_ci
    -> utf8_unicode_ci is more accurate than utf8_general_ci according to myql documentation.
    -> the best bet may be using "utf8_unicode_ci" as we would always prefer accuracy over small performance improvement.
  • all tables should use charset as utf8 and collate as utf8_unicode_ci

In other words, we could use this mysql command to change a table's charset one at a time.

ALTER TABLE myTable CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • any website displaying content other than English, should have this in HTML head tag so that it handles other language characters properly & correctly on web page.
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">

After applying these changes in OSIpage app, we didn't ever come across URL garble issues again. I hope this solves the issue.

@mattab mattab modified the milestones: 2.15.0, Short term Jul 14, 2015
@mattab
Copy link
Member

mattab commented Jul 14, 2015

It takes on my instance about 60 microsecs with Mysqli and 100 microsecs with PDO so should not cause any problems. Not sure how it behaves under load but don't think that it causes any problems.

@tsteur Awesome! moving into 2.15.0

tsteur added a commit that referenced this issue Jul 29, 2015
@tsteur tsteur self-assigned this Jul 29, 2015
tsteur added a commit that referenced this issue Jan 14, 2016
In #6497 we set the charset in the config to utf8 by default. This caused that latin1 was not set anymore and causes failures on MySQL 5.7. By forcing charset latin1 the load data infile of archives works again.
@yashweb3
Copy link

I found your solution here in GitHub for this kind of Problem: adding the line: charset = utf8 in the database section of config.ini.php.

I did this as well and the websites all reappeared in the settings.

But when I now look at my data the page titels are not displayed correct and look like this:

Broschüren --> should be: Broschüren
Spaßbäder --> should be: Spaßbäder

The database is utf-8 and the pages are all utf-8 as well.

What can I do?

Best regards

Christoph

i am getting same problem what was the solution :)

@yashweb3
Copy link

I found your solution here in GitHub for this kind of Problem: adding the line: charset = utf8 in the database section of config.ini.php.

I did this as well and the websites all reappeared in the settings.

But when I now look at my data the page titels are not displayed correct and look like this:

Broschüren --> should be: Broschüren
Spaßbäder --> should be: Spaßbäder

The database is utf-8 and the pages are all utf-8 as well.

What can I do?

Best regards

Christoph

Did you find any solution for it?

@matthewhutchings
Copy link

This is still an issue.

@mattab
Copy link
Member

mattab commented Nov 8, 2022

I'm not sure if it's helpful, but this is the FAQ we have that explains how to convert a Matomo DB to utf8mb4 charset, which is what we've been using by default https://matomo.org/faq/how-to-update/how-to-convert-the-database-to-utf8mb4-charset/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. c: Performance For when we could improve the performance / speed of Matomo. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Projects
None yet
Development

No branches or pull requests

7 participants