Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem in connection character encoding #984

Closed
anonymous-matomo-user opened this issue Sep 15, 2009 · 22 comments
Closed

Problem in connection character encoding #984

anonymous-matomo-user opened this issue Sep 15, 2009 · 22 comments
Assignees
Labels
Bug For errors / faults / flaws / inconsistencies etc. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Milestone

Comments

@anonymous-matomo-user
Copy link

Varchar columns are set in utf8_general_ci in piwik database when installing. I notice there is a problem when I insert accented characters. For example if you try to customize the page name displayed in JavaScript tracking, for example:

piwikTracker.setDocumentTitle("Home /Tlcom/Professionnels/INTEGRAL Pro/Avantages VIP");

In piwik_log_action table, the record is saved as:
Home/Tlcom/Professionnels/INTEGRAL Pro/Avantages Mobile

It seems the problem comes from the fact the connection isn't set as utf8. PDO doesn't care about charsets. So If you want to have your connection in unicode / utf-8 or any other encoding, you'll have to tell your database, for example using $dbh->exec('SET CHARACTER SET utf8') (mysql).

So I tested to add the following instruction on DB connect, just after establishing a new connection:

$this->connection = new PDO($this->dsn, $this->username, $this->password);
$this->connection->exec('SET CHARACTER SET utf8');
@robocoder
Copy link
Contributor

What version of MySQL? (There was a problem reported in the forum for MySQL 4.1.7-nt.)

In core/Piwik.php's createDatabaseObject() function, please try one of these:

$db = Zend_Db::factory($config->database->adapter, $dbInfos);
$db->getConnection();

$db->getConnection()->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");

or:

$dbInfos['driver_options'] = array(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");

$db = Zend_Db::factory($config->database->adapter, $dbInfos);
$db->getConnection();

@robocoder
Copy link
Contributor

And of course, make a similar change in core/Tracker/Db.php's connect() function.

@anonymous-matomo-user
Copy link
Author

I tried on Mysql 5.0.33

In core/Tracker/Db.php's connect() function, I've tried:

$this->connection = new PDO($this->dsn, $this->username, $this->password);
$this->connection->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, "SET NAMES 'utf8'");

But It does not works. But if I test:

$dbInfos = array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'");

$this->connection = new PDO($this->dsn, $this->username, $this->password, $dbInfos);

Then it works.

@robocoder
Copy link
Contributor

Thanks.

@robocoder
Copy link
Contributor

In [1473]:

fixes #904 - MySQL error codes; unsupported adapters can map these to driver-specific SQLSTATE
fixes #980 - Piwik Installation support for "MySQL Improved" (mysqli) extension
fixes #984 - Set client connection charset to utf8.

Fixed tracker profiling data not recorded until after report generated.

More refactoring and database abstraction:

  • Installation gets a list of adapters instead of hardcoding in the plugin
  • checking for database-specific system requirements deferred to the adapter
  • error detection moved to adapter but we still use MySQL error codes rather than defining new constants

Note: unit tests don't run with MYSQLI -- Zend Framework's Mysqli adapater doesn't support prepare() yet

@mattab
Copy link
Member

mattab commented Sep 22, 2009

It is a bit annoying that we have to do this. I am not entirely sure of the performance overhead of running this query for each page view, but this might be costly.

Is this really the only way to properly record the data in the tables? It is unclear to me why when the table field is set to UTF-8, with UTF-8 data coming in, the data would not be properly recorded?

@robocoder
Copy link
Contributor

From my testing and reading the MySQL documentation, it looks like the storage and connection charsets settings are independent of each other.

From http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html (emphasis added):
For the per-database or server-startup techniques, the settings control the character set for data storage. Applications must also tell the server which character set to use for client/server communications [...]

@robocoder
Copy link
Contributor

raziel057: in your mysql client, can you run show variables, and tell us what the values of these are:

  • character_set_client
  • character_set_connection
  • character_set_results
  • character_set_server

Thanks.

@anonymous-matomo-user
Copy link
Author

Yes storage and connection chartsets settings are independent.

Here my settings:

character_set_client : utf8
character_set_connection : utf8
character_set_database : utf8
character_set_results : utf8
character_set_server : utf8
character_set_system : utf8

collation_connection : utf8_unicode_ci
collation_database : utf8_unicode_ci
collation_server : utf8_unicode_ci

@robocoder
Copy link
Contributor

Why is it necessary to set the connection char set if everything appears to be defaulting to utf8? Are these your settings before the code change?

@anonymous-matomo-user
Copy link
Author

Yes I had the same settings before the code change. I don't know really why we must force the connection charset at connection but I noticed there is problems if it's not case...

Can you test?

@robocoder
Copy link
Contributor

My server defaults to latin1.

If I set the connection to latin1 (or let it default to this), and insert into a utf8 table something like '', it gets stored as (I using the hex() function in my select):

| C3A5C5A0C2A0C3A8C2BDC2BDC3A6E280A2C2B0C3A6C28DC2AEC3A4C2B8C2AD |

This isn't utf8 but a select gives back the expected result, so there's conversion magic at work.

If I set the connection to utf8, the same string is stored as:

| E58AA0E8BDBDE695B0E68DAEE4B8AD |

This is utf8 and select returns the original string. However, the previoulsly stored string is now munged in the select result.

So, my opinion is that the SET NAMES fix follows the MySQL documentation. The storage is more compact and from an ongoing support perspective, it means consistent behaviour for all users.

However, while it's ok for a new database, it's a potential problem for existing users and their datasets as there seems to be no easy way to fix the stored data.

Given http://bugs.mysql.com/bug.php?id=19292, there may be some library/driver-level issue at play here.

raziel057: Please make a copy of index.php and test this mod to see if there's a difference between the mysql client and the php extension:

78c78,82
<   $controller->dispatch();
---
> //    $controller->dispatch();
>   Piwik::createConfigObject();
>   Piwik::createDatabaseObject();
>   $rows = Piwik_FetchAll("SHOW VARIABLES LIKE 'character%'");
>   var_dump($rows);

@anonymous-matomo-user
Copy link
Author

I executed this code:

$controller->dispatch();
Piwik::createConfigObject();
Piwik::createDatabaseObject();
$rows = Piwik_FetchAll("SHOW VARIABLES LIKE 'character%'");
echo "<pre>"
var_dump($rows);
echo "</pre>";

As result:

array(8) {
  [0]=>
  array(2) {
    ["Variable_name"]=>
    string(20) "character_set_client"
    ["Value"]=>
    string(6) "latin1"
  }
  [1]=>
  array(2) {
    ["Variable_name"]=>
    string(24) "character_set_connection"
    ["Value"]=>
    string(6) "latin1"
  }
  [2]=>
  array(2) {
    ["Variable_name"]=>
    string(22) "character_set_database"
    ["Value"]=>
    string(4) "utf8"
  }
  [3]=>
  array(2) {
    ["Variable_name"]=>
    string(24) "character_set_filesystem"
    ["Value"]=>
    string(6) "binary"
  }
  [4]=>
  array(2) {
    ["Variable_name"]=>
    string(21) "character_set_results"
    ["Value"]=>
    string(6) "latin1"
  }
  [5]=>
  array(2) {
    ["Variable_name"]=>
    string(20) "character_set_server"
    ["Value"]=>
    string(4) "utf8"
  }
  [6]=>
  array(2) {
    ["Variable_name"]=>
    string(20) "character_set_system"
    ["Value"]=>
    string(4) "utf8"
  }
  [7]=>
  array(2) {
    ["Variable_name"]=>
    string(18) "character_sets_dir"
    ["Value"]=>
    string(26) "/usr/share/mysql/charsets/"
  }
}

@anonymous-matomo-user
Copy link
Author

Thanks raziel05. That confirms that the php extensions are not negotiating the charset for the connection; they're using the charset of the server build (ignoring my.cnf).

I'm going to make the following change:

  • Installation: if character set connection is not utf8, add a charset=utf8 to config
  • MySQL adapters: would only SET NAMES if charset is set in config

Existing users would not be (immediately) affected. latin1 users would still have inefficient storage of non-ASCII text.

Existing users could add this line to their config manually, or rebuild mysql with utf8 as the default charset.

@mattab
Copy link
Member

mattab commented Sep 24, 2009

apang (not vipsoft?), I am not sure this is the best solution - if the server configuration changes in between the time Piwik was installed and today, data would not be recorded the same. We should aim to have consistent behavior independently of the server configuration. Vipsoft, your comment seems to show that independantly of the connexion character set, data is stored and retrieved without error. raziel057 says he notices errors if the connexion character set is not set - were you able to reproduce his problem?

@robocoder
Copy link
Contributor

Hmm. I think I auto-logged in under an old user Id...I don't use that computer often except for IIS testing.

Re: proposed change. It would only fail if a server build went from utf8 default to non-utf8.

Re: mismatched charsets. MySQL appears to handle it in my limited testing and charset combination.

Re: raziel057's test. I can reproduce this but I'm not sure if there are cases where a conversion fails, thus resulting in lost pageviews.

Let's defer this to 0.5. I'll revert or comment out the changes.

To investigate:

  • brute force test of inserting utf8 sequences over a latin1 connection
  • converting specific columns from pseudo-utf8 to true utf8

@robocoder
Copy link
Contributor

In [1484], refs #984 - adapters no longer default to utf8 connection

add utf8 connection charset test -- the warning is currently suppressed pending further investigation (previously noted in this ticket)

in the interim, users can manually add 'charset = utf8' to their config.ini.php to get the desired behaviour

fix bug where skipping step in installation munged session var and a browser restart was necessary

fill in default port from adapter

restore emulated prepares for pdo_mysql to facilitate use of query cache

@mattab
Copy link
Member

mattab commented Nov 2, 2009

my testing, after applying patch in #708, showed that calling setDocumentTitle with accents resulted in the accents being displayed properly in the Page reports.

Anthon, my quick read of the commits show that no extra SQL query is made in the Tracker - can you please confirm?

Closing for now. Please reopen if there were any extra logic introduced because of this ticket that probably shouldn't be in the codebase.

@robocoder
Copy link
Contributor

Summary:

  • Default behaviour is no extra query (ie set name).
  • Users can add "charset = utf8" to their config.ini.php [database] settings manually.
  • However, for compactness and portability (eg using the mysql client), the extra "set name" query should be the default.

ToDo:

  • Conversion script
  • Automatically add "charset = utf8" to config.ini.php

@robocoder
Copy link
Contributor

In [1595], add compat function if mysqli_set_charset doesn't exist

@robocoder
Copy link
Contributor

Observations:

  • Drupal & Joomla always SET NAMES
  • Wordpress will SET NAMES if DB_CHARSET is defined in the config file

@robocoder
Copy link
Contributor

Note: I ran an exhaustive test (per comment:17) -- U+0000 thru U+10FFFF - and MySQL store/fetch ran without error.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Projects
None yet
Development

No branches or pull requests

3 participants