@mattab opened this Issue on February 15th 2016 Member

When we will require MySQL 5.5.3 in Piwik, it would be desired to make our database use utf8mb4 collation. This would solve some possibly weird issues and in general, is a best practise for properly managing unicode characters in MySQL.

Background information utf8mb4

@mattab commented on March 13th 2016 Member

This was originally suggested by @EgiX who recommends us to implement this improvement sooner rather than later. Requires MySQL > 5.5.3 - discussion to increase MySQL requirement covered in https://github.com/piwik/piwik/issues/9107

@ThaDafinser commented on June 13th 2016 Contributor

:+1: for this.

Just switched to MySQL 5.7 and it works fine with the new utf8mb4.

Also they plan to make this value the default in 5.8 maybe
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/

@slawa-dev commented on March 28th 2019

I stumbled upon a problem where custom value was showing �����. Looking through the possible causes I found out that Matomo changed only to utf8_unicode_ci.
utf8_unicode_ci does not cover all unicode characters, but only those with up to 3-bytes. To cover all characters utf8mb4_unicode_ci must be used.

Can I just change the collation of database table to utf8mb4_unicode_ci? Will it cause any issues?
I guess I have to change some Matomo code too.
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
But then it will break on every update.

@alexhass commented on March 28th 2019

This change require single file per table. Such a change is extreme problematic as many vservers have limis with open file handles and will run out of resources. I tried on my server and was made to revert... and my vserver has high limits... there are other hosters that have just 1/3 of mine.

@Findus23 commented on March 28th 2019 Member

@alexhass Can you expand what the encoding/collation of a database table has to do with the amount of files needed?
I'd expect MySQL/MariaDB to be able to handle utf8mb4_unicode_ci databases properly as they are using it as a default (I think).
But I also don't know that much about the details.

@alexhass commented on March 28th 2019

To be able to use utf8mb4 with php you have requirements. One is that you need to reconfigure mysql to use one file per table (innodb_file_per_table=true). This increases the files from one large db file to thousands. As mysql opens all files and keep them open you will run out of file handles and get an error „Too many open files“. Than mysql cannot start and you are offline.

At https://www.drupal.org/project/utf8mb4_convert you can see what is required to make utf8mb4 working. I wish I could convert my drupal sites but I cannot because of thousands of open file handles. The hoster has limited to a number where one drupal install may be too much... note every single mysql table requires at least two file handles.

@Findus23 commented on March 28th 2019 Member

Hi,

Indeed on my MariaDB instance innodb_file_per_table seems to be enabled and indeed there are two files per table.
But even assuming you are running Matomo since 10 years ago (so 10122=240 files) and are using as many plugins as I am (56 other tables), that just means you have 352 files open which is nothing compared to the default global limit I found (65536) or the limit on my vServer (1597359).
Even the per-process limit of 1024 should be enough to run Matomo for many years without changing any system config.

Which brings me to the question of why the limits are so low on your server and why you can't increase them just high enough.

@alexhass commented on March 28th 2019

I‘m not only running matomo on the server! Also keep in mind the operating system, dovecot, postfix, apache, every php file include and so on - all need additional file handles.

Mine is based on Virtuozzo virtualization and has 8gb ram, 200gb ssd and a typical limit is 1400 file handles for a 8,85€/month server. That is where vservers start and i‘m not willing to spend more money per month.

The file handle number cannot increased as the hoster force you to this limit. I asked them and was told to book a dedicated server for 40€/month. No no no...

If I‘m not wrong 1blue.de has a limit of ~800 file handles for 10€/month servers. That is common at this pricing and my hoster limits are the highest in Germany for entry level servers. Larger vserver for 16€ have ~2300 file handles.

One Drupal site has maybe 300-500 tables and I‘m running 7 of them.

@sgiehl commented on February 14th 2020 Member

@tsteur @mattab switching to utf8mb4 or at least activating it by default for new installs shouldn't be a problem.

only use utf8mb4 if it's present (in case it could be disabled for some reasons)

that would make it a lot more complicate, as we would need to check wich collation is actually in use before storing 4 byte chars and stuff like that.
Imho it I would be fine to require the collation for Matomo 4 as it should be very unlikely that it's not available...

@sgiehl commented on February 14th 2020 Member

I just tried to implement that, but actually it won't be that easy as simply converting all tables to utf8mb4. We have some tables where a varchar is part of a index. With utf8mb4 each char will need 4 bytes, but the default maximum index length is 767 bytes. So a varchar(200) will already be to big for a index. So either we make those fields smaller or we do not use utf8mb4 charset when it's not needed 🤔
@mattab @tsteur @diosmosis any thoughts?

@tsteur commented on February 15th 2020 Member

Not sure when we have an index on varchar. Would it help to index only first say 30 char of varchar? That might be good anyway to keep index smaller

@sgiehl commented on February 15th 2020 Member
            'site_url'    => "CREATE TABLE {$prefixTables}site_url (
                              idsite INTEGER(10) UNSIGNED NOT NULL,
                              url VARCHAR(255) NOT NULL,
                                PRIMARY KEY(idsite, url)
                              ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'option'        => "CREATE TABLE `{$prefixTables}option` (
                                option_name VARCHAR( 255 ) NOT NULL,
                                option_value LONGTEXT NOT NULL,
                                autoload TINYINT NOT NULL DEFAULT '1',
                                  PRIMARY KEY ( option_name ),
                                  INDEX autoload( autoload )
                                ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'session'       => "CREATE TABLE {$prefixTables}session (
                                id VARCHAR( 255 ) NOT NULL,
                                modified INTEGER,
                                lifetime INTEGER,
                                data TEXT,
                                  PRIMARY KEY ( id )
                                ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'archive_numeric'     => "CREATE TABLE {$prefixTables}archive_numeric (
                                      idarchive INTEGER UNSIGNED NOT NULL,
                                      name VARCHAR(255) NOT NULL,
                                      idsite INTEGER UNSIGNED NULL,
                                      date1 DATE NULL,
                                      date2 DATE NULL,
                                      period TINYINT UNSIGNED NULL,
                                      ts_archived DATETIME NULL,
                                      value DOUBLE NULL,
                                        PRIMARY KEY(idarchive, name),
                                        INDEX index_idsite_dates_period(idsite, date1, date2, period, ts_archived),
                                        INDEX index_period_archived(period, ts_archived)
                                      ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",

            'archive_blob'        => "CREATE TABLE {$prefixTables}archive_blob (
                                      idarchive INTEGER UNSIGNED NOT NULL,
                                      name VARCHAR(255) NOT NULL,
                                      idsite INTEGER UNSIGNED NULL,
                                      date1 DATE NULL,
                                      date2 DATE NULL,
                                      period TINYINT UNSIGNED NULL,
                                      ts_archived DATETIME NULL,
                                      value MEDIUMBLOB NULL,
                                        PRIMARY KEY(idarchive, name),
                                        INDEX index_period_archived(period, ts_archived)
                                      ) ENGINE=$engine DEFAULT CHARSET=$charset
            ",
@tsteur commented on February 16th 2020 Member

I reckon it could be fine for these columns to limit to say 30 or 50 characters if that helps make it work?

@mattab commented on February 17th 2020 Member

Don't know if it's possible to limit to 30-50 characters the INDEX because it's also used in PRIMARY KEY?

it seems maybe we need to limit these fields to 191 chars in length? https://stackoverflow.com/questions/51934422/varchar-primary-key-cannot-be-longer-than-191

@sgiehl commented on February 17th 2020 Member

Limiting to 191 works only for keys that aren't combined with other fields. Otherwise we need to reduce it even more (by the amount of bytes the other field takes).
We could also enforce those columns to use utf8 charset. Guess none of them should contain 4 byte chars nevertheless

@alexhass commented on February 17th 2020

I‘m strongly against mb4 format if not 100% needed. I cannot use matomo anymore than.

My vserver (18€/month) has a max open files limit that disallows me to convert to mb4 at all. mysql server requires two files per table and this means the same file handles once mysql starts up. You can no longer use one large innodb file that contains all tables. I have a medium vserver, but 2500 file handles are used very fast and than the server stops.

A root server is too expensive.

@tsteur commented on February 17th 2020 Member

Cheers @alexhass I had no idea hosters limit these things. Is that happening as soon as one table is using mb4 ? Or would it be fine if a few tracking tables (say 3-5 tables) use mb4?

On the tables above we indeed wouldn't really need MB4 so be easiest to keep them the way they are.

@alexhass commented on February 17th 2020

To run mb4 you need to enable innodb_file_per_table. The setting is global to mysql server and affects all tables. I‘m not aware that any other way exists.

@tsteur commented on February 17th 2020 Member

👍 so we would need to check such a flag and only use MB4 if possible. Would that work in your case?

@slawa-dev commented on February 19th 2020

Users have been waiting for utf8mb4 support for many years. Especially those with non English websites. It will solve a lot of issues.
Extremely limited servers can still run 3.x version of Matomo.
Utf8mb4 is the only way not to exclude certain languages that use character encoding with 4-bytes.

@alexhass commented on February 19th 2020

Tsteuer: maybe.

The general problem with mb4 stays as is. This is not about excluding a language. In general I would be 100% pro mb4, but the current limitations in mysql are really bad. I have no idea why file per table is required by design.... but as I know it is required and my server is not extremly limited!

Just try not enforcing mb4 and leave an option to run without mb4, please.

@alexhass commented on February 19th 2020

Maybe it is possible to detect file per table setting and use this setting to switch between utf8 and utf8mb4.

@slawa-dev commented on February 19th 2020
Server version: 10.1.34-MariaDB MariaDB Server

show variables like 'innodb_file_format';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+
1 row in set (0.00 sec)

https://docs.nextcloud.com/server/15/admin_manual/configuration_database/mysql_4byte_support.html

@sgiehl commented on February 19th 2020 Member

In MySQL 8 utf8mb4 will be the standard collation.

@alexhass is the utf8mb4 charset available on your system, or isn't it listed at all? E.g. does show collation like 'utf8mb4%'; return anything?

Imho we won't need utf8mb4 for all tables, it might be enough to convert the log tables...

@alexhass commented on February 19th 2020

Yes it is available. Maybe you read my comments first. This is not about the mysql does not have the feature. It requires file per table and this causes the kernel to go out of file handles as there are limits.

@sgiehl commented on February 19th 2020 Member

I have read them. Actually I would have expected that utf8mb4 is not available if it would require innodb_file_per_table.
I'm not that deep into that topic, but I've read through the MySQL docs a bit, and it doesn't seem to generally require it. But if we need keys longer than 767 chars, it's a requirement (along with innodb_large_prefix option)

@Findus23 commented on February 19th 2020 Member

@alexhass I get your argument that enabling innodb_file_per_table (which is required for utf8mb4) is a bad idea when one is strongly limited by file handles.

What I don't understand is why one would be limited in file handles. On my vServer (Hetzner 20€/Month, but the same is true for the 3€ per Month server) the limit is (/proc/sys/fs/file-max) 1.6M
(of which 200K are used) and I'm assuming this is the Debian default. If I wanted, I could simply set it to something higher (on my desktop computer it seems to be 9*10^18).

The linux kernel confirms this default of 1.6 million handles for 16GB

  /*
   * One file with associated inode and dcache is very roughly 1K.
   * Per default don't use more than 10% of our memory for files. 
   */

https://serverfault.com/questions/716578/default-value-of-proc-sys-fs-file-max/716591

Increasing the file handler limit by 10K would allow 3000 more MySQL tables which should be enough for Matomo for a while and would worst case (if all of them are used) take up an additional 10MB of memory. So unless you are running on a microcontroller with severe memory limitations file handlers aren't scarce.

And it shouldn't matter to the Hoster what file limit I am using, if I want to spend a few MB more or less of my memory on it, which kernel I am using or even if I want to run OpenBSD or Haiku on my virtual server as thanks to KVM they just forward the low-level commands to the CPU.

I don't think allowing hosters to save single digit megabytes of memory is worth denying people to store data in their native language.

@alexhass commented on February 19th 2020

I‘m on server4you vserver and the limits are the highest vserver limits on the market. 1blu and others have a lot lower limits than server4you has

@slawa-dev commented on February 21st 2020

Just checked a cheap t3a.micro AWS VS with 1GB RAM and default Ubuntu. You get 95.6K file handlers.

@alexhass commented on February 21st 2020

I’m not paying Amazon for underpowered and extreme expensive hosting. Find the 16gb machine with 512gb ssd for 16€ first.

This Issue was closed on May 17th 2020
Powered by GitHub Issue Mirror