@airblag opened this Issue on June 17th 2019

Hi,

My matomo installation is "big" (log_link_visit_action is ~40GB big for 55 days retention, log_visit is ~4GB), and we see over 97% of temporary tables created on disk in our mysql monitoring.

After trying to tune mysql with increasing tmp_table_size without any sucess I noticed that the log_visit table has the referer_url field set as TEXT.

it blocks the use of the memory engine of mysql for tmp tables writing all of them to disk :

https://dev.mysql.com/doc/refman/5.7/en/blob.html

"[...]
Instances of BLOB or TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types (see Section 8.4.4, “Internal Temporary Table Use in MySQL”). Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.
[...]"

I'm not familiar with the code of matomo to look for the exact query producing the tmp_tables...

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

I tried to move the tmpdir of mysql to ramdisk, but since I try to optimize the DB once a week, and optimizing innodb needs a lot of place on /tmp it's always failing to optimize log_link_action_visit except I give more 40GB for my ramdisk...

@tsteur commented on June 18th 2019 Member

@mattab that's interesting. Maybe would also help with log_action.name?

@mattab commented on June 21st 2019 Member

Thanks for the report @airblag

Is it thinkable to somehow convert it to varchar(65535) to avoid having all tmp_tables on disk ? Or would it have a lot of side effects ?

Reckon that it should be fine for more than 99.9% of cases to have only 65535 character long URLs.
@tsteur Maybe something worth investigating?

@tsteur commented on June 21st 2019 Member

@mattab text also has only 65,535 characters

@tsteur commented on June 21st 2019 Member

We will schedule this for Matomo 4. Thanks for this @airblag

@tsteur commented on June 21st 2019 Member

@mattab I'm thinking we could actually already apply this change for new installs and only include the update in Matomo 4. This will save many thousands of migrations later.

@airblag commented on June 24th 2019

I think we should not make a varchar(65535) but something a bit smaller. The whole row in mysql can be maximum 65535 bytes long, so we need to let some place for the other fields in the row. We should make something like varchar(65535 - size(all other columns)) at maximum, and maybe let some place to extend the table with some extra rows for future schema changes.
In my installation, the biggest url was around 3500 chars. Since the RFC doesn't set any limit to an URL (as far as I understand), there might always be some special cases where the referer doesn't fit fully in the referer_url field. But i think that chomping a referer would almost never happen and should not break the stats that much...

@tsteur commented on June 24th 2019 Member

Yeah I reckon be better to set to something like 50 or 60K just to be safe.

@tsteur commented on June 24th 2019 Member

Let's change the type now for new installs, and already add an update for Matomo 4.0 which will be executed once they update to Matomo 4.

@airblag commented on June 26th 2019

We converted both TEXT fields into varchar in our production setup. But it still doesn't seem to reduce anyhow the amount of tables on disk created.
seeing the rate of how the tables are created on disk, it seems to be produced not by archiving jobs but by every request the server receives.

I just checked all the tables of my piwik DB and grepped for text|blob and there are a lot more !

My dirty way of finding out :
echo show tables|mysql piwik|while read table ; do echo "desc $table;"|mysql piwik|grep -iE --color '(TEXT|BLOB)' && echo "======> $table";done

Maybe it would be good to check in the requests updating/inserting log data which text/blobs are used, and first replace only thoses fields with varchar.

I'll try to have a look, but since I never looked into the code of matomo, it might be easier for someone who is familiar with it :)

@tsteur commented on June 26th 2019 Member

Let us know what you find. Wouldn't think it creates a temporary table during tracking but never know.

@airblag commented on June 27th 2019

So, it took my day of work, but I found out which query is producing the temporary table on disk !

``` mysql:piwik> SHOW STATUS LIKE 'Created_tmp%tables';
+-------------------------+-------+
| Variable_name | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 6 |
| Created_tmp_tables | 8 |
+-------------------------+-------+
2 rows in set (0.00 sec)

mysql:piwik> SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' ) OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' ) OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' ) OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' ) GROUP BY type, hash, name;
+----------+------+--------------------------+
| idaction | type | name |
+----------+------+--------------------------+
| 3869297 | 10 | taz.de/!5603531 |
| 1562941 | 10 | TZI |
| 1562939 | 11 | displayed |
| 3131298 | 12 | ARTIKELAUFRUF_ohne_Layer |
+----------+------+--------------------------+
4 rows in set (0.00 sec)

mysql:piwik> SHOW STATUS LIKE 'Created_tmp%tables';
+-------------------------+-------+
| Variable_name | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 7 |
| Created_tmp_tables | 9 |
+-------------------------+-------+
2 rows in set (0.00 sec)


# A few notes/questions about this request :

1. why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.
2. why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and **DON'T PRODUCE TEMPORARY TABLE !!!**. 
3. removing the name from GROUP BY gives also the same result and also **DON'T PRODUCE TEMPORARY TABLE !!!**. 

I grepped through the code and it seems to be produced in `core/Tracker/Model.php`

<a class='mention' href='https://github.com/tsteur'>@tsteur</a> do you have an idea if 2. or 3. can be done without breaking something ?

# Some notes about debugging

For getting it, I tried first to activate debugging and sql_profiling in matomo, but I hit the bug of logging to file which didn't worked. I applied manualy the patch for the https://patch-diff.githubusercontent.com/raw/matomo-org/matomo/pull/14296.patch on plugins/Monolog/config/config.php.

But I couldn't find the correct requests in the log file. Maybe because I use the QueuedTracking plugin ?

Then I just logged every query of mysql for a few seconds (just enough to let QueuedTracking write in the DB):

SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';
SELECT sleep(30);
SET GLOBAL general_log = 'OFF';

@tsteur commented on June 28th 2019 Member

why is the crc32() of the name checked in addition of the name ? I removed it (and the GROUP BY hash part), and I get the same result.

There is an index on hash but not on name (cause too long and starts with too similar values usually). The hash is only 10 characters so potentially it could generate the same value which is why it probably also adds the name into the query.

why are we selecting MIN(idaction) instead of idaction ? selecting directly idaction gives me the same result and DON'T PRODUCE TEMPORARY TABLE !!!.

There used to be cases where the same action was added twice due to some race condition issues or something. See https://github.com/matomo-org/matomo/issues/6436 . The PR was done in https://github.com/matomo-org/matomo/pull/7112 . Looking at this, it still seems needed. Can you confirm @diosmosis ? Simply because there is no unique key on the DB and duplicates can happen. Wouldn't really know how to avoid it but maybe it be faster to change the createAction() method to do INSERT INTO ACTION... ; and then SELECT where hash = ? type = ? name = ? If two results are there, we would need to remove the one with the highest ID, and return the lowest ID. As actions are inserted more rarely, and the needed select is fast, this may be an option?

Actually... seeing now this is already implemented. So the only reason the min(idaction) is still there because there could be older records from before this fix was applied that could have duplicate actions. Maybe it could be removed now? I think we had some logic to remove duplicate actions in a task or so? @diosmosis

removing the name from GROUP BY gives also the same result and also DON'T PRODUCE TEMPORARY TABLE !!!.

The group by is there for the min to work I reckon.

@diosmosis commented on June 28th 2019 Member

@tsteur We could just select the highest ID in PHP instead of in MySQL, would that work?

@tsteur commented on June 28th 2019 Member

@diosmosis that should work too. Considering there shouldn't be too many entries

@tsteur commented on June 28th 2019 Member

Does that mean we could remove the min() and the group by maybe?

@diosmosis commented on June 28th 2019 Member

I don't think we can just remove it w/o something that deals w/ duplicates (that come from old bugs and from any new bugs), but if it causes problems we shouldn't have it.

@tsteur commented on June 28th 2019 Member

I meant if we change

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash, name;

to something like

SELECT idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  

then we could do some logic like

$idActions = array();
foreach ($rows as $row) {
   $idAction = $row['idaction'];
   $name = $row['name'];
   if (!isset($idActions[$name])) {
      $idActions[$name] = $row;
   } else if ($idAction < $idActions[$name]['idaction']) {
       $idActions[$name] = $row;
   }
}
return array_values($idActions);

Not tested or so... I simply meant if we don't fetch like crazy huge amount of values there, which we never do maybe I think, this could be faster and avoid tmp tables?

@diosmosis commented on June 28th 2019 Member

Yes that's what I meant too.

@airblag commented on June 28th 2019

Ok, so the crc32 is there to make the index work better, but the probability of collision is to high, so we check both hash and name in the where clause ?
the min() is needed because entries could be duplicated.

What I was meaning in 3. was only to remove name from the group by part. It would look like this and not create tmp tables

SELECT MIN(idaction) as idaction, type, name FROM log_action WHERE ( hash = CRC32('taz.de/!5603531') AND name = 'taz.de/!5603531' AND type = '10' )  OR ( hash = CRC32('displayed') AND name = 'displayed' AND type = '11' )  OR ( hash = CRC32('TZI') AND name = 'TZI' AND type = '10' )  OR ( hash = CRC32('ARTIKELAUFRUF_ohne_Layer') AND name = 'ARTIKELAUFRUF_ohne_Layer' AND type = '12' )  GROUP BY type, hash;

since we filtered out possible collisions with the WHERE clause, I guess only grouping by type,hash should work too, or I am still missing something ?

@tsteur commented on June 28th 2019 Member

If you group by type, hash only, then two entries with different names, but the same hash will be grouped together. Of course that would be only the case if in the same where we select two different names that result in the same hash which would be rather unlikely.

@tsteur commented on June 28th 2019 Member

Considering the chances are small that we are fetching a few values that result in the same hash we could apply the quick fix to simply remove the name from the group by. The type is still in the group by as well. So it would need to result in the same hash for a name with the same type. Just looked at a log_action table with 11M entries and there we had 7000 duplicate hashes with 2 different names for the same type. So while the changes that this happens is very low, it can still happen and be better to apply the proper fix mentioned in https://github.com/matomo-org/matomo/issues/14535#issuecomment-506615014 to avoid getting reports for weird issues later.

@tsteur commented on June 28th 2019 Member

created PR https://github.com/matomo-org/matomo/pull/14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

@airblag commented on June 28th 2019

created PR #14584

@airblag can you otherwise confirm that changing the action name to varchar did not help?

maybe it helps for other queries maybe in the archiving process, but it's not really to observe in our graphs with an average of 30 tmp tables/second.

@tsteur commented on June 28th 2019 Member

it be great if you could patch your Matomo with the referenced PR. Feel free to wait though until there was a review.

@airblag commented on June 28th 2019

I just patched with the PR, and the tmp_tables dropped from 30 per seconds to around 0. Looks like it works for me :)

EDIT:
in my installation log_action.name is varchar(2048). Not sure if it would reacts differently with TEXT.

@airblag commented on July 2nd 2019

Just to confirm after a few days, it does the job:

mysql_tmp_tables-week

@tsteur commented on July 2nd 2019 Member

cheers 👍 thanks for letting us know. I reckon it will be maybe still a good idea to migrate these from text to varchar @mattab @airblag ?

@airblag commented on July 2nd 2019

When I look at this page, there are still a plenty of cases where temporary tables might be created : https://dev.mysql.com/doc/refman/5.6/en/internal-temporary-tables.html

But there are only 3 cases where they have to be created on disk :

Presence of a BLOB or TEXT column in the table. This includes user-defined variables having a string value because they are treated as BLOB or TEXT columns, depending on whether their value is a binary or nonbinary string, respectively.
Presence of any string column in a GROUP BY or DISTINCT clause larger than 512 bytes for binary strings or 512 characters for nonbinary strings.
Presence of any string column with a maximum length larger than 512 (bytes for binary strings, characters for nonbinary strings) in the SELECT list, if UNION or UNION ALL is used.

So converting TEXT to VARCHAR might avoid also creating temporary tables on disk in queries done by future features/plugins, even if it's solved for the current release.

In this case it sounds like it was the "GROUP BY name"

@tsteur commented on August 15th 2019 Member

Another reason to use varchar instead of text as it may prevent creating temporary tables in memory:
image

@mattab commented on September 2nd 2019 Member

Be great to migrate these columns to VARCHAR:

  • log_action.name
  • log_visit.referer_url

and there is also:

  • log_conversion.url
@tsteur commented on September 5th 2019 Member

See https://github.com/matomo-org/matomo/pull/14859
I had to change to column size to 20000 as 65000 was too big and reverted the change for log_conversion.url

Powered by GitHub Issue Mirror