I'm meaning this query https://github.com/matomo-org/matomo/blob/4.5.0-b2/plugins/Ecommerce/VisitorDetails.php#L154-L169 which looks like this
SELECT COALESCE(SUM(ROUND(revenue,2)), 0) as lifeTimeRevenue, COUNT(*) as lifeTimeConversions,COALESCE(SUM(ROUND(items,2)), 0) as lifeTimeEcommerceItems FROM log_visit AS log_visit LEFT JOIN log_conversion AS log_conversion ON log_visit.idvisit = log_conversion.idvisit WHERE log_visit.idsite = '2' AND log_visit.idvisitor = 'VètO' AND log_conversion.idgoal = 0
At first glance the query looks good as it uses indexes etc.
This is usually also the case
However, things change when one visitor has hundreds or thousands of visits. Then it looks like this:
Here is more explain format=json information:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "365.16"
},
"nested_loop": [
{
"table": {
"table_name": "log_conversion",
"access_type": "ALL",
"possible_keys": [
"PRIMARY"
],
"rows_examined_per_scan": 738,
"rows_produced_per_join": 73,
"filtered": "10.00",
"cost_info": {
"read_cost": "261.84",
"eval_cost": "14.76",
"prefix_cost": "276.60",
"data_read_per_join": "2M"
},
"used_columns": [
"idvisit",
"idgoal",
"items",
"revenue"
],
"attached_condition": "((`test`.`log_conversion`.`idgoal` = 0) and (`test`.`log_conversion`.`idvisit` is not null))"
}
},
{
"table": {
"table_name": "log_visit",
"access_type": "eq_ref",
"possible_keys": [
"PRIMARY",
"index_idsite_config_datetime",
"index_idsite_datetime",
"index_idsite_idvisitor"
],
"key": "PRIMARY",
"used_key_parts": [
"idvisit"
],
"key_length": "8",
"ref": [
"test.log_conversion.idvisit"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 3,
"filtered": "5.00",
"cost_info": {
"read_cost": "73.80",
"eval_cost": "0.74",
"prefix_cost": "365.16",
"data_read_per_join": "113K"
},
"used_columns": [
"idvisit",
"idsite",
"idvisitor"
],
"attached_condition": "((`test`.`log_visit`.`idvisitor` = (@`id`)) and (`test`.`log_visit`.`idsite` = '2'))"
}
}
]
}
}
FYI Overall, in that specific case, there is actually no entry that matches idsite = 2 and idgoal = 0
in log_conversion
.
refs https://github.com/matomo-org/matomo/issues/16904
Not sure if we can improve this query, or maybe not execute it in the first place. Like if we are requesting the data for only one site, we could eg check if ecommerce
is enabled for that site. In this specific example where we had this problem it would have fixed it. Maybe we can also find a way though to further improve this.
For more details, or if you want me to test queries please ping me.
To summarise the todo list
where idvisitor = ?
and join other tableswhat seems to have fixed this for me was
SELECT /*+ JOIN_FIXED_ORDER() */ COALESCE(SUM(ROUND(revenue,2)), 0) ...
or also using STRAIGHT_JOIN
vs left join
but not sure if there's any issue with that .
It changes the plan to
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "4155.07"
},
"nested_loop": [
{
"table": {
"table_name": "log_visit",
"access_type": "ref",
"possible_keys": [
"PRIMARY",
"index_idsite_config_datetime",
"index_idsite_datetime",
"index_idsite_idvisitor"
],
"key": "index_idsite_idvisitor",
"used_key_parts": [
"idsite",
"idvisitor"
],
"key_length": "12",
"ref": [
"const",
"const"
],
"rows_examined_per_scan": 2798,
"rows_produced_per_join": 2798,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "7.82",
"eval_cost": "559.60",
"prefix_cost": "567.42",
"data_read_per_join": "84M"
},
"used_columns": [
"idvisit",
"idsite",
"idvisitor"
],
"attached_condition": "(`test`.`log_visit`.`idvisitor` = (@`id`))"
}
},
{
"table": {
"table_name": "log_conversion",
"access_type": "ref",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"idvisit",
"idgoal"
],
"key_length": "12",
"ref": [
"test.log_visit.idvisit",
"const"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 3948,
"filtered": "100.00",
"cost_info": {
"read_cost": "2798.00",
"eval_cost": "789.65",
"prefix_cost": "4155.07",
"data_read_per_join": "144M"
},
"used_columns": [
"idvisit",
"idgoal",
"items",
"revenue"
]
}
}
]
}
}
Below some performance tests. Seems STRAIGHT_JOIN
(last 5 entries) performed fastest vs the other ones maybe actually weren't faster.
@tsteur Two ideas came to mind here. I can't verify if either will work without experimentation, but just something else to consider.
idsite
and idvisitor
(in the order that best reduces the set) for the log_visit table.@samjf there is already such an index on the log_visit table, but because it matches so many rows (like 1000) it seems to prefer a different strategy and not use it.
Didn't realise we execute it twice 👍 It should be using a prepared statement already AFAIK although the idgoal could be a parameter too. Generally though it's interesting we could combine those two queries into one as it shouldn't make it slower like
SELECT log_conversion.idgoal, ... WHERE ... (log_conversion.idgoal = 0 or log_conversion.idgoal = -1) group by log_conversion.idgoal
Something to be careful of when we test this:
SQL_NO_CACHE
The server does not use the query cache. It neither checks the query cache to see whether the result is already cached, nor does it cache the query result. (Due to a limitation in the parser, a space character must precede and follow the SQL_NO_CACHE keyword; a nonspace such as a newline causes the server to check the query cache to see whether the result is already cached.)
https://dev.mysql.com/doc/refman/5.6/en/query-cache-in-select.html
Interestingly "a space character must precede and follow the SQL_NO_CACHE keyword"
@samjf that's good to know! could you mention this on https://developer.matomo.org/guides/profiling-code#profiling-a-specific-query where we talk about SQL_NO_CACHE? There is an edit button on the bottom right which allows you to edit the page using markdown. You can edit it directly in the Github UI. Then you can create a PR from this change by setting a branch name
@tsteur Sure thing! That is a good idea. I'll put a notice there too.