@mattab opened this Issue on January 14th 2016 Owner

The problem

It can take more than 1000s to execute a single query with action segment:

SELECT
                count(distinct log_inner.idvisitor) AS `1`, 
            count(distinct log_inner.user_id) AS `39`
            FROM

        (

            SELECT
                log_visit.idvisitor, 
log_visit.user_id
            FROM
                piwik_log_visit AS log_visit
                LEFT JOIN piwik_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
            WHERE
                ( log_visit.visit_last_action_time >= '2015-07-01 07:00:00'
                AND log_visit.visit_last_action_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )
            GROUP BY
                log_visit.idvisit
            ORDER BY
                NULL
        ) AS log_inner;

Explain extended with a standard DB (vanilla Piwik):

id  select_type table   type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2   100.00  
2   DERIVED log_visit   range   PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor   index_idsite_datetime   12  NULL    1   100.00  Using index condition; Using where; Using temporary
2   DERIVED log_link_visit_action   ref index_idvisit   index_idvisit   4   piwik4437.log_visit.idvisit 1   100.00  Using where

We can see that it's starting from log_visit and index_idsite_datetime is being used. So in such case, this query will be quite fast as it's limiting the data set using date range.

Then this query for a big server with added additional indexes to speed up transitions report. Here's the result:

id  select_type table   type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    16765884    100.00  
2   DERIVED log_link_visit_action   ref index_idvisit,transitions_url   transitions_url 5   const   16765884    100.00  Using temporary
2   DERIVED log_visit   eq_ref  PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor   PRIMARY 4   piwik3293.log_link_visit_action.idvisit 1   100.00  Using where

In this case MariaDB is trying to use transitions_url index however it's not being used at all.

A faster query

Our SQL query above says:

            WHERE
                ( log_visit.visit_last_action_time >= '2015-07-01 07:00:00'
                AND log_visit.visit_last_action_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )

If we replaced the WHERE statements with following:

            WHERE
                ( log_link_visit_action.server_time >= '2015-07-01 07:00:00'
                AND log_link_visit_action.server_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )

Then the query takes 15 seconds to complete.

explain extended says that there are only 2M rows to scan:

1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2261504 100.00  
2   DERIVED log_link_visit_action   range   index_idvisit,index_idsite_servertime,transitions_url   transitions_url 17  NULL    2261504 100.00  Using index condition; Using temporary
2   DERIVED log_visit   eq_ref  PRIMARY PRIMARY 4   piwik3293.log_link_visit_action.idvisit 1   100.00  Using where

Solution

Therefore the solution seems to be that, when log_link_visit_action is used in the query (ie. an action segment is used):

  • replace AND log_visit.idsite IN ('1') into AND log_link_visit_action.idsite IN ('1')
  • replace AND log_visit.visit_last_action_time by AND log_link_visit_action.server_time

    -> then the transitions_url INDEX would be used.

This would lead to massive performance improvements, where currently SQL query that often takes 30minutes would be then only 15s or so.

Notes

Thanks @quba for the find & report :+1:

@quba commented on January 14th 2016 Contributor

This requires further investigation as without the transition keys it still may be slow. Before implementing, we have to test many cases and most likely not only with this single query.

@mattab commented on January 15th 2016 Owner

we may have to postpone to 2.16.1 but let's see!

@gebi commented on May 4th 2017

Any news on this?
I'm not sure we hit the same problem, but we have constant problems that piwik queries take >30min to complete on bigger instances (for just a single month and everything in memory in the mysql server).

@mattab commented on June 21st 2017 Owner

@gebi can you post here, or in a separate issue, the slow queries from your mysql slow query log?

@eramirezprotec commented on January 29th 2018 Contributor

Hi, @mattab. Did you manage to solve this issue? I have a Matomo installation running a few sites with a LOT of traffic. When I create segments, it takes forever to complete the archiving process. The query we detected as the problem was exactly the one you exposed in the first comment of this post.

Original query (32.26 sec):

SELECT count(DISTINCT log_inner.idvisitor) AS '1', count(DISTINCT log_inner.user_id) AS '39' FROM ( SELECT log_visit.idvisitor, log_visit.user_id FROM piwik_log_visit AS log_visit LEFT JOIN piwik_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit WHERE (log_visit.visit_last_action_time >= '2017-09-30 22:00:00' AND log_visit.visit_last_action_time <= '2017-10-31 22:59:59' AND log_visit.idsite IN ('13')) AND (log_link_visit_action. custom_dimension_4 = '/tv-y-comunicacion') GROUP BY log_visit.idvisit ORDER BY NULL ) AS log_inner;

New query (0.01 sec [it seems it could solve the problem, but we have not implemented it on our Matomo]):

SELECT count(DISTINCT log_inner.idvisitor) AS '1', count(DISTINCT log_inner.user_id) AS '39' FROM ( SELECT log_visit.idvisitor, log_visit.user_id FROM piwik_log_visit AS log_visit, (SELECT idvisit FROM piwik_log_link_visit_action WHERE idsite=13 AND custom_dimension_4 = '/tv-y-comunicacion') AS log_link_visit_action WHERE log_link_visit_action.idvisit = log_visit.idvisit AND (log_visit.visit_last_action_time >= '2017-09-30 22:00:00' AND log_visit.visit_last_action_time <= '2017-10-31 22:59:59' AND log_visit.idsite IN ('13')) GROUP BY log_visit.idvisit ORDER BY NULL ) AS log_inner;

@tsteur commented on January 29th 2018 Owner

FYI: I don't think that query would be faster for most installations as this query can select heaps of rows (in worst case basically selects like all rows when no other sites are used on the same DB):

SELECT idvisit FROM piwik_log_link_visit_action WHERE idsite=13 AND custom_dimension_4 = '/tv-y-comunicacion'

It is missing something like server_time >= ... && server_time <= ... directly in that query (and not in the outer query)

@mattab commented on January 30th 2018 Owner

One thing that you could try is to enable the subquery segment cache which we may enable in the future see #8867

in your config/config.ini.php:

[General]
; whether to enable subquery cache for Custom Segment archiving queries
enable_segments_subquery_cache = 1
; Any segment subquery that matches more than segments_subquery_cache_limit IDs will not be cached,
; and the original subquery executed instead.
segments_subquery_cache_limit  = 100000
; TTL: Time to live for cache files, in seconds. Default to 60 minutes
segments_subquery_cache_ttl  = 3600
Powered by GitHub Issue Mirror