Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Much faster Reports archiving when Action segment is used, on large Piwik analytics servers #9522

Closed
mattab opened this issue Jan 14, 2016 · 8 comments
Labels
c: Performance For when we could improve the performance / speed of Matomo. wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it.

Comments

@mattab
Copy link
Member

mattab commented Jan 14, 2016

The problem

It can take more than 1000s to execute a single query with action segment:

SELECT
                count(distinct log_inner.idvisitor) AS `1`, 
            count(distinct log_inner.user_id) AS `39`
            FROM

        (

            SELECT
                log_visit.idvisitor, 
log_visit.user_id
            FROM
                piwik_log_visit AS log_visit
                LEFT JOIN piwik_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
            WHERE
                ( log_visit.visit_last_action_time >= '2015-07-01 07:00:00'
                AND log_visit.visit_last_action_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )
            GROUP BY
                log_visit.idvisit
            ORDER BY
                NULL
        ) AS log_inner;

Explain extended with a standard DB (vanilla Piwik):

id  select_type table   type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2   100.00  
2   DERIVED log_visit   range   PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor   index_idsite_datetime   12  NULL    1   100.00  Using index condition; Using where; Using temporary
2   DERIVED log_link_visit_action   ref index_idvisit   index_idvisit   4   piwik4437.log_visit.idvisit 1   100.00  Using where

We can see that it's starting from log_visit and index_idsite_datetime is being used. So in such case, this query will be quite fast as it's limiting the data set using date range.

Then this query for a big server with added additional indexes to speed up transitions report. Here's the result:

id  select_type table   type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    16765884    100.00  
2   DERIVED log_link_visit_action   ref index_idvisit,transitions_url   transitions_url 5   const   16765884    100.00  Using temporary
2   DERIVED log_visit   eq_ref  PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor   PRIMARY 4   piwik3293.log_link_visit_action.idvisit 1   100.00  Using where

In this case MariaDB is trying to use transitions_url index however it's not being used at all.

A faster query

Our SQL query above says:

            WHERE
                ( log_visit.visit_last_action_time >= '2015-07-01 07:00:00'
                AND log_visit.visit_last_action_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )

If we replaced the WHERE statements with following:

            WHERE
                ( log_link_visit_action.server_time >= '2015-07-01 07:00:00'
                AND log_link_visit_action.server_time <= '2015-08-01 06:59:59'
                AND log_visit.idsite IN ('1') )
                AND
                ( log_visit.visitor_returning = '0' AND log_link_visit_action.idaction_url = '15' )

Then the query takes 15 seconds to complete.

explain extended says that there are only 2M rows to scan:

1   PRIMARY <derived2>  ALL NULL    NULL    NULL    NULL    2261504 100.00  
2   DERIVED log_link_visit_action   range   index_idvisit,index_idsite_servertime,transitions_url   transitions_url 17  NULL    2261504 100.00  Using index condition; Using temporary
2   DERIVED log_visit   eq_ref  PRIMARY PRIMARY 4   piwik3293.log_link_visit_action.idvisit 1   100.00  Using where

Solution

Therefore the solution seems to be that, when log_link_visit_action is used in the query (ie. an action segment is used):

  • replace AND log_visit.idsite IN ('1') into AND log_link_visit_action.idsite IN ('1')

  • replace AND log_visit.visit_last_action_time by AND log_link_visit_action.server_time

    -> then the transitions_url INDEX would be used.

This would lead to massive performance improvements, where currently SQL query that often takes 30minutes would be then only 15s or so.

Notes

Thanks @quba for the find & report 👍

@mattab mattab added the c: Performance For when we could improve the performance / speed of Matomo. label Jan 14, 2016
@mattab mattab added this to the 2.16.0 milestone Jan 14, 2016
@quba
Copy link
Contributor

quba commented Jan 14, 2016

This requires further investigation as without the transition keys it still may be slow. Before implementing, we have to test many cases and most likely not only with this single query.

@mattab
Copy link
Member Author

mattab commented Jan 15, 2016

we may have to postpone to 2.16.1 but let's see!

@mattab mattab modified the milestones: 2.16.1, 2.16.0 Jan 20, 2016
@mattab mattab modified the milestones: 2.16.1, 3.0.0 Feb 3, 2016
@gebi
Copy link

gebi commented May 4, 2017

Any news on this?
I'm not sure we hit the same problem, but we have constant problems that piwik queries take >30min to complete on bigger instances (for just a single month and everything in memory in the mysql server).

@mattab
Copy link
Member Author

mattab commented Jun 21, 2017

@gebi can you post here, or in a separate issue, the slow queries from your mysql slow query log?

@eramirezprotec
Copy link
Contributor

Hi, @mattab. Did you manage to solve this issue? I have a Matomo installation running a few sites with a LOT of traffic. When I create segments, it takes forever to complete the archiving process. The query we detected as the problem was exactly the one you exposed in the first comment of this post.

Original query (32.26 sec):

SELECT count(DISTINCT log_inner.idvisitor) AS '1', count(DISTINCT log_inner.user_id) AS '39' FROM ( SELECT log_visit.idvisitor, log_visit.user_id FROM piwik_log_visit AS log_visit LEFT JOIN piwik_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit WHERE (log_visit.visit_last_action_time >= '2017-09-30 22:00:00' AND log_visit.visit_last_action_time <= '2017-10-31 22:59:59' AND log_visit.idsite IN ('13')) AND (log_link_visit_action. custom_dimension_4 = '/tv-y-comunicacion') GROUP BY log_visit.idvisit ORDER BY NULL ) AS log_inner;

New query (0.01 sec [it seems it could solve the problem, but we have not implemented it on our Matomo]):

SELECT count(DISTINCT log_inner.idvisitor) AS '1', count(DISTINCT log_inner.user_id) AS '39' FROM ( SELECT log_visit.idvisitor, log_visit.user_id FROM piwik_log_visit AS log_visit, (SELECT idvisit FROM piwik_log_link_visit_action WHERE idsite=13 AND custom_dimension_4 = '/tv-y-comunicacion') AS log_link_visit_action WHERE log_link_visit_action.idvisit = log_visit.idvisit AND (log_visit.visit_last_action_time >= '2017-09-30 22:00:00' AND log_visit.visit_last_action_time <= '2017-10-31 22:59:59' AND log_visit.idsite IN ('13')) GROUP BY log_visit.idvisit ORDER BY NULL ) AS log_inner;

@tsteur
Copy link
Member

tsteur commented Jan 29, 2018

FYI: I don't think that query would be faster for most installations as this query can select heaps of rows (in worst case basically selects like all rows when no other sites are used on the same DB):

SELECT idvisit FROM piwik_log_link_visit_action WHERE idsite=13 AND custom_dimension_4 = '/tv-y-comunicacion'

It is missing something like server_time >= ... && server_time <= ... directly in that query (and not in the outer query)

@mattab
Copy link
Member Author

mattab commented Jan 30, 2018

One thing that you could try is to enable the subquery segment cache which we may enable in the future see #8867

in your config/config.ini.php:

[General]
; whether to enable subquery cache for Custom Segment archiving queries
enable_segments_subquery_cache = 1
; Any segment subquery that matches more than segments_subquery_cache_limit IDs will not be cached,
; and the original subquery executed instead.
segments_subquery_cache_limit  = 100000
; TTL: Time to live for cache files, in seconds. Default to 60 minutes
segments_subquery_cache_ttl  = 3600

@mattab
Copy link
Member Author

mattab commented Jun 7, 2019

see #8867

@mattab mattab closed this as completed Jun 7, 2019
@mattab mattab added the wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it. label Jun 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: Performance For when we could improve the performance / speed of Matomo. wontfix If you can reproduce this issue, please reopen the issue or create a new one describing it.
Projects
None yet
Development

No branches or pull requests

5 participants