@tsteur opened this Pull Request on July 25th 2019 Member

Sometimes the live query can be really slow (many minutes). It could even end up using all database resources when there is a lot of data in the log tables, especially when segments are used on top. Often the system ends up executing queries like this:

```sql
SELECT sub.* FROM (
    SELECT log_inner.*
    FROM (
        SELECT log_visit.*
        FROM log_visit AS log_visit
        LEFT JOIN log_link_visit_action AS log_link_visit_action
            ON log_link_visit_action.idvisit = log_visit.idvisit
        WHERE
            ( log_visit.idsite IN ('1')
              AND log_visit.visit_last_action_time >= '2019-05-25 23:00:00' )
            AND ( log_link_visit_action.idaction_url IS NOT NULL
              AND log_link_visit_action.idaction_url <> ''
              AND log_link_visit_action.idaction_url <> '0' )
        ORDER BY idsite DESC, visit_last_action_time DESC
        LIMIT 0, 4010
    ) AS log_inner
    ORDER BY idsite DESC, visit_last_action_time DESC
) AS sub
GROUP BY sub.idvisit
ORDER BY sub.visit_last_action_time DESC
LIMIT 401
```

This fetches data over several months. It can obviously take a very long time, and it might create big temporary tables, trigger full table scans, etc. As we only want the most recent results, it would be smarter to first look only at the last day; if that query returns enough results, we can return them directly, making it a LOT faster.

In this PR I'm executing up to 5 queries to get the result instead of doing it in one query:

  • Look 1 day back
  • Look 1 more week back (without fetching the first day again)
  • Look 1 month back (without re-fetching the 1 week + 1 day already covered)
  • Look 1 year back (without re-fetching the data already requested)
  • Look at the remaining time range

If only 1 week is requested, it would execute at most 2 queries. If 2 years are requested, it would execute up to 5 queries, but stop as soon as the requested number of rows has been found. So we might actually only look at one day instead of 2 years!
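The steps above can be sketched roughly as follows. This is a Python sketch of the idea, not the actual PHP implementation; `fetch_rows` is a hypothetical stand-in for the real live query over one date window:

```python
from datetime import datetime, timedelta

def fetch_recent(fetch_rows, now, limit):
    """Query progressively larger lookback windows (1 day, 1 week,
    1 month, 1 year, then everything else) and stop as soon as
    `limit` rows have been found.

    fetch_rows(start, end, n) is a hypothetical callable returning
    up to n rows with start <= timestamp < end, newest first.
    """
    lookbacks = [timedelta(days=1), timedelta(weeks=1),
                 timedelta(days=30), timedelta(days=365), None]
    rows, end = [], now
    for delta in lookbacks:
        start = now - delta if delta is not None else datetime.min
        rows += fetch_rows(start, end, limit - len(rows))
        if len(rows) >= limit:
            break  # enough results; skip the remaining windows
        end = start  # next window must not re-fetch this one
    return rows[:limit]
```

A high-traffic site typically satisfies the limit in the very first window, so only one cheap one-day query runs; only quiet sites fall through to the wider windows.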

This logic can only be applied when there is a row limit. Otherwise we need to return all results anyway, and this can still be a problem, FYI @mattab.

It cannot be executed when there is a limit plus an offset @mattab, so it possibly won't solve too many problems, unless we executed each query twice. See this logic:

Imagine we have `Limit 5 Offset 5`
Query 1 finds: 2 rows => should use Limit 5, Offset 5. The method would return an empty array, and because of the offset we can't know that 2 rows were skipped, unless we executed the query twice whenever no result is returned: first `Limit 5, Offset 5`, then `Limit $offset, Offset 0` to find out how many rows were skipped.
Query 2 finds: 0 rows  => should use Limit 5, Offset 3 
Query 3 finds: 7 rows  => should use Limit 5, Offset 3 and basically should skip the first 3 rows and then return the next 4
Query 4 finds: 0 rows => should use Limit 1, Offset 0. That one is easy: as soon as at least one row has been found we use Offset 0, and the limit can be calculated easily too (originalLimit - foundRows).
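The bookkeeping described above can be sketched like this. It is an illustrative Python sketch that assumes we somehow know how many rows each window contains, which is exactly the information a plain SQL OFFSET hides (hence the "execute each query twice" problem):

```python
def paged_fetch(windows, limit, offset):
    """Apply LIMIT/OFFSET semantics across several result windows
    (newest first). The remaining offset shrinks by the size of each
    window until rows start being returned; from then on the offset
    is 0 and only the remaining limit shrinks
    (originalLimit - foundRows)."""
    out, remaining_offset = [], offset
    for rows in windows:
        usable = rows[remaining_offset:]       # skip still-offset rows
        remaining_offset = max(0, remaining_offset - len(rows))
        out += usable[:limit - len(out)]       # take what we still need
        if len(out) >= limit:
            break
    return out
```

With the example above (Limit 5, Offset 5; windows of 2, 0, 7, ... rows), the first two windows only consume the offset, the third window contributes 4 rows after skipping 3, and the fourth effectively runs with Limit 1, Offset 0.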

Another workaround that slightly improves the offset case: when Offset <= 3 * LIMIT, we apply the offset in PHP and not in MySQL. This means we fetch a lot more rows but then discard them in PHP. It would make this logic work for a few more pages.
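A rough sketch of that idea (hypothetical Python; `fetch_rows(n)` stands in for a chunked query issued without any SQL OFFSET):

```python
def fetch_with_app_side_offset(fetch_rows, limit, offset):
    """Only applicable while offset <= 3 * limit: ask the database
    for limit + offset rows with OFFSET 0, then discard the first
    `offset` rows in application code instead of in MySQL."""
    if offset > 3 * limit:
        raise ValueError("offset too large for this workaround")
    rows = fetch_rows(limit + offset)
    return rows[offset:offset + limit]
```

The 3 * LIMIT cutoff keeps the over-fetch bounded, so only the first few result pages benefit.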

And another workaround would be: if there's an offset, we simply check whether the first day gives results, and if not, fetch all other days.
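As a sketch (hypothetical Python; `fetch_first_day` and `fetch_full_range` stand in for the two queries): if the first day alone fills the page we are done cheaply, otherwise we fall back to one query over the whole range, as before.

```python
def fetch_with_offset_fallback(fetch_first_day, fetch_full_range, limit, offset):
    """With an offset present, try the cheap first-day query first;
    if it already yields a full page, return it. Otherwise give up
    on chunking and run the original single query over the range."""
    rows = fetch_first_day(limit, offset)
    if len(rows) >= limit:
        return rows
    return fetch_full_range(limit, offset)
```

This keeps the fast path for high-traffic sites while avoiding the offset bookkeeping problem entirely.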

@tsteur commented on July 26th 2019 Member

> Another workaround for offset that slightly improves it could be that when Offset <= 3 * LIMIT then we apply the offset in PHP and not in MySQL. Means we fetch a lot more rows but then discard them in PHP. Would make this logic work for a few more pages.

Maybe not the best idea actually.

> And another workaround be, if there's an offset, we simply check if the first day gives results and if not fetch all other days.

Implemented something like this but haven't pushed yet. It would be a compromise between the approaches and would possibly still be somewhat efficient for high-traffic websites.

@diosmosis commented on August 1st 2019 Member

I wonder if this is similar to Db::segmentedFetchAll() and if an approach like that could be used to lower the required resources. For example, check one day, check one week, then query one month at a time, instead of doing a whole year at once.

@tsteur commented on August 2nd 2019 Member

It's quite similar @diosmosis, just that Db::segmentedFetchAll() doesn't know anything about limit/offset/bind and has first/last/step parameters which I don't need.

@diosmosis commented on August 2nd 2019 Member

@tsteur I was actually more just wondering if it would save db resources to go month by month, instead of querying for a whole year (or the whole range). I haven't done any sort of testing, so I have no idea if it would help or not.

@tsteur commented on August 2nd 2019 Member

@diosmosis we were thinking that if we go month by month, it might end up being too many queries and taking too long. The thought was more about improving performance for high-traffic Matomo instances, which will likely find e.g. 500 visitors within the first day, week, or month. And if there weren't 500 visitors in the first month, they likely won't be found in the next months either, so it might be better to query the remaining date range at once.

We could be a bit "smart" maybe and add some logic: if we find e.g. 25% of the results in one month, we could assume all rows will be found if we query 3-4 more months instead of, say, the remaining 2 years. Is this maybe something you mean?
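That heuristic could look roughly like this (a sketch of the extrapolation idea, not something implemented in this PR):

```python
import math

def estimate_extra_months(found, needed, months_scanned=1):
    """If `found` of the `needed` rows turned up in the months
    scanned so far, extrapolate how many additional months should
    contain the rest. Returns None when nothing was found, i.e.
    there is no basis for extrapolation and the remaining range
    should be queried at once."""
    if found == 0:
        return None
    total = math.ceil(needed / found * months_scanned)
    return max(0, total - months_scanned)
```

For example, finding 25 of 100 needed rows in 1 month suggests scanning about 3 more months, matching the "3-4 more months" estimate above.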

@diosmosis commented on August 2nd 2019 Member

I just figured the high load was due to the number of rows being counted overall, and if so going over smaller chunks might be less of an issue. However, if that isn't what's causing the high load, then it wouldn't be a good idea.

Here's another crazy idea that's too difficult to implement: call VisitsSummary.get for multiple days over the last couple of months or so, and use the number of visits in those rows to figure out where an offset starts. (I just wanted to write this idea down since it seems strange yet workable; I don't think you should work on it :) )

Anyway, code looks good to me. Don't know if @mattab needs to look at it.

@tsteur commented on August 2nd 2019 Member

@diosmosis that might not help when a segment is used and when the segment is not preprocessed for example.

This Pull Request was closed on August 4th 2019