@tsteur opened this Pull Request on July 25th 2019 Member

Sometimes the live query can be really slow (many minutes). It could even end up using all database resources when there is a lot of data in the log tables, especially when segments are used on top. Often the system ends up executing queries like this:

```sql
SELECT sub.* FROM (
    SELECT log_inner.*
    FROM (
        SELECT log_visit.*
        FROM log_visit AS log_visit
        LEFT JOIN log_link_visit_action AS log_link_visit_action
            ON log_link_visit_action.idvisit = log_visit.idvisit
        WHERE
            ( log_visit.idsite IN ('1')
              AND log_visit.visit_last_action_time >= '2019-05-25 23:00:00' )
            AND ( log_link_visit_action.idaction_url IS NOT NULL
              AND log_link_visit_action.idaction_url <> ''
              AND log_link_visit_action.idaction_url <> '0' )
        ORDER BY idsite DESC, visit_last_action_time DESC
        LIMIT 0, 4010
    ) AS log_inner
    ORDER BY idsite DESC, visit_last_action_time DESC
) AS sub
GROUP BY sub.idvisit
ORDER BY sub.visit_last_action_time DESC
LIMIT 401
```

This fetches data over several months. It can obviously take a very long time, and it might create big temporary tables, trigger full table scans, etc. As we only want the most recent results, it would be smarter to first look only at the last day; if that query returns enough results, we can return them directly, making it a LOT faster.

In this PR I'm executing up to 5 queries to get the result instead of doing it in one query:

  • Look 1 day back
  • Look 1 more week back (without fetching the first day again)
  • Look 1 month back (without re-fetching the 1 week + 1 day already covered)
  • Look 1 year back (without re-fetching the data already requested)
  • Look at the remaining time range

If only 1 week is requested, it would execute at most 2 queries. If 2 years are requested, it would execute up to 5 queries, but stop as soon as the requested number of rows has been found. So we might actually only look at one day instead of 2 years!
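The steps above can be sketched roughly as follows. This is a Python sketch of the idea, not the actual PHP implementation; `fetch_rows` is a hypothetical stand-in for the real live query over one date window:

```python
from datetime import datetime, timedelta

def fetch_recent(fetch_rows, now, limit):
    """Query progressively larger lookback windows (1 day, 1 week,
    1 month, 1 year, then everything else) and stop as soon as
    `limit` rows have been found.

    fetch_rows(start, end, n) is a hypothetical callable returning
    up to n rows with start <= timestamp < end, newest first.
    """
    lookbacks = [timedelta(days=1), timedelta(weeks=1),
                 timedelta(days=30), timedelta(days=365), None]
    rows, end = [], now
    for delta in lookbacks:
        start = now - delta if delta is not None else datetime.min
        rows += fetch_rows(start, end, limit - len(rows))
        if len(rows) >= limit:
            break  # enough results; skip the remaining windows
        end = start  # next window must not re-fetch this one
    return rows[:limit]
```

A high-traffic site typically satisfies the limit in the very first window, so only one cheap one-day query runs; only quiet sites fall through to the wider windows.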

This logic can only be applied when there is a row limit. Otherwise we need to return all results anyway, and this can still be a problem, FYI @mattab.

It cannot be executed when there is a limit plus an offset @mattab, so it possibly won't solve too many problems, unless we executed each query twice. See this logic:

Imagine we have `Limit 5 Offset 5`
Query 1 finds: 2 rows => should use Limit 5, Offset 5. The method would return an empty array, and because of the offset we can't know that 2 rows were skipped, unless we executed the query twice whenever no result is returned: first `Limit 5, Offset 5`, then `Limit $offset, Offset 0` to find out how many rows were skipped.
Query 2 finds: 0 rows  => should use Limit 5, Offset 3 
Query 3 finds: 7 rows  => should use Limit 5, Offset 3 and basically should skip the first 3 rows and then return the next 4
Query 4 finds: 0 rows => should use Limit 1, Offset 0. That one is easy: as soon as at least one row has been found we use Offset 0, and the limit can be calculated easily too (originalLimit - foundRows).
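The bookkeeping described above can be sketched like this. It is an illustrative Python sketch that assumes we somehow know how many rows each window contains, which is exactly the information a plain SQL OFFSET hides (hence the "execute each query twice" problem):

```python
def paged_fetch(windows, limit, offset):
    """Apply LIMIT/OFFSET semantics across several result windows
    (newest first). The remaining offset shrinks by the size of each
    window until rows start being returned; from then on the offset
    is 0 and only the remaining limit shrinks
    (originalLimit - foundRows)."""
    out, remaining_offset = [], offset
    for rows in windows:
        usable = rows[remaining_offset:]       # skip still-offset rows
        remaining_offset = max(0, remaining_offset - len(rows))
        out += usable[:limit - len(out)]       # take what we still need
        if len(out) >= limit:
            break
    return out
```

With the example above (Limit 5, Offset 5; windows of 2, 0, 7, ... rows), the first two windows only consume the offset, the third window contributes 4 rows after skipping 3, and the fourth effectively runs with Limit 1, Offset 0.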

Another workaround that slightly improves the offset case: when Offset <= 3 * LIMIT, we apply the offset in PHP and not in MySQL. This means we fetch a lot more rows but then discard them in PHP. It would make this logic work for a few more pages.
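A rough sketch of that idea (hypothetical Python; `fetch_rows(n)` stands in for a chunked query issued without any SQL OFFSET):

```python
def fetch_with_app_side_offset(fetch_rows, limit, offset):
    """Only applicable while offset <= 3 * limit: ask the database
    for limit + offset rows with OFFSET 0, then discard the first
    `offset` rows in application code instead of in MySQL."""
    if offset > 3 * limit:
        raise ValueError("offset too large for this workaround")
    rows = fetch_rows(limit + offset)
    return rows[offset:offset + limit]
```

The 3 * LIMIT cutoff keeps the over-fetch bounded, so only the first few result pages benefit.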

And another workaround would be: if there's an offset, we simply check whether the first day gives results, and if not, fetch all other days.
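As a sketch (hypothetical Python; `fetch_first_day` and `fetch_full_range` stand in for the two queries): if the first day alone fills the page we are done cheaply, otherwise we fall back to one query over the whole range, as before.

```python
def fetch_with_offset_fallback(fetch_first_day, fetch_full_range, limit, offset):
    """With an offset present, try the cheap first-day query first;
    if it already yields a full page, return it. Otherwise give up
    on chunking and run the original single query over the range."""
    rows = fetch_first_day(limit, offset)
    if len(rows) >= limit:
        return rows
    return fetch_full_range(limit, offset)
```

This keeps the fast path for high-traffic sites while avoiding the offset bookkeeping problem entirely.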

@tsteur commented on July 26th 2019 Member

> Another workaround for offset that slightly improves it could be that when Offset <= 3 * LIMIT then we apply the offset in PHP and not in MySQL. Means we fetch a lot more rows but then discard them in PHP. Would make this logic work for a few more pages.

Maybe not the best idea actually.

> And another workaround be, if there's an offset, we simply check if the first day gives results and if not fetch all other days.

Implemented something like this but haven't pushed yet. It would be a compromise between the approaches and would possibly still be somewhat efficient for high-traffic websites.

@diosmosis commented on August 1st 2019 Member

I wonder if this is similar to Db::segmentedFetchAll() and if an approach like that could be used to lower the required resources. For example, check one day, check one week, then query one month at a time, instead of doing a whole year at once.

@tsteur commented on August 2nd 2019 Member

It's quite similar @diosmosis, just that Db::segmentedFetchAll() doesn't know anything about limit/offset/bind and has first/last/step parameters which I don't need.

@diosmosis commented on August 2nd 2019 Member

@tsteur I was actually more just wondering if it would save db resources to go month by month, instead of querying for a whole year (or the whole range). I haven't done any sort of testing, so I have no idea if it would help or not.

@tsteur commented on August 2nd 2019 Member

@diosmosis we were thinking that if we go month by month, it might end up being too many queries and taking too long. The thought was more about improving performance for high-traffic Matomo instances, which will likely find e.g. 500 visitors within the first day, week, or month. And if there weren't 500 visitors in the first month, they likely won't be found in the next months either, so it might be better to query the remaining date range at once.

We could be a bit "smart" maybe and add some logic: if we find e.g. 25% of the results in one month, we could assume all rows will be found if we query 3-4 more months instead of, say, the remaining 2 years. Is this maybe something you mean?
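That heuristic could look roughly like this (a sketch of the extrapolation idea, not something implemented in this PR):

```python
import math

def estimate_extra_months(found, needed, months_scanned=1):
    """If `found` of the `needed` rows turned up in the months
    scanned so far, extrapolate how many additional months should
    contain the rest. Returns None when nothing was found, i.e.
    there is no basis for extrapolation and the remaining range
    should be queried at once."""
    if found == 0:
        return None
    total = math.ceil(needed / found * months_scanned)
    return max(0, total - months_scanned)
```

For example, finding 25 of 100 needed rows in 1 month suggests scanning about 3 more months, matching the "3-4 more months" estimate above.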

@diosmosis commented on August 2nd 2019 Member

I just figured the high load was due to the number of rows being counted overall, and if so going over smaller chunks might be less of an issue. However, if that isn't what's causing the high load, then it wouldn't be a good idea.

Here's another crazy idea that's too difficult to implement: call VisitsSummary.get for multiple days over the last couple of months or so, and use the number of visits in those rows to figure out where an offset starts. (I just wanted to write this idea down since it seems strange yet workable; I don't think you should work on it :) )

Anyway, code looks good to me. Don't know if @mattab needs to look at it.

@tsteur commented on August 2nd 2019 Member

@diosmosis that might not help when a segment is used and when the segment is not preprocessed for example.

This Pull Request was closed on August 4th 2019