
Make Visitor Log live query more performant #14700

Merged
merged 14 commits on Aug 4, 2019

Conversation

tsteur (Member) commented Jul 25, 2019

Sometimes the live query can be really slow (taking many minutes). It can even end up using all database resources when there is a lot of data in the log tables, especially when segments are used on top. Often the system ends up executing queries like this:

SELECT sub.* FROM (
    SELECT log_inner.*
    FROM (
        SELECT log_visit.*
        FROM log_visit AS log_visit
        LEFT JOIN log_link_visit_action AS log_link_visit_action
            ON log_link_visit_action.idvisit = log_visit.idvisit
        WHERE (log_visit.idsite IN ('1')
               AND log_visit.visit_last_action_time >= '2019-05-25 23:00:00')
          AND ((log_link_visit_action.idaction_url IS NOT NULL
                AND log_link_visit_action.idaction_url <> ''
                AND log_link_visit_action.idaction_url <> '0'))
        ORDER BY idsite DESC, visit_last_action_time DESC
        LIMIT 0, 4010
    ) AS log_inner
    ORDER BY idsite DESC, visit_last_action_time DESC
) AS sub
GROUP BY sub.idvisit
ORDER BY sub.visit_last_action_time DESC
LIMIT 401

Here it fetches data over several months. This can obviously take a very long time, and it may create big temporary tables, cause full table scans, etc. Since we only want the most recent results, it would be smarter to first look at the last day only; if that query returns enough results, we can return them directly, making it a LOT faster.

In this PR I'm executing up to 5 queries to get the result instead of doing it in one query:

  • Look one day back
  • Look another week back (without fetching the first day again)
  • Look one month back (without fetching the week + day we already covered)
  • Look one year back (without fetching the data we already requested)
  • Look at the remaining time range

If only 1 week is requested, it would execute only 2 queries. If 2 years are requested, it would execute up to 5 queries, but it stops as soon as the requested amount of data has been found. So we might actually end up scanning only one day instead of 2 years!
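
Here's a minimal sketch of the idea (names are illustrative, not the actual code in plugins/Live/Model.php; $queryVisits stands in for the real date-bounded visit query):

    // Illustrative sketch of the chunked lookup. Each chunk ends where the
    // previous one started, so no rows are ever fetched twice.
    function fetchRecentVisits(DateTime $rangeStart, DateTime $rangeEnd, $limit, callable $queryVisits)
    {
        $chunkSizes = array('-1 day', '-1 week', '-1 month', '-1 year', null); // null = remaining range
        $rows = array();
        $chunkEnd = clone $rangeEnd;

        foreach ($chunkSizes as $size) {
            if ($size === null) {
                $chunkStart = clone $rangeStart;
            } else {
                $chunkStart = clone $chunkEnd;
                $chunkStart->modify($size);
                if ($chunkStart < $rangeStart) {
                    $chunkStart = clone $rangeStart;
                }
            }

            $rows = array_merge($rows, $queryVisits($chunkStart, $chunkEnd, $limit - count($rows)));

            if (count($rows) >= $limit || $chunkStart <= $rangeStart) {
                break; // enough rows found, or the whole requested range is covered
            }
            $chunkEnd = $chunkStart; // the next chunk stops where this one began
        }

        return array_slice($rows, 0, $limit);
    }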

This logic can only be applied when there is a row limit. Otherwise we need to return all results anyway, and that can still be a problem, fyi @mattab.

It also cannot be applied when there is a limit plus an offset, @mattab, so it possibly won't solve that many problems. Unless we executed each query twice; see this logic:

Imagine we have `Limit 5 Offset 5`:
Query 1 finds 2 rows => uses Limit 5, Offset 5 and returns an empty result. Because of the offset, we can't know that 2 rows were skipped, unless we execute every query twice when no result comes back: first `Limit 5, Offset 5`, then `Limit $offset, Offset 0` to find out how many rows the offset consumed.
Query 2 finds 0 rows => should use Limit 5, Offset 3.
Query 3 finds 7 rows => should use Limit 5, Offset 3; it skips the first 3 rows and returns the next 4.
Query 4 finds 0 rows => should use Limit 1, Offset 0. That one is easy: as soon as at least one row has been found, the offset becomes 0, and the limit can be calculated easily too (originalLimit - foundRows).
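
To illustrate that double-query bookkeeping (purely hypothetical, not what the PR implements), the still-unconsumed offset could be carried across chunks like this, where $queryChunk runs one date-chunk query with the given limit and offset:

    // Hypothetical sketch: carry LIMIT/OFFSET across date chunks.
    $remainingLimit  = $limit;   // e.g. 5
    $remainingOffset = $offset;  // e.g. 5
    $result = array();

    foreach ($chunks as $chunk) {
        list($start, $end) = $chunk;
        $rows = $queryChunk($start, $end, $remainingLimit, $remainingOffset);

        if (empty($rows) && $remainingOffset > 0) {
            // The chunk's rows may all have been swallowed by the offset; query
            // again without it to learn how many rows it consumed.
            $skipped = count($queryChunk($start, $end, $remainingOffset, 0));
            $remainingOffset -= $skipped;
        } else {
            $result = array_merge($result, $rows);
            $remainingLimit -= count($rows);
            $remainingOffset = 0; // once any row comes back, the offset is fully consumed
        }

        if ($remainingLimit <= 0) {
            break;
        }
    }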

Another workaround for the offset case that slightly improves things: when Offset <= 3 * Limit, we apply the offset in PHP instead of MySQL. That means we fetch quite a few more rows but then discard the skipped ones in PHP. It would make this logic work for a few more pages.
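
A minimal sketch of that PHP-side workaround, reusing the hypothetical $queryChunk helper from above (names are illustrative):

    // Hypothetical: fetch limit + offset rows with no SQL offset, then slice in PHP,
    // so the chunked logic stays usable for the first few pages.
    if ($offset > 0 && $offset <= 3 * $limit) {
        $rows = $queryChunk($start, $end, $limit + $offset, 0);
        $rows = array_slice($rows, $offset, $limit);
    }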

And yet another workaround would be: if there's an offset, we simply check whether the first day returns results, and if not, we fetch all the other days at once.

@tsteur tsteur added c: Performance For when we could improve the performance / speed of Matomo. Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. labels Jul 25, 2019
@tsteur tsteur added this to the 3.12.0 milestone Jul 25, 2019
@tsteur tsteur added Needs Review PRs that need a code review and removed Pull Request WIP Indicates the current pull request is still work in progress and not ready yet for a review. labels Jul 26, 2019
tsteur (Member Author) commented Jul 26, 2019

Another workaround for the offset case that slightly improves things: when Offset <= 3 * Limit, we apply the offset in PHP instead of MySQL. That means we fetch quite a few more rows but then discard the skipped ones in PHP. It would make this logic work for a few more pages.

Maybe not the best idea actually.

And yet another workaround would be: if there's an offset, we simply check whether the first day returns results, and if not, we fetch all the other days at once.

I implemented something like this but haven't pushed it yet. It would be a compromise, and possibly still reasonably efficient for high-traffic websites.

plugins/Live/Model.php (three outdated review threads, resolved)
}
} else {
$queries[] = array($dateStart, $dateEnd);
}
diosmosis (Member) commented:

Here's an idea that might make it a bit faster (assuming I'm reading this correctly). What the code is doing now is this:

  • check dateStart => dateStart + 1 day
  • still no visits? check dateStart => dateStart + 7 days
  • still no visits? check dateStart => dateStart + 30 days
  • ... etc.

In each query we're checking the same rows as the previous one; we could change it to something like:

  • check dateStart => dateStart + 1 day
  • still no visits? check dateStart + 1 day => dateStart + 7 days
  • still no visits? check dateStart + 7 days => dateStart + 30 days
  • ... etc.

tsteur (Member Author) commented:

@diosmosis it should already do what you suggested, unless I'm misunderstanding it. Basically the code does not fetch the same day or week twice, for example.

diosmosis (Member) commented:

Ah ok, $virtualEndDate changes; I didn't notice that.

diosmosis (Member) commented:

I wonder if this is similar to Db::segmentedFetchAll() and if an approach like that could be used to lower the required resources. For example, check one day, check one week, then query one month at a time, instead of doing a whole year at once.

tsteur (Member Author) commented Aug 2, 2019

It's quite similar, @diosmosis, just that Db::segmentedFetchAll() maybe doesn't know anything about limit/offset/bind, and it has the first/last/step parameters which I don't need.
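
For reference, a rough usage sketch of Db::segmentedFetchAll() from memory (the exact signature and placeholder convention should be verified against core/Db.php; $firstIdVisit, $lastIdVisit and $idSite are illustrative values):

    use Piwik\Common;
    use Piwik\Db;

    // Assumed behaviour: iterate from $first to $last in steps of $step, binding
    // each segment's bounds to the query's trailing two placeholders, instead of
    // running one huge query over the whole idvisit range.
    $sql  = 'SELECT idvisit, idsite FROM ' . Common::prefixTable('log_visit')
          . ' WHERE idsite = ? AND idvisit >= ? AND idvisit < ?';
    $rows = Db::segmentedFetchAll($sql, $firstIdVisit, $lastIdVisit, 10000, array($idSite));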

diosmosis (Member) commented:

@tsteur I was actually more just wondering if it would save db resources to go month by month, instead of querying for a whole year (or the whole range). I haven't done any sort of testing, so I have no idea if it would help or not.

tsteur (Member Author) commented Aug 2, 2019

@diosmosis we were thinking that going month by month might in the end mean too many queries and take too long. The thought was more about improving performance for high-traffic Matomo instances, and those will likely find e.g. 500 visitors within the first day, week, or month. And if there weren't 500 visitors in the first month, they likely won't be found in the next months either, so it might be better to query the remaining date range at once.

We could maybe be a bit "smart" and add some logic like: if we find e.g. 25% of the requested results in one month, we could assume that all rows will be found if we query 3-4 more months instead of, say, the remaining 2 years. Is that maybe what you mean?
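
A minimal sketch of that extrapolation heuristic (hypothetical helper, not part of this PR):

    // Hypothetical: estimate how many more one-month chunks to query based on
    // the rate of rows found so far, capped at the full remaining range.
    function estimateMonthsToQuery($rowsFound, $rowsNeeded, $monthsQueried, $monthsRemaining)
    {
        if ($rowsFound <= 0) {
            return $monthsRemaining; // no signal; query the whole remaining range at once
        }
        $rowsPerMonth = $rowsFound / $monthsQueried;
        return (int) min(ceil(($rowsNeeded - $rowsFound) / $rowsPerMonth), $monthsRemaining);
    }

    // E.g. 25% of 400 requested rows found in the first month (100 rows/month):
    // ~3 more months should cover the remaining 300 rows.
    echo estimateMonthsToQuery(100, 400, 1, 24); // 3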

diosmosis (Member) commented:

I just figured the high load was due to the number of rows being counted overall, and if so going over smaller chunks might be less of an issue. However, if that isn't what's causing the high load, then it wouldn't be a good idea.

Here's another crazy idea that's too difficult to implement: call VisitsSummary.get for multiple days over the last couple of months or so, and use the visit counts to figure out where an offset starts. (I just wanted to write this idea down since it seems strange yet workable; I don't think you should work on it :) )

Anyway, code looks good to me. Don't know if @mattab needs to look at it.

tsteur (Member Author) commented Aug 2, 2019

@diosmosis that might not help when a segment is used and the segment is not preprocessed, for example.
