Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Looping with filter_limit and filter_offset does not always get all visits #16394

Closed
asharov opened this issue Sep 4, 2020 · 7 comments · Fixed by #17060
Closed

Looping with filter_limit and filter_offset does not always get all visits #16394

asharov opened this issue Sep 4, 2020 · 7 comments · Fixed by #17060
Assignees
Labels
Bug For errors / faults / flaws / inconsistencies etc. Help wanted Beginner friendly issues or issues where we'd highly appreciate community's help and involvement. Regression Indicates a feature used to work in a certain way but it no longer does even though it should.
Milestone

Comments

@asharov
Copy link

asharov commented Sep 4, 2020

Hello,

We're fetching visit details from Matomo using the Live.getLastVisitsDetails method. We're using the filter_limit and filter_offset parameters to limit the number of visits fetched per request like this:
filter_limit=500&filter_offset=0
filter_limit=500&filter_offset=500
filter_limit=500&filter_offset=1000
and so on until we get less than 500 results in a response.

What we're seeing is that sometimes there are actual gaps between two consecutive responses, so we're not getting all the visits. I was able to reproduce this with the demo.matomo.org site with the following requests:

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&filter_offset=4000
https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&filter_offset=5000

The last visit on the first response has server time "2020-09-02 21:30:16" and the first visit of the second response has server time "2020-09-02 08:30:24", leaving quite a big gap. Looking at different filter_offset values between these two, I notice the issue seems to happen between 4665 and 4666, that is, between these two requests:

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&filter_offset=4665
https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&filter_offset=4666

The visits in the first response have time range from "2020-09-02 18:00:31" to "2020-09-03 00:00:11", and in the second response from "2020-09-02 05:35:55" to "2020-09-02 10:05:10", so there is no overlap between the time ranges, even though the offsets are consecutive, so apart from the first and last, all the visits should be shared between both responses.

For these dates, if I set filter_limit to 5000, so filter_offset takes values 0, 5000, 10000, I seem to get all the data without problems, but in our situation we cannot increase the filter_limit because of resource constraints, and in any case I don't know if other date ranges exhibit this problem with different filter_limit values.

Best regards,
Jaakko

@tsteur
Copy link
Member

tsteur commented Sep 6, 2020

Hi @asharov

To loop over the last visits details API I would recommend not using filter_offset but minTimestamp.

Eg

?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000

Then you get the timestamp of the last visit and fetch all following visits from the timestamp after that.

?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&minTimestamp=XYZ

And you repeat the same until there are less than 1000 results.

There could be an edge case where there are many visits for the very same second and only some of them are returned. So it may pay off to do subtract 1s from the minTimestamp to fetch the visits from the last second again and discard them from the previous response.

Hope this helps.

@tsteur tsteur closed this as completed Sep 6, 2020
@tsteur tsteur added the answered For when a question was asked and we referred to forum or answered it. label Sep 6, 2020
@asharov
Copy link
Author

asharov commented Sep 8, 2020

Hi @tsteur

Thanks for the suggestion. Unfortunately I cannot make it work. The problem is that the returned visits are ordered latest one first. So when I make the request

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000

the latest result is from "2020-09-03 23:56:08", that is, from the very end of the date range I requested. If I use the timestamp, 1599170168, from this minus 1s as minTimestamp, like this

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&minTimestamp=1599170167

I get only one response. As expected, I guess, since the timestamp is near the end of the range. Also, trying to set minTimestamp to an early part of the date range, like this

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-02,2020-09-03&period=range&format=json&filter_limit=1000&minTimestamp=1599000000

gives the same response as the request without minTimestamp.

Looking at the API documentation, I noticed the filter_sort_order and filter_sort_column parameters, but they also behave strangely. When I add filter_sort_order=asc&filter_sort_column=serverTimestamp to my request, the responses do come in ascending order by time, but they start from the first visit of 2020-09-03, and not from the first visit of 2020-09-02, which is the start of the queried date range. It doesn't seem possible to get the visits from 2020-09-02 at all with these parameters.

Do you have any further suggestions of what could work?

Best,
Jaakko

@tsteur tsteur reopened this Sep 8, 2020
@tsteur tsteur added Regression Indicates a feature used to work in a certain way but it no longer does even though it should. Bug For errors / faults / flaws / inconsistencies etc. and removed answered For when a question was asked and we referred to forum or answered it. labels Sep 8, 2020
@tsteur
Copy link
Member

tsteur commented Sep 8, 2020

Thanks for the detailed information @asharov

Looking through the code I think this is a regression from #14700 as filter_sort_order is no longer working as expected. It would work if filter_sort_order was working correctly.

In particular, I think this is causing the problem: https://github.com/matomo-org/matomo/pull/14700/files#diff-366368dca569c3a4e0b3b0dc3724ecb0R115-R163

Say the requested date range is 2020-08-02,2020-08-30. Then it would apply the filter_sort_order but it would always first look for visits in 2020-08-30, then in the next query in 2020-08-22,2020-08-29, then 2020-08-02,2020-08-21. It should do it in this case probably do it the other way around.

Does the logic work when you request the data for each day individually (with filter_sort_order parameter)?

@tsteur tsteur added this to the 4.1.0 milestone Sep 8, 2020
@asharov
Copy link
Author

asharov commented Sep 9, 2020

Yes, if I request only for a single day, using filter_sort_order and looping with minTimestamp as you suggested does seem to get all the visits for that day. That is, with the initial request

https://demo.matomo.org/?module=API&method=Live.getLastVisitsDetails&idSite=62&date=2020-09-03&period=day&format=json&filter_limit=1000&filter_sort_order=asc&filter_sort_column=serverTimestamp

and then making further requests by setting minTimestamp based on the serverTimestamp of the last visit from the previous response. The first and last times of the overall visits are at the ends of the day and there are no gaps between different responses.

So I think I can use this method in our case now to patch any missing data as a workaround, but it would of course be preferable to have this method working over a range of days.

@tsteur
Copy link
Member

tsteur commented Sep 9, 2020

yes definitely it should work over a range of days. It's a regression as it used to work and a fix is likely not too hard.

@tsteur tsteur added the Help wanted Beginner friendly issues or issues where we'd highly appreciate community's help and involvement. label Sep 9, 2020
@TheCrowned
Copy link

Hey @tsteur! I would like to take care of this. Would simply reversing the order of the return array of splitDatesIntoMultipleQueries as in
$queries = $this->splitDatesIntoMultipleQueries($dateStart, $dateEnd, $limit, $offset);
be enough (of course, provided that $filterSortOrder === 'asc')? Can you foresee further issues with this change? :)

@tsteur
Copy link
Member

tsteur commented Sep 28, 2020

@TheCrowned this should work. Just need to make sure that in each date combination in the array the first one is the lower date and the second the higher date. Eg by default it would maybe return something like

[
 ['2020-09-28 00:00:00', '2020-09-28 23:59:59'], // look first last24 hours
 ['2020-09-21 00:00:00', '2020-09-27 23:59:59'] // then last 6 or 7 days
 ['2020-08-21 00:00:00', '2020-09-20 23:59:59'] // then last 30 days or so
]

it should then return

[
 ['2020-08-21 00:00:00', '2020-08-21 23:59:59'], // look first last24 hours
 ['2020-08-22 00:00:00', '2020-08-28 23:59:59'] // then last 7 days
 ['2020-08-29 00:00:00', '2020-09-28 23:59:59'] // then last 30 days or so
]

It's just an example. I'm basically only meaning it's important it returns eg ['2020-08-21 00:00:00', '2020-08-21 23:59:59'] and not the other way around ['2020-08-21 23:59:59', '2020-08-21 00:00:00'].

It should not simply reverse the returned order from the initial return. It should not return this:

[
 ['2020-08-21 00:00:00', '2020-09-20 23:59:59'] // then last 30 days or so
 ['2020-09-21 00:00:00', '2020-09-27 23:59:59'] // then last 7 days
 ['2020-09-28 00:00:00', '2020-09-28 23:59:59'], // look first last24 hours
]

I haven't looked into the implementation so not sure how it be best developed or what needs to be done. We basically always want to make sure to first look only at one day, then one week, then 30 days, etc for performance reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. Help wanted Beginner friendly issues or issues where we'd highly appreciate community's help and involvement. Regression Indicates a feature used to work in a certain way but it no longer does even though it should.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants