Filtered monthly page views do not sum up to filtered yearly page views #17509

Open
AJHoeh opened this issue Apr 29, 2021 · 12 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc.

Comments


AJHoeh commented Apr 29, 2021

When requesting data from the page URL table and filtering it by a specific URL, the values returned by separate requests for each month of a year do not sum up to the values returned by a single request for the whole year. This holds both for the web interface and for the API.

Expected Behavior

Values of a metric for all months of a year should sum up to value of the same metric for the whole year.

Current Behavior

Sum of monthly values does not match yearly value (in my case: exceeds it).

Steps to Reproduce (for Bugs)

Example

  1. Get https://demo.matomo.cloud/?module=API&token_auth=anonymous&filter_limit=-1&method=Actions.getPageUrls&format=csv&flat=1&showColumns=nb_visits%2Cnb_hits&filter_pattern=%2Fdiving&convertToUnicode=0&idSite=1&period=year&date=lastYear
  2. Same request for each month of that year (12x), period=month:
    1. date=2020-01-01
    2. date=2020-02-01
    3. ...
  3. Compare returned data for the whole year with summed returned monthly data:
    1. Sum all values of columns nb_visits and nb_hits for every returned data table
    2. Sum the respective sums of all monthly data tables
    3. Compare values for whole year with calculated sum for all months
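
The comparison in the steps above can be sketched in Python roughly as follows. This is a minimal sketch, not part of the original report: the helper names `build_url`, `sum_columns`, `fetch_totals` and `compare_year_to_months` are hypothetical, the year is requested with an explicit date instead of `date=lastYear`, and the CSV is assumed to be UTF-8 with a header row containing `nb_visits` and `nb_hits`.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://demo.matomo.cloud/"  # demo instance from step 1
COLUMNS = ("nb_visits", "nb_hits")


def build_url(period, date, pattern="/diving", id_site=1):
    """Build an Actions.getPageUrls CSV request like the one in step 1."""
    params = {
        "module": "API",
        "token_auth": "anonymous",
        "filter_limit": -1,
        "method": "Actions.getPageUrls",
        "format": "csv",
        "flat": 1,
        "showColumns": ",".join(COLUMNS),
        "filter_pattern": pattern,
        "convertToUnicode": 0,
        "idSite": id_site,
        "period": period,
        "date": date,
    }
    return BASE_URL + "?" + urlencode(params)


def sum_columns(csv_text, columns=COLUMNS):
    """Sum the given numeric columns over all rows of a CSV report."""
    totals = dict.fromkeys(columns, 0)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in columns:
            totals[col] += int(row[col] or 0)
    return totals


def fetch_totals(period, date):
    """Download one report and return its column sums (needs network access)."""
    with urlopen(build_url(period, date)) as resp:
        return sum_columns(resp.read().decode("utf-8-sig"))


def compare_year_to_months(year):
    """Return (yearly totals, summed monthly totals) for one calendar year."""
    yearly = fetch_totals("year", f"{year}-01-01")
    monthly = dict.fromkeys(COLUMNS, 0)
    for month in range(1, 13):
        part = fetch_totals("month", f"{year}-{month:02d}-01")
        for col in COLUMNS:
            monthly[col] += part[col]
    return yearly, monthly
```

Calling `compare_year_to_months(2020)` would issue 13 requests; if the reports were consistent, the two returned dictionaries would be equal.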

In the given (and randomly chosen) example the values are:

| Metric    | Year      | Sum of months | Difference |
|-----------|-----------|---------------|------------|
| nb_visits | 1.672.155 | 1.672.216     | 61         |
| nb_hits   | 2.032.618 | 2.032.691     | 73         |

I admit the differences are quite small here, but honestly there shouldn't be any at all. Furthermore, for our own data the difference looks like this, which is really non-negligible:

| Metric    | Year   | Sum of months | Difference |
|-----------|--------|---------------|------------|
| nb_visits | 38.728 | 41.385        | 2.657      |
| nb_hits   | 54.366 | 58.026        | 3.660      |
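
To put the two sets of numbers side by side, the differences can be recomputed and expressed relative to the yearly value. This is a small illustrative sketch; the helper names `parse_count` and `discrepancy` are hypothetical, and the only inputs are the figures quoted above (written with `.` as thousands separator).

```python
def parse_count(text):
    """Parse a count written with '.' as thousands separator, e.g. '1.672.155'."""
    return int(text.replace(".", ""))


def discrepancy(year_total, month_sum):
    """Return (absolute difference, difference relative to the yearly value)."""
    year, months = parse_count(year_total), parse_count(month_sum)
    diff = months - year
    return diff, diff / year


# Figures as reported in this issue: (yearly value, sum of monthly values)
reported = {
    "demo nb_visits": ("1.672.155", "1.672.216"),
    "demo nb_hits": ("2.032.618", "2.032.691"),
    "own nb_visits": ("38.728", "41.385"),
    "own nb_hits": ("54.366", "58.026"),
}

for name, (year_total, month_sum) in reported.items():
    diff, rel = discrepancy(year_total, month_sum)
    print(f"{name}: +{diff} ({rel:.4%} of the yearly value)")
```

On the demo data the monthly sum exceeds the yearly value by well under 0.01%, whereas on our own data the excess is roughly 7%.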

Context

We need total (unique) pageviews for a certain subset of URLs. Usually we report them on a yearly basis, but we now needed them for every month of a year and realized that the monthly values do not add up to the respective yearly value.

Your Environment

  • Matomo Version: 4.2.1
  • PHP Version: 7.3
  • Server Operating System: ubuntu18.04.1
  • Additionally installed plugins: HeatmapSessionRecording (v4.0.11)
AJHoeh added the Potential Bug label Apr 29, 2021
Contributor

flamisz commented Apr 29, 2021

Hi @AJHoeh, thanks for creating the issue.

I can confirm those numbers; there really is a difference between the yearly report and the sum of the monthly reports (I even tried requesting the monthly reports with a date in the middle of the month, just to make sure this is not a weird issue at the edges of the months). Sorry about this.

We will do our best to investigate what could cause this difference.

flamisz added the Bug label and removed the Potential Bug label Apr 29, 2021
Member

tsteur commented Apr 29, 2021

@AJHoeh if you can, it may be worth invalidating the reports for the entire year and reprocessing them (see https://matomo.org/faq/how-to/faq_155/), then checking whether the numbers come out right afterwards. For example, you could run a console command like

./console core:invalidate-report-data --dates=2020-01-01,2020-12-31 --sites=1
# followed by
./console core:archive

Please note that this may take a long time to regenerate all reporting data and that if you have configured raw data deletion then you may not want to invalidate the reports as data could become unavailable if the raw data has been deleted already.

Generally, as part of Matomo 4 and also in the next 4.3.0 release we've made several improvements around these things that might have fixed it already.

Partially refs #10127 but there it's more about days vs other periods.

> We need total (unique) pageviews for a certain subset of urls on different websites.

BTW @AJHoeh not quite sure what you mean here by urls on different websites? Are those different websites (with different domains) tracked in one Matomo site or do you mean different sites in Matomo?

Author

AJHoeh commented Apr 30, 2021

Hey @flamisz and @tsteur, thanks for the fast reply!

Just to not send you down the wrong road, it came to my mind that I didn't test whether the bug is also present for unfiltered data, so it is possible that this is a more general issue and has nothing to do with filtering.

@tsteur thanks for the advice. "Regularly delete old raw data" is not enabled, so I will try that next week when I am at work again. Is there some kind of command to check whether the respective raw data is still available? I am pretty cautious with this kind of stuff. I should probably back up the db beforehand anyway...

> BTW @AJHoeh not quite sure what you mean here by urls on different websites? Are those different websites (with different domains) tracked in one Matomo site or do you mean different sites in Matomo?

That's just me being stupid and giving irrelevant information, which in the end is more confusing than helpful. It's about performing the described procedure for each of two sites in Matomo, which is our concrete use case but irrelevant to the issue itself. I'll edit out the "on different websites" part and elaborate more on the context to keep the issue as clear as possible.

Member

tsteur commented May 2, 2021

> Is there some kind of command to check whether the respective raw data is still available

Unfortunately there isn't. If you have access to your database and are familiar with MySQL, you could check using a query like the one below:

select idvisit from matomo_log_visit where visit_last_action_time < '2020-01-01 00:00:00' LIMIT 1

You might need to replace the table prefix matomo_ with something different like piwik_, or remove it, depending on your database configuration. If the query returns a result, then there is raw data for that date.
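
The prefix adjustment and the "returns a result" check can be sketched in Python as below. This is only an illustration under loud assumptions: an in-memory `sqlite3` database stands in for the real MySQL database, the helper names `raw_data_query` and `has_raw_data_before` are hypothetical, and the stand-in table holds a single made-up row. With a real MySQL driver the placeholder is usually `%s` rather than `?`.

```python
import re
import sqlite3


def raw_data_query(prefix="matomo_"):
    """Return the check query for a given table prefix (the prefix may be empty)."""
    # Table names cannot be bound as parameters, so validate the prefix instead.
    if not re.fullmatch(r"[A-Za-z0-9_]*", prefix):
        raise ValueError(f"unexpected table prefix: {prefix!r}")
    return (
        f"SELECT idvisit FROM {prefix}log_visit "
        "WHERE visit_last_action_time < ? LIMIT 1"
    )


def has_raw_data_before(conn, cutoff, prefix="matomo_"):
    """Return True if at least one visit ended before the cutoff timestamp."""
    cur = conn.execute(raw_data_query(prefix), (cutoff,))
    return cur.fetchone() is not None


# Stand-in database with one made-up visit (NOT real Matomo data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE piwik_log_visit (idvisit INTEGER, visit_last_action_time TEXT)"
)
conn.execute("INSERT INTO piwik_log_visit VALUES (1, '2019-12-31 23:59:00')")

print(has_raw_data_before(conn, "2020-01-01 00:00:00", prefix="piwik_"))
```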


heurteph-ei commented May 3, 2021

Hi all,
A little idea (not tested, because I have no such data): maybe the difference arises when some visits happen around midnight (starting before and ending after it). In that case, the visit belongs to 2 days, or even to 2 months if it falls at the end/beginning of a month...
The visit would then appear in month 1 and in month 2, but would be counted only once in the yearly total.
IDEA TO BE CHECKED!
Also, this could be related to issue #17516 (some visits / hits between 00:00 and 00:01 would then be counted twice).
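
The idea can be illustrated with a toy model. This sketch only shows the arithmetic of the hypothesis; it makes no claim about how Matomo actually attributes visits to periods, and the visits, timestamps and helper names (`months_touched`, `count_visits`) are all invented for illustration.

```python
from datetime import datetime


def months_touched(start, end):
    """Return the set of (year, month) pairs that a visit overlaps."""
    months = set()
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.add((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def count_visits(visits, year):
    """Count each visit once for the year, and once per month it overlaps."""
    yearly = sum(1 for start, end in visits if start.year <= year <= end.year)
    monthly_sum = sum(
        sum(1 for (y, _m) in months_touched(start, end) if y == year)
        for start, end in visits
    )
    return yearly, monthly_sum


# Three made-up visits: one inside January, two spanning a month boundary.
visits = [
    (datetime(2020, 1, 15, 10, 0), datetime(2020, 1, 15, 10, 20)),
    (datetime(2020, 1, 31, 23, 50), datetime(2020, 2, 1, 0, 10)),
    (datetime(2020, 6, 30, 23, 58), datetime(2020, 7, 1, 0, 3)),
]

yearly, monthly_sum = count_visits(visits, 2020)
print(yearly, monthly_sum)  # 3 visits in the year, but 5 when summing months
```

Under this model every visit that straddles a month boundary adds one to the monthly sum without changing the yearly count, which is the direction of the discrepancy reported above.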

Member

tsteur commented May 3, 2021

#17516 only concerns the visits log, not reports in general (they use very different underlying code). I'm quite certain that visitors being counted twice is not the issue, but at the same time you can never rule anything out. Given the way these reports are generated, I very much doubt this is the case here. Good that you brought it up, though.

Author

AJHoeh commented May 4, 2021

@tsteur Thanks for the guidance, I appreciate it. The data was there, so I invalidated the reports and manually archived the data again. Unfortunately, the result is exactly the same (actually that's probably a good thing, otherwise I wouldn't have a clue what's going on and whether I could trust any of our data anymore). If there is anything I can do or test to help find out what causes the bug, please let me know.

What @heurteph-ei said was my first thought as well; it just seems almost too simple... but sometimes that's exactly the reason why things are overlooked, so it's definitely worth looking into :)

@heurteph-ei

@AJHoeh Maybe you can try to find some visits through segmentation, using one or the other of the filters below:
[image]
and then check the visit logs of the result...
@tsteur, a feature is missing: comparisons of date or numeric values in the segmentation (<, <=, >=, >). Is this planned for the future? Or should I create a new issue?

Member

tsteur commented May 5, 2021

@heurteph-ei there should be already a feature request for this 👍

Author

AJHoeh commented May 5, 2021

Okay, so I conducted a few experiments:

  1. I requested the total pageviews via method=VisitsSummary.get and compared the yearly value with the sum of the monthly values - they matched
  2. I did the same thing as initially described in this issue but left out the filter parameter - the data matched
  3. I did the same thing as initially described in this issue but used period=range and date=2020-01-01,2020-12-31 instead of year - the returned value matched the yearly one (and therefore was not consistent with the sum of the monthly values)

So it seems this is really an issue related to the combined use of filters and some kind of sub-periods.

Thanks @heurteph-ei for the hint. I tried adding visitStartServerHour==23;visitServerHour==0 to my initial test requests for our website, but the returned monthly values summed up to only 29 nb_visits and 44 nb_hits, so I am unsure whether a larger window could explain the observed (~100-fold) difference.

I tried to request data for a given month and segment it by setting visitEndServerMonth to the following month, but it seems that is not supported for period=month, @tsteur?
However, when I set period=year, a few visits were also returned which actually happened at the very beginning of the month following visitEndServerMonth:
[image: border_case]

This seems really weird to me as well and is probably not intended?

Member

tsteur commented May 10, 2021

> it by setting visitEndServerMonth to the following month, but it seems that is not supported for period=month @tsteur?

If you select a month, then it would be expected to only return visits from the current month if the segment also is the number for the current month. Not sure it's clear what I mean?

You might be looking at the visitor log which is very different logic to regular reports and there is partially some logic to show some extra visits from very beginning of the following month but this is not the case for a report.

Author

AJHoeh commented May 19, 2021

@tsteur Sorry for the late reply:

> If you select a month, then it would be expected to only return visits from the current month if the segment also is the number for the current month. Not sure it's clear what I mean?

I am not sure whether I got that. What I meant was a request with e.g. date=2020-01-01 period=month segment=visitEndServerMonth==2. These requests do not return data for me, so I suppose either this is not supported or visits are always attributed to the month in which they ended (and not started). It is also possible that my logic is flawed here.

> You might be looking at the visitor log which is very different logic to regular reports and there is partially some logic to show some extra visits from very beginning of the following month but this is not the case for a report.

This is true for the very last test I described (starting at "However") and for which I provided the screenshot. Thought the web interface and visit log would be the easiest tool to spot the phenomenon in question here. But if the log works differently from the other reports just forget about it. Everything else was conducted via requesting the API targeting method=Actions.getPageUrls.

EDIT: Just upgraded to 4.3 and the problem persists

5 participants