@AJHoeh opened this Issue on April 29th 2021

When requesting data from the page URL table and filtering it for a specific URL, the data returned by separate requests for each month of a year does not sum up to the data returned by a request for the whole year. This holds true both for the web interface and for the API.

Expected Behavior

The values of a metric for all months of a year should sum up to the value of the same metric for the whole year.

Current Behavior

Sum of monthly values does not match yearly value (in my case: exceeds it).

Steps to Reproduce (for Bugs)

Example

  1. Get https://demo.matomo.cloud/?module=API&token_auth=anonymous&filter_limit=-1&method=Actions.getPageUrls&format=csv&flat=1&showColumns=nb_visits%2Cnb_hits&filter_pattern=%2Fdiving&convertToUnicode=0&idSite=1&period=year&date=lastYear
  2. Same request for each month of that year (12x), period=month:
    1. date=2020-01-01
    2. date=2020-02-01
    3. ...
  3. Compare returned data for the whole year with summed returned monthly data:
    1. Sum all values of columns nb_visits and nb_hits for every returned data table
    2. Sum the respective sums of all monthly data tables
    3. Compare values for whole year with calculated sum for all months
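
The comparison in the steps above could be scripted roughly as follows. This is only a sketch, not an official Matomo client: the helper names (`build_url`, `sum_columns`, `fetch`, `compare_year_to_months`) are made up for illustration, the query parameters are copied from the request in step 1 (with an explicit date such as 2020-01-01 instead of lastYear), and it assumes the CSV response is UTF-8 with header columns named `nb_visits` and `nb_hits`.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://demo.matomo.cloud/"
COLUMNS = ("nb_visits", "nb_hits")


def build_url(period, date, pattern="/diving"):
    """Build the Actions.getPageUrls request from step 1 (hypothetical helper)."""
    params = {
        "module": "API",
        "token_auth": "anonymous",
        "filter_limit": "-1",
        "method": "Actions.getPageUrls",
        "format": "csv",
        "flat": "1",
        "showColumns": ",".join(COLUMNS),
        "filter_pattern": pattern,
        "convertToUnicode": "0",
        "idSite": "1",
        "period": period,
        "date": date,
    }
    return BASE + "?" + urlencode(params)


def sum_columns(csv_text):
    """Sum each metric column over all rows of one CSV report."""
    totals = dict.fromkeys(COLUMNS, 0)
    # Assumption: the header row names the metric columns exactly as in COLUMNS.
    reader = csv.DictReader(io.StringIO(csv_text.lstrip("\ufeff")))
    for row in reader:
        for col in COLUMNS:
            totals[col] += int(row.get(col) or 0)
    return totals


def fetch(url):
    """Download one report as text (assumes a UTF-8 response)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def compare_year_to_months(year, fetcher=fetch):
    """Return (yearly totals, summed monthly totals, monthly minus yearly)."""
    yearly = sum_columns(fetcher(build_url("year", f"{year}-01-01")))
    monthly = dict.fromkeys(COLUMNS, 0)
    for month in range(1, 13):
        report = sum_columns(fetcher(build_url("month", f"{year}-{month:02d}-01")))
        for col in COLUMNS:
            monthly[col] += report[col]
    diff = {col: monthly[col] - yearly[col] for col in COLUMNS}
    return yearly, monthly, diff
```

Calling `compare_year_to_months(2020)` would issue 13 requests against the demo instance; a non-zero value in the returned difference corresponds to the mismatch described in this issue.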

In the given (and randomly chosen) example the values are:

Metric      Year        Sum of months   Difference
nb_visits   1.672.155   1.672.216       61
nb_hits     2.032.618   2.032.691       73

I admit the differences are quite small here, but there honestly shouldn't be any at all. Furthermore, for our own data it looks like this, which is really non-negligible:

Metric      Year     Sum of months   Difference
nb_visits   38.728   41.385          2.657
nb_hits     54.366   58.026          3.660
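
To compare the two cases, the excess of the monthly sum over the yearly value can be expressed relative to the yearly value. A small sketch using only the figures from the two tables above; the labels and the helper name `relative_excess` are just for illustration:

```python
# Figures copied from the two tables above: (yearly value, sum of months).
reports = {
    "demo nb_visits": (1_672_155, 1_672_216),
    "demo nb_hits": (2_032_618, 2_032_691),
    "own nb_visits": (38_728, 41_385),
    "own nb_hits": (54_366, 58_026),
}


def relative_excess(year_total, month_sum):
    """Excess of the monthly sum over the yearly value, as a fraction of the yearly value."""
    return (month_sum - year_total) / year_total


for name, (year_total, month_sum) in reports.items():
    excess = relative_excess(year_total, month_sum)
    print(f"{name}: +{month_sum - year_total} ({excess:.4%})")
```

For the demo data the excess is a few thousandths of a percent, while for our own data it is close to 7% of the yearly value.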

Context

We need total (unique) pageviews for a certain subset of URLs. Usually we report them on a yearly basis, but we now needed them for every month of a year and realized that the monthly values do not add up to the respective yearly value.

Your Environment

  • Matomo Version: 4.2.1
  • PHP Version: 7.3
  • Server Operating System: ubuntu18.04.1
  • Additionally installed plugins: HeatmapSessionRecording (v4.0.11)

@flamisz commented on April 29th 2021 Contributor

Hi @AJHoeh, thanks for creating the issue.

I can confirm those numbers; there really is a difference between the yearly report and the sum of the monthly reports (I even tried requesting the monthly reports with a date in the middle of the month, just to make sure this is not a weird issue at the month boundaries). Sorry about this.

We'll do our best to investigate what could cause this difference.

@tsteur commented on April 29th 2021 Member

@AJHoeh if you can, it may be worth invalidating the reports for the entire year and reprocessing them (see https://matomo.org/faq/how-to/faq_155/), then checking whether the numbers come out right afterwards. For example, you could run console commands like

./console core:invalidate-report-data --dates=2020-01-01,2020-12-31 --sites=1
# followed by
./console core:archive

Please note that this may take a long time to regenerate all reporting data and that if you have configured raw data deletion then you may not want to invalidate the reports as data could become unavailable if the raw data has been deleted already.

Generally, as part of Matomo 4 and also in the next 4.3.0 release we've made several improvements around these things that might have fixed it already.

Partially refs https://github.com/matomo-org/matomo/issues/10127 but there it's more about days vs other periods.

> We need total (unique) pageviews for a certain subset of urls on different websites.

BTW @AJHoeh not quite sure what you mean here by urls on different websites? Are those different websites (with different domains) tracked in one Matomo site or do you mean different sites in Matomo?

@AJHoeh commented on April 30th 2021

Hey @flamisz and @tsteur, thanks for the fast reply!

Just to not send you down the wrong road, it came to my mind that I didn't test whether the bug is also present for unfiltered data, so it is possible that this is a more general issue and has nothing to do with filtering.

@tsteur thanks for the advice. "Regularly delete old raw data" is not enabled, so I will try that next week when I am at work again. Is there some kind of command to check whether the respective raw data is still available? I am pretty cautious with this kind of stuff. I should probably back up the DB beforehand anyway...

> BTW @AJHoeh not quite sure what you mean here by urls on different websites? Are those different websites (with different domains) tracked in one Matomo site or do you mean different sites in Matomo?

That's just me being stupid and giving irrelevant information, which in the end is more confusing than helpful. It's about performing the described procedure for each of two sites in Matomo, which is our concrete use case but irrelevant for the issue itself. I'll edit out the "on different websites" part and elaborate more on the context to keep the issue as clear as possible.

@tsteur commented on May 2nd 2021 Member

> Is there some kind of command to check whether the respective raw data is still available

Unfortunately there isn't. If you have access to your database and are familiar with MySQL, you could check using the query below:

select idvisit from matomo_log_visit where visit_last_action_time < '2020-01-01 00:00:00' LIMIT 1

You might need to replace the table prefix matomo_ with something different like piwik_, or remove it, depending on your database configuration. If the query returns a result, then there is raw data for that date.

@heurteph-ei commented on May 3rd 2021

Hi all,
A little idea (not tested because I have no such data): maybe the difference arises when some visits happen around midnight (starting before and ending after). In that case, the visit belongs to two days, or even to two months at the end/beginning of a month...
The visit would then appear in month 1 and in month 2 but would be counted only once in the yearly total.
IDEA TO BE CHECKED!
Also, this could be related to issue #17516 (some visits / hits between 00:00 and 00:01 would then be counted twice)
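
The arithmetic behind this idea can be illustrated with a toy model. This is only a sketch of the hypothesis with made-up visits; it says nothing about how Matomo actually assigns visits to periods when archiving reports, and the helper `months_touched` is hypothetical.

```python
from datetime import datetime

# Made-up visits as (start, end); the second one crosses the Jan/Feb boundary.
visits = [
    (datetime(2020, 1, 15, 10, 0), datetime(2020, 1, 15, 10, 20)),
    (datetime(2020, 1, 31, 23, 50), datetime(2020, 2, 1, 0, 10)),
    (datetime(2020, 2, 10, 9, 0), datetime(2020, 2, 10, 9, 5)),
]


def months_touched(start, end):
    """Return the (year, month) pairs a visit overlaps (assumes visits shorter than a month)."""
    return {(start.year, start.month), (end.year, end.month)}


# Count every visit once in each month it touches.
monthly_counts = {}
for start, end in visits:
    for key in months_touched(start, end):
        monthly_counts[key] = monthly_counts.get(key, 0) + 1

sum_of_months = sum(monthly_counts.values())
yearly_count = len(visits)  # each visit counted once for the whole year
print(sum_of_months, yearly_count)
```

Under this toy counting rule the boundary-crossing visit is counted in both January and February, so the sum of the monthly counts exceeds the yearly count by one.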

@tsteur commented on May 3rd 2021 Member

#17516 is only for the visits log but not for reports in general (they use very different underlying code). I'm quite certain the visitors being counted twice is not an issue but at the same time you can never rule anything out. The way these reports are generated I very much doubt this is the case though here. Good bringing this up though

@AJHoeh commented on May 4th 2021

@tsteur Thanks for the guidance, I appreciate it. The data was there, so I invalidated the reports and manually archived the data again. Unfortunately, the result is exactly the same (actually that's probably a good thing, otherwise I wouldn't have a clue what's going on or whether I could trust any of our data anymore). If there is anything I can do or test to help find out what causes the bug, please let me know.

What @heurteph-ei said was my first thought as well; it only seems almost too simple... but sometimes that's exactly the reason why things are overlooked, so it's definitely worth looking into :)

@heurteph-ei commented on May 4th 2021

@AJHoeh Maybe you can try to find some visits through segmentation, using one or the other of the filters below:
[image]
and then check the visit logs of the result...
@tsteur, a feature is missing: date or numeric value comparisons in the segmentation (<, <=, >=, >). Is this planned for the future? Or should I create a new issue?

@tsteur commented on May 5th 2021 Member

@heurteph-ei there should be already a feature request for this 👍

@AJHoeh commented on May 5th 2021

Okay, so I conducted a few experiments:

  1. I requested the total pageviews via method=VisitsSummary.get and compared the yearly value with the sum of the monthly data - they matched
  2. I did the same thing as initially described in this issue but left out the filter parameter - the data matched
  3. I did the same thing as initially described in this issue but used period=range and date=2020-01-01,2020-12-31 instead of year - the returned value matched the yearly one (and was therefore not consistent with the sum of the monthly values)

So it seems this is really an issue related to the combined use of filters and some kind of sub-periods.
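
For reference, the request variants for experiments 2 and 3 can be derived from the original request roughly like this. It is a sketch against the demo URL from the issue description (the experiments themselves were run on our own instance), and the helper name `variant` is made up for illustration:

```python
from urllib.parse import urlencode

BASE = "https://demo.matomo.cloud/"

# Parameters of the original yearly request from the issue description.
original = {
    "module": "API",
    "token_auth": "anonymous",
    "filter_limit": "-1",
    "method": "Actions.getPageUrls",
    "format": "csv",
    "flat": "1",
    "showColumns": "nb_visits,nb_hits",
    "filter_pattern": "/diving",
    "convertToUnicode": "0",
    "idSite": "1",
    "period": "year",
    "date": "lastYear",
}


def variant(**changes):
    """Copy the original parameters and apply changes; a value of None drops the key."""
    params = {**original, **changes}
    return {key: value for key, value in params.items() if value is not None}


# Experiment 2: same request without the filter.
unfiltered = variant(filter_pattern=None)
# Experiment 3: same request as a date range covering the whole year.
as_range = variant(period="range", date="2020-01-01,2020-12-31")

for name, params in (("unfiltered", unfiltered), ("range", as_range)):
    print(name, BASE + "?" + urlencode(params))
```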

Thanks @heurteph-ei for the hint. I tried adding visitStartServerHour==23;visitServerHour==0 to my initial test requests for our website, but the returned monthly values summed up to only 29 nb_visits and 44 nb_hits, so I am unsure whether a larger window could explain the observed (~100-fold) difference.

I tried to request data for a given month and segment it by setting visitEndServerMonth to the following month, but it seems that is not supported for period=month, @tsteur?
However, when I set period=year, a few visits were also returned that actually happened at the very beginning of the month following visitEndServerMonth:
[image: border_case]

This seems really weird to me as well and is probably not intended?

@tsteur commented on May 10th 2021 Member

> it by setting visitEndServerMonth to the following month, but it seems that is not supported for period=month @tsteur?

If you select a month, then it would be expected to only return visits from the current month if the segment also is the number for the current month. Not sure it's clear what I mean?

You might be looking at the visitor log which is very different logic to regular reports and there is partially some logic to show some extra visits from very beginning of the following month but this is not the case for a report.
