Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent results from Reporting API when defining segments using pageTitle =^ (starts with) #17507

Closed
patrick-stickler-csc-fi opened this issue Apr 28, 2021 · 18 comments
Labels
Potential Bug Something that might be a bug, but needs validation and confirmation it can be reproduced.

Comments

@patrick-stickler-csc-fi
Copy link

patrick-stickler-csc-fi commented Apr 28, 2021

We are encountering unexplainable inconsistency in the Matomo API responses when segmenting data using pageTitle prefixes.

Please see the query examples below and their Matomo responses, and the subsequent questions.

Notes:

1 - We do not utilize the Matomo Javascript at all, but push all event data directly to the Matomo Tracking API using custom application code, whereby we explicitly specify the title using the action_name= URL parameter.

2 - None of the segments used in the queries below are pre-defined via the Matomo web UI and it is understood/presumed that it is not necessary to do so, that if the segment is not cached, it will be created in order to produce the appropriate query response.

3 - In the examples below, the query URLs have been sanitized and the JSON response from Matomo has been pruned to only show the most relevant fields, which provide the particular values needed by our application. All queries reference the same site ID.

Query examples and responses:

1 - Segment: pageTitle == "PRODUCTION / IDA / FILES / FREEZE / FILE":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%3DPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE%20%2F%20FILE" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":24,"nb_hits":53,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"}
]

2 - Segment: pageTitle =^ "PRODUCTION / IDA / FILES / FREEZE / FILE":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%5EPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE%20%2F%20FILE" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":26,"nb_hits":55,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"}
]

3 - Segment: pageTitle =^ "PRODUCTION / IDA / FILES / FREEZE /":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%5EPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE%20%2F" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":26,"nb_hits":55,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"},
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FOLDER","nb_visits":22,"nb_hits":30,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFOLDER"}
]

4 - Segment: pageTitle =^ "PRODUCTION / IDA / FILES / FREEZE ":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%5EPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE%20" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":22,"nb_hits":40,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"},
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FOLDER","nb_visits":20,"nb_hits":28,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFOLDER"}
]

5 - Segment: pageTitle =^ "PRODUCTION / IDA / FILES / FREEZE":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%5EPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":22,"nb_hits":40,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"},
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FOLDER","nb_visits":20,"nb_hits":28,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFOLDER"}
]

6 - Segment: pageTitle =^ "PRODUCTION / IDA / FILES / FREEZ":

curl -s -k "https://localhost/index.php?token_auth=*&format=json&date=2021-04-01&period=month&idSite=*&module=API&method=Actions.getPageTitles&segment=pageTitle%3D%5EPRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZ" 

[
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FILE","nb_visits":27,"nb_hits":56,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFILE"},
{"label":" PRODUCTION \/ IDA \/ FILES \/ FREEZE \/ FOLDER","nb_visits":22,"nb_hits":30,"segment":"pageTitle==PRODUCTION%2B%252F%2BIDA%2B%252F%2BFILES%2B%252F%2BFREEZE%2B%252F%2BFOLDER"}
]

Questions:

1 - Why does the first == (equals) query not produce the same results as the second =^ (starts with) query, where they are identical except for the pageTitle comparison operator? Surely it is always true that a string starts with itself.

2 - Why do the last 4 =^ (starts with) queries not report the same results, when they should match the same set of event page titles, since the event page titles all start with all of the variant prefix strings?

3 - Why are the pageTitle values in the JSON output double URL encoded? And why is there URL encoding at all, since they are valid JSON string value characters? I would expect the JSON output to be either:

"segment":"pageTitle==PRODUCTION%20%2F%20IDA%20%2F%20FILES%20%2F%20FREEZE%20%2F%20FILE"

or

"segment":"pageTitle==PRODUCTION / IDA / FILES / FREEZE / FILE"

Expected Behavior

Page titles are matched correctly and consistently per the specified =^ starts with prefix string.

Current Behavior

See above.

Your Environment

Matomo version: 4.2.1
MySQL version: 5.5.68-MariaDB
PHP version: 7.3.19
utf8mb4 used throughout

Active Plugins:

Actions (Core)
CustomJsTracker (Core)
Dashboard (Core)
DevicePlugins (Core)
DevicesDetection (Core)
Diagnostics (Core)
Events (Core)
Feedback (Core)
ForceSSL (v4.0.1)
GeoIp2 (Core)
Goals (Core)
Heartbeat (Core)
ImageGraph (Core)
Insights (Core)
Live (Core)
Login (Core)
Marketplace (Core)
Monolog (Core)
Overlay (Core)
PagePerformance (Core)
PrivacyManager (Core)
Resolution (Core)
ScheduledReports (Core)
SegmentEditor (Core)
Tour (Core)
Transitions (Core)
TwoFactorAuth (Core)
UserCountry (Core)
UserCountryMap (Core)
UserId (Core)
UserLanguage (Core)
VisitFrequency (Core)
VisitTime (Core)
VisitorInterest (Core)
VisitsSummary (Core)
Widgetize (Core)

@patrick-stickler-csc-fi patrick-stickler-csc-fi added the Potential Bug Something that might be a bug, but needs validation and confirmation it can be reproduced. label Apr 28, 2021
@diosmosis
Copy link
Member

diosmosis commented May 5, 2021

Hi @patrick-stickler-csc-fi, thanks for reporting this, it does seem odd.

Regarding questions 1&2:

I tried reproducing this on matomo.cloud and wasn't able to. So my guess right now is that the data for some or all of these segments might be out of date (there were some bugs in 4.2.1 around this logic). Can you try invalidating the segments for the date range and requesting them again? To do so you'll have to use the core:invalidate-report-data command (there is also an InvalidateReports plugin that lets you invalidate through the UI, but this will not let you supply a segment that wasn't created in the UI). The segment parameter has to be encoded properly to work, it should have the same exact encoding that is sent in the API request URL.

Note: you can also add the segments to the UI via the API using the SegmentEditor.add method (mentioning in case that is useful to you).

Regarding question 3:

Why are the pageTitle values in the JSON output double URL encoded? And why is there URL encoding at all, since they are valid JSON string value characters?

This is how Matomo allows operator characters to be placed in segment condition values. If for some reason, your page title had an = or ^ or @ in it, simply using pageTitle==my=pageTitle^with@symbols would not be parsed correctly. The value is encoded twice, because that was added to the code along time ago and hasn't been changed since. There's an issue for this here: #17050

@patrick-stickler-csc-fi
Copy link
Author

patrick-stickler-csc-fi commented May 6, 2021 via email

@patrick-stickler-csc-fi
Copy link
Author

@patrick-stickler-csc-fi
Copy link
Author

@diosmosis
Copy link
Member

Hi @patrick-stickler-csc-fi:

Is there a way to use the console:invalidate-report-data which invalidates all report data for all dates for all segments for a particular site ID?
Or do I need to execute console:invalidate-report-data for each and every segment that I have used in the past via the analytics API?

Unfortunately not at the moment. You could specify a large enough date range (see below in the next part of this reply), but currently there's no way to specify all segments. And if there were, it would only be possible to invalidate segments that are specifically created in the UI or via the SegmentEditor API, since otherwise they would not be stored in the segment table, and we wouldn't know what they are.

Also, when I use the —cascade option, with -vvv and specifying a —dates parameter that is just the year, I get no definitive indication that all days have been invalidated. The verbose output is confusing. Why only 4 dates listed:

/var/www/html/matomo/console core:invalidate-report-data --dates=2021 --sites=14 --cascade -vvv

The date in this command is incorrect, it should be a fully specified date in the YYYY-MM-DD format or a date range like YYYY-MM-DD,YYYY-MM-DD.

To invalidate everything within say 2019-2021 for a specific segment, you could use the command:

./console core:invalidate-report-data --dates=2019-01-01,today --sites=14 --segment=...

--cascade is for cascading downwards (ie, invalidating days in the week). The command automatically invalidates periods above (ie, the month containing a week).

Note: If you are using browser triggered archiving, you can also just delete the archives for a site, at which point the archives would be created again. Before simply doing that it might be useful to verify the reason the data is inaccurate is because of outdata not being rearchived, and not because of some other bug. Invalidating would be an easy way to do that. Or you could delete the archive data for a single day and single site and see if the result from the API changes. (Don't delete archive data if you have log purging enabled, as then you wouldn't be able to recompute the archive data).

To delete archive data manually:

  • identify the table the data belongs to. This is based on the start date of the period of the archive. So if it's for a week starting on April 29th, 2021, the data would be in the archive_*_2021_04 tables.
  • to delete data for a single period, run the queries:
DELETE FROM archive_blob_YYYY_MM WHERE idsite = <idSite> AND date1 = <start of date range> AND date2 = <end of date range> AND period = <period type integer id>
DELETE FROM archive_numeric_YYYY_MM WHERE idsite = <idSite> AND date1 = <start of date range> AND date2 = <end of date range> AND period = <period type integer id>

Where idSite is the site ID, date1 is the start date of the period, date2 the end date, and period the period ID (1 = day, 2 = week, 3 = month, 4 = year, 5 = range).

To delete data for an entire site, run the following queries (again I wouldn't do this until the actual cause of the issue is found, especially since I don't know exactly how your matomo is configured):

DELETE FROM archive_blob_YYYY_MM WHERE idsite = ?
DELETE FROM archive_numeric_YYYY_MM WHERE idsite = ?

@patrick-stickler-csc-fi
Copy link
Author

@patrick-stickler-csc-fi
Copy link
Author

@patrick-stickler-csc-fi
Copy link
Author

@sgiehl
Copy link
Member

sgiehl commented May 11, 2021

Is matomo.cloud running 4.2.1? The same latest version that is available for download? Or is it in any way different from the publically released 4.2.1 code base?

@patrick-stickler-csc-fi Matomo cloud is running 4.2.1, but we have already applied some of the PRs that are part of the upcoming 4.3.0

@diosmosis
Copy link
Member

@patrick-stickler-csc-fi more answers:

When an entirely new segment, neverbefore used, is specified in a query to the reporting API, is that segment, and its results in any way being stored by Matomo, for the optimization of future queries?

There are two types of data in Matomo, log data and archive data. Log data is the raw visit data, each action a user takes. The archive data is the result of aggregating that data; it's the report data cached in the archive tables.

So the answer is yes, that is one half of Matomo's core report generation mechanism.

Is it possible that segments defined via the dashboard/UI might be affecting direct API queries specifying those same segments?

No, it shouldn't be. The segment definitions are hashed via md5 and identified that way when looking in the DB, so it's possible a collision might cause two segments to have the same hash, but that is very unlikely.

What in Matomo might be causing the initial query to fail, when clearly the data exists in the database, and is found by other queries? Or am I still doing something incorrectly in my queries?

The result of this test confirms that at least one issue is that archives are not being rearchived. Since it's a custom segment archiving should be triggered, but instead of either showing the old data or showing different data, it shows nothing. There are quite a few fixes around this logic in 4.3, so it might be better to wait until 4.3 is released (which should be relatively soon).

To continue diagnosing, you could run the following query:

SELECT * FROM archive_numeric_2021_05 WHERE idsite = 14 AND date1 = '2021-05-03' and date2 = '2021-05-03' and period = 1 AND name LIKE 'done7fc20d1d1a85a360f07b17433450892f%';

This should provide archive status rows for the first query that is returning [] for you. Also, could you try running the same curl as before with the following query parameter: &segment=pageTitle%3D%3DPRODUCTION%252520%25252F%252520SSO%252520%25252F%252520LOGIN%252520%25252F%252520IDA?

Do you have any links to the issues or discussion pertaining to the known bugs you refer to? So I can review their nature, and what is being done to resolve them?

There were several fixes around the archiving logic in 4.3. These specific fixes are likely relevant to this problem:

@patrick-stickler-csc-fi
Copy link
Author

@diosmosis
Copy link
Member

@patrick-stickler-csc-fi 4.3.0 should be released relatively soon, it will hopefully solve these issues for you. This issue would be rather hard to debug through email so I think it'd be better to wait for now.

@patrick-stickler-csc-fi
Copy link
Author

@diosmosis
Copy link
Member

Hi @patrick-stickler-csc-fi, were you able to update to 4.3 and see if this fixed the issues you were experiencing?

@patrick-stickler-csc-fi
Copy link
Author

@diosmosis
Copy link
Member

@patrick-stickler-csc-fi only newly aggregated data. You may need to invalidate old data to initiate aggregation again.

@patrick-stickler-csc-fi
Copy link
Author

@mattab
Copy link
Member

mattab commented Dec 14, 2023

Thanks for contributing to this issue. As it has been a few months since the last activity and we believe this is likely not an issue anymore, we will now close this. If that's not the case, please do feel free to either reopen this issue or open a new one. We will gladly take a look again!

@mattab mattab closed this as not planned Won't fix, can't repro, duplicate, stale Dec 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Potential Bug Something that might be a bug, but needs validation and confirmation it can be reproduced.
Projects
None yet
Development

No branches or pull requests

4 participants