@anthosz opened this Issue on September 28th 2020

Hello,

I have an issue when using import_logs.py with "Only track visits and actions when the action URL starts with one of the above URLs." checked, once I use a URL like example.com/! (for a URL shortener). My goal is to create a website with a report for all URLs/pages starting with "/!*".

Example:
URL (also tried with https, and with a * at the end): http://example.com/!

Scenario 1 (does not work):
Enabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log:
example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" access.log
-> Nothing new in log_link_visit_action table

Scenario 2 (works):
Disabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log:
example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" access.log
-> New entry in log_link_visit_action table

Scenario 3 (works):
Disabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log:
example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" --hostname=example.com --include-path='/!*' access.log
-> New entry in log_link_visit_action table (so it works if I force the path in import_logs.py, but not in Matomo -> in this case I need to run import_logs.py several times)
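As a side note, the custom --log-format-regex used in all three scenarios can be sanity-checked in plain Python against the sample log line (a quick sketch; the group names host, ip, date, timezone, path, and status are the ones the command lines above define):

```python
import re

# Same pattern as passed to import_logs.py via --log-format-regex
pattern = re.compile(
    r'(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?)" (?P<status>\S+)'
)

line = 'example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200'
m = pattern.match(line)
print(m.group('host'))    # example.com
print(m.group('path'))    # /!abcd
print(m.group('status'))  # 200
```

This confirms the regex itself parses the line correctly, so the missing entry in Scenario 1 comes from the site-URL matching, not from the log format.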

In this case, my goal is to use a path separated not by a slash ("/") but by an exclamation mark ("!").

If you need more information, don't hesitate to ask.

Thank you!

@anthosz commented on September 28th 2020

It seems that indeed all separators are handled as slashes in https://github.com/matomo-org/matomo/blob/3.14.1/plugins/SitesManager/SiteUrls.php

I don't know if you have something like a patch to allow another separator?

@tsteur commented on September 28th 2020 Member

@anthosz If I understand correctly what you're after, you want to match only paths that start with /!*, whereas currently Matomo would only support matching URLs where the path is `/!/*`? Do I understand this right?

This is currently somewhat on purpose, if I understand things correctly, since Matomo has no way to differentiate which behaviour someone expects.
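To illustrate the two behaviours being discussed (a rough Python sketch, not Matomo's actual code): the current matching effectively treats the site URL as a slash-delimited prefix, while the request here is for a raw character prefix:

```python
def matches_slash_segments(site_url: str, action_url: str) -> bool:
    """Roughly the current behavior: http://example.com/! matches
    /!/something but not /!something."""
    return action_url == site_url or action_url.startswith(site_url.rstrip('/') + '/')

def matches_plain_prefix(site_url: str, action_url: str) -> bool:
    """The requested behavior: a raw character-prefix comparison,
    so http://example.com/! also matches /!abcd."""
    return action_url.startswith(site_url)

site = 'http://example.com/!'
print(matches_slash_segments(site, 'http://example.com/!abcd'))   # False
print(matches_plain_prefix(site, 'http://example.com/!abcd'))     # True
print(matches_slash_segments(site, 'http://example.com/!/abcd'))  # True
```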

@anthosz commented on September 28th 2020

@tsteur Yes, that's what I would like: the possibility to also take "/!*" into account.

@anthosz commented on September 28th 2020

A simple approach could be to check whether the imported URL (the URL from the log or the request) starts with the site URL as-is (instead of forcing a trailing slash) -> then use this website.
Also, add an option so that this behavior is disabled by default (no impact on existing instances) and can be enabled on demand.

As a bonus, regex could be allowed in the site URL (not related to this issue, but it could be useful if someone wants to use another separator, like "/(!|&)") ^^
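The regex idea could look something like this (a hypothetical sketch; Matomo does not support regex site URLs, and the pattern below only illustrates the "/(!|&)" separator example):

```python
import re

# Hypothetical: treat the configured site URL as a regex prefix,
# so "/(!|&)" accepts either separator after the domain.
site_pattern = re.compile(r'^http://example\.com/(!|&)')

print(bool(site_pattern.match('http://example.com/!abcd')))  # True
print(bool(site_pattern.match('http://example.com/&xyz')))   # True
print(bool(site_pattern.match('http://example.com/abcd')))   # False
```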

@tsteur commented on September 28th 2020 Member

Thanks @anthosz, I've updated the title to make it a bit clearer for us. Generally we would likely only be able to support simple wildcards like * (if that's even possible), as we might sometimes be using the site URLs for other purposes as well. To be checked.

Do I see this right that, in your case, it might already help if the include-path parameter in the log importer supported this (e.g. include-path='/!*')?

@anthosz commented on September 29th 2020

@tsteur Yes and no. It currently seems to work if we also specify the site ID, but the issue is that in that case we need to execute import_logs.py multiple times, and that is slow (especially when we have more than 10 million log lines to parse and multiple websites).

Powered by GitHub Issue Mirror