Pages might have multiple URLs, especially if a site has multiple domains.
The basic cases are
Why this is not enough
For upcoming features, we need data based on URLs. One URL might have multiple idactions.
WHERE idaction IN ( ... ) works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can use
idaction_url_ref IN ( ... ) but you cannot recognize aliases in the result (column idaction) on the DB level.
I think this is an acceptable edge case. From prior discussion, SEO would discourage the use of alias domains.
We need to focus on performance, because this feature will be potentially very slow. We can't afford adding a new field to log_action to deal with this edge case. To maximise performance, solution 4) is the way to go. It is not a big drawback IMO because most websites do not use alias.
vote for wontfix/worksforme/ acceptable edge case.
Note: a relevant new FAQ to put up to help with this issue is to explain how to track the custom URL to be the canonical URL (if websites generate it, most CMS/apps do these days). See #2974 for code example
See also #2805 (Purge logs from log_action) - could this be done if we happen to change URL storage algorithm?
After discussing further with timo, we came to the conclusion that storing protocol + hostname for Page URLS (type==1) in log_action table is not currently useful very much.
Therefore, it makes sense to consider not storing the protocol+hostname in the future.
TODO: New FAQ to explain how to do alias domain URL segmentation.
Because some users might use pageUrl, exitPageUrl, entryPageUrl segmentation to test various aliases performance, we should document, before releasing this change, how to do Domain segmentation via Custom Vars.
Typically, user could record the hostname in a custom variable of scope "page". Then it could segment the reports using
This will first normalize the pageUrl to "page.html" then select the idaction, then segment further only page views that had a custom variable "hostname" set to "example.org".
This would achieve the current behavior of
The benefits of this change far outweight the downsides. The very slow upgrade will be a challenge, but a necessary one if we want to keep innovating :)
After some discussion via email, here's an updated version of the ticket description.
Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.
I think that's OK for V2 but we don't need the switch for V1, as this is an edge case and I think will not be useful for many people.
Hopefully you can manage to do the update without stored functions indeed, as I'm pretty sure many users will not be allowed to create the functions on shared hosts.
Because now the normalization is simple, maybe we can do something like:
select <a class='mention' href='https://github.com/prefix_type'>@prefix_type</a> := ( case when LEFT(name, 10) = 'http://www' then 'http://www' when LEFT(name, 7) = 'http://' then 'http://' case when LEFT(name, 11) = 'https://www' then 'https://www' when LEFT(name, 8) = 'https://' then 'https://' else '' end), name, IF(<a class='mention' href='https://github.com/prefix_type'>@prefix_type</a> <> '', REPLACE(name,<a class='mention' href='https://github.com/prefix_type'>@prefix_type</a>,''), name ) as nameafter from piwik_log_action where type=1 #AND idaction > 730641
Trac won't let me add files with 1K lines. So here's the patch in gist: https://gist.github.com/2157026 . It consists of two commits I have in a git branch. I hope you can manage to apply it to your SVN working copy.
Please review the patch carefully.
Patch looks good, thanks. +1 for the tests!
I didn't apply it but I generally don't apply patches anyway, just read them.
The only potential problem I can think of is the inserting of NULL values, which I remember has some issues with Mysqli. But, if that's the case, the beauty is that tests will fail when commmitted on jenkins, so we shall see.
(In ) refs #2976 url normalization: store protocol and www in the url_prefix column of log_action. treat pages with different protocol or with/without www as the same action. includes a major db transformation and tests.
It would be good if as many eyes as possible could review this update since it's quite critical.
Looks like I broke the build...
/home/www/data/root/jenkins.private/jobs/Piwik/workspace/build/plugins/PrivacyManager/tests/PrivacyManager.test.php -> Test_Piwik_PrivacyManager -> test_purgeData_deleteReportsKeepRangeReports -> Unexpected exception of type [with message [SQLSTATE42000: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' ,
, , 1000)' at line 4
What is this trying to tell me??
This particular test case doesn't complete on my machine at all.
(In ) Adding quick tip about disabling maintenance and re-enable tracker refs #2976
Great code Timo!
It will be useful to have this data cleaned up and simplified int he DB...
This wasnt an easy upgrade script to write, you did really well
Not much to add, good stuff. Marking as fixed
(In ) No duplicate code + Testing for entryPageUrl/exitPageUrl Refs #2976