Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve storage of URLs, normalization at DB Level #2976

Closed
timo-bes opened this issue Feb 26, 2012 · 21 comments
Closed

Improve storage of URLs, normalization at DB Level #2976

timo-bes opened this issue Feb 26, 2012 · 21 comments
Assignees
Labels
Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Milestone

Comments

@timo-bes
Copy link
Member

Problem

Pages might have multiple URLs, especially if a site has multiple domains.

The basic cases are

  1. Alias domains
  2. With and without www
  3. http and https

Current solution

  • For the Actions report, normalization is done implicitly in Piwik_Actions::getActionExplodedNames. The domain is removed, which takes care of (2) and (3). It also handles (1) correctly if the domains are real aliases, but it fails if the aliases are domain1.com/ and domain2.com/something/.
  • For the segement pageUrl, one can use =@ (i.e. contains) and omit domain and protocol.

Why this is not enough

For upcoming features, we need data based on URLs. One URL might have multiple idactions.

Using WHERE idaction IN ( ... ) works in some cases but not always. E.g. if you want the pages visitors viewed directly after a certain page you can use idaction_url_ref IN ( ... ) but you cannot recognize aliases in the result (column idaction) on the DB level.

Proposed solution

  1. Do some normalization before writing to log_action. We would not record protocol and www for page URLs ie. type == TYPE_ACTION_URL == 1.
  2. Add the information about protocol and www as a TINYINT to log_action. 1 = http://, 2 = http://www, 3 = https://, 4 = https://www. This information is added when new actions are inserted and never modified. It can be used to reconstruct the full URL. When actions should be tracked with the same URL but different TINYINT, use the existing idaction.
  3. Analysis can be done without worrying about problem (2) and (3).
  4. Problem (4) - alias domains - is acceptable. Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.
  5. We need to upgrade existing databases. Transform log_action, find duplicates, change foreign keys to duplicates in other tables, remove duplicates. Ideally, this is done without using stored procedures / functions and without data shipping to PHP.
@robocoder
Copy link
Contributor

s/remote/remove/

@robocoder
Copy link
Contributor

I think this is an acceptable edge case. From prior discussion, SEO would discourage the use of alias domains.

@mattab
Copy link
Member

mattab commented Feb 26, 2012

We need to focus on performance, because this feature will be potentially very slow. We can't afford adding a new field to log_action to deal with this edge case. To maximise performance, solution 4) is the way to go. It is not a big drawback IMO because most websites do not use alias.

vote for wontfix/worksforme/ acceptable edge case.

@mattab
Copy link
Member

mattab commented Feb 26, 2012

Note: a relevant new FAQ to put up to help with this issue is to explain how to track the custom URL to be the canonical URL (if websites generate it, most CMS/apps do these days). See #2974 for code example

@mattab
Copy link
Member

mattab commented Feb 27, 2012

See also #2805 (Purge logs from log_action) - could this be done if we happen to change URL storage algorithm?

@mattab
Copy link
Member

mattab commented Feb 27, 2012

After discussing further with timo, we came to the conclusion that storing protocol + hostname for Page URLS (type==1) in log_action table is not currently useful very much.

  • Storing hostname is only used when using the segments pageUrl, exitPageUrl, entryPageUrl.
  • However storing hostname for all rows where type==1 in log_action causes extra disk usage and prevents us from writing efficient and user friendly new features

Therefore, it makes sense to consider not storing the protocol+hostname in the future.
Proposed solution

  • Tracking: We would not record protocol+hostname for Page URLs ie. type== TYPE_ACTION_URL ==1
    • log_action rows for type TYPE_OUTLINK = 2; TYPE_DOWNLOAD = 3; TYPE_ACTION_NAME = 4; would be unchanged and still have full URL
  • Upgrade: The upgrade procedure would be quite slow. Timo is working on a test upgrade that we will try on the demo. Definitely high traffic piwik users will have to stop tracking and disable UI during the upgrade.
    • Post upgrade: the log_action table would be smaller (protocol+host stripped out)
    • Future developments will not have to worry about URL normalization. It will also improve user friendliness since URLs will be automatically aggregated.
  • Archiving: We would not put the hostname in the Actions Page URLs datatables.
  • API: The hostname would be prepended in the metadata 'url' column in the Actions.getPageUrls API call
    • This change to Archiving+API would result in smaller future Actions Datatable (not a significant improvement but always nice)

TODO: New FAQ to explain how to do alias domain URL segmentation.

  • Because some users might use pageUrl, exitPageUrl, entryPageUrl segmentation to test various aliases performance, we should document, before releasing this change, how to do Domain segmentation via Custom Vars.
    Typically, user could record the hostname in a custom variable of scope "page". Then it could segment the reports using &segment=customVariablePageName1==example.org;pageUrl=http://this-domain-is-not-used/page.html

    This will first normalize the pageUrl to "page.html" then select the idaction, then segment further only page views that had a custom variable "hostname" set to "example.org".

    This would achieve the current behavior of &segment=pageUrl=http://example.org/page.html.

  • document in the segmentation page the fact that segment pageUrl, entryPageUrl, exitPageUrl do not use the hostname.

    • This is the main downside of this change: that segments implementation is not really what their name mean, because only the path+querystring from the specified segment valueswill be used (rather than the whole URL).

The benefits of this change far outweight the downsides. The very slow upgrade will be a challenge, but a necessary one if we want to keep innovating :)

@timo-bes
Copy link
Member Author

timo-bes commented Mar 6, 2012

After some discussion via email, here's an updated version of the ticket description.

@mattab
Copy link
Member

mattab commented Mar 6, 2012

Maybe we can add a switch to the site configuration where users can configure Piwik to treat the domains as aliases. If set, the alias domains will be replaced with the main domain when tracking. By default, this option is off.

I think that's OK for V2 but we don't need the switch for V1, as this is an edge case and I think will not be useful for many people.

Update script
Hopefully you can manage to do the update without stored functions indeed, as I'm pretty sure many users will not be allowed to create the functions on shared hosts.

Because now the normalization is simple, maybe we can do something like:

select @prefix_type := (
case when LEFT(name, 10) = 'http://www' then 'http://www' 
when LEFT(name, 7) = 'http://' then 'http://' 
case when LEFT(name, 11) = 'https://www' then 'https://www'
when LEFT(name, 8) = 'https://' then 'https://' 
else '' 
end), 
name, 
IF(@prefix_type <> '', REPLACE(name,@prefix_type,''), name ) as nameafter
from piwik_log_action
where type=1
#AND idaction > 730641 

@timo-bes
Copy link
Member Author

Trac won't let me add files with 1K lines. So here's the patch in gist: https://gist.github.com/2157026 . It consists of two commits I have in a git branch. I hope you can manage to apply it to your SVN working copy.

  • All the integration tests pass and I added a new test suite for the normalization.
  • I called the update 1.7.3-b1, assuming that we release 1.7.2 before we do this. But the version can be changed when I commit to trunk.

Please review the patch carefully.

@mattab
Copy link
Member

mattab commented Mar 23, 2012

Patch looks good, thanks. +1 for the tests!
I didn't apply it but I generally don't apply patches anyway, just read them.

The only potential problem I can think of is the inserting of NULL values, which I remember has some issues with Mysqli. But, if that's the case, the beauty is that tests will fail when commmitted on jenkins, so we shall see.

great stuff!

@timo-bes
Copy link
Member Author

(In [6790]) refs #2976 updates can be marked a major

@timo-bes
Copy link
Member Author

(In [6791]) refs #2976 fixing outdated doc of Piwik_Config

@timo-bes
Copy link
Member Author

(In [6792]) refs #2976 url normalization: store protocol and www in the url_prefix column of log_action. treat pages with different protocol or with/without www as the same action. includes a major db transformation and tests.

@timo-bes
Copy link
Member Author

It would be good if as many eyes as possible could review this update since it's quite critical.

@timo-bes
Copy link
Member Author

(In [6794]) refs #2976 svn properties

@timo-bes
Copy link
Member Author

Looks like I broke the build...

/home/www/data/root/jenkins.private/jobs/Piwik/workspace/build/plugins/PrivacyManager/tests/PrivacyManager.test.php -> Test_Piwik_PrivacyManager -> test_purgeData_deleteReportsKeepRangeReports -> Unexpected exception of type [with message [SQLSTATE42000: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ' ,
, , 1000)' at line 4

What is this trying to tell me??

This particular test case doesn't complete on my machine at all.

@sgiehl
Copy link
Member

sgiehl commented Aug 17, 2012

(In [6802]) refs #2976 fixed privacymanager tests

@mattab
Copy link
Member

mattab commented Aug 17, 2012

(In [6808]) Adding quick tip about disabling maintenance and re-enable tracker refs #2976

@mattab
Copy link
Member

mattab commented Aug 17, 2012

Great code Timo!

It will be useful to have this data cleaned up and simplified int he DB...
This wasnt an easy upgrade script to write, you did really well

  • Tests look good, testing both the segmentation, and that the log_action is correctly built, and case sensitivity..

Not much to add, good stuff. Marking as fixed

@mattab
Copy link
Member

mattab commented Aug 17, 2012

(In [6814]) No duplicate code + Testing for entryPageUrl/exitPageUrl Refs #2976

@mattab
Copy link
Member

mattab commented Sep 19, 2012

in [7017] Refs #2976 Removing comments from the SQL as, when we display SQL, it gets all inlined and #Xxxx comments gets inlined causing SQL to fail. Reported in forums eg. http://forum.piwik.org/read.php?2,93934

@timo-bes timo-bes added this to the 1.12.x - Piwik 1.12.x milestone Jul 8, 2014
@timo-bes timo-bes self-assigned this Jul 8, 2014
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Major Indicates the severity or impact or benefit of an issue is much higher than normal but not critical. Task Indicates an issue is neither a feature nor a bug and it's purely a "technical" change.
Projects
None yet
Development

No branches or pull requests

4 participants