Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal spec for schema and code updates for logging pageviews, downloads, outlinks #708

Closed
mattab opened this issue May 13, 2009 · 15 comments
Assignees
Labels
Critical Indicates the severity of an issue is very critical and the issue has a very high priority. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Milestone

Comments

@mattab
Copy link
Member

mattab commented May 13, 2009

Proposal spec for the new Actions schema

There are various tickets related to improvements in the "Actions" reporting in Piwik (pages, outlinks, downloads). Some of these require database schema updates to accodomate the new funtionnality. Here is a proposal spec. This spec was written with the following tickets in mind: #306, #530, #556, #707

Requirements

We want to report:

We do not consider an event tracking feature (see #472) as it is out of scope for Piwik 1.0, and would anyway require a different more complicated and modular data structure.

This change is also welcome at this time considering the new Javascript API that has the following related methods:

piwik.setDocumentTitle( customTitle )
piwik.trackPageView()
piwik.trackLink( url, type, customVars)

Users can customize the document title for a given pageview. This would only customize the "Top pages by title report" and would not affect the "top pages URL" and "top pages flattened URLs" reports.

Implications on Tracking (piwik.php, core/Tracker/*)

There is a currently a performance issue when querying piwik_log_action that we hope to resolve with this schema update. Indeed, when querying this table for a given URL, eg. /test/test2/test3/ if most entries in this table have a similar structure, because the index on name is limited to a few characters (15) the index lookup badly fails and we end up doing a full table scan at each page view.
To fix this issue we are going to add a new column in this table, "hash", that will be the hash of the "name" column, using the fast CRC32 algorithm. This will speed up the SELECT idaction in piwik.php.

We also now need to save two information per page view: the name of the page, and the URL.
For clicks and downloads, we will by default only save the URL, but plugins could override this and also save a name, to allow advanced reporting on downloads and outlinks.
We want to keep things simple and will save all names and all URLs in this same table, piwik_log_action.

Schema updates for piwik_log_action

Current schema:

  idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  type TINYINT UNSIGNED NULL,
  PRIMARY KEY(idaction),
  INDEX index_type_name (type, name(15))

Proposed schema:

  idaction INTEGER(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  hash UNSIGNED INT NOT NULL,
  name VARCHAR(255) NOT NULL,
  type TINYINT UNSIGNED NULL,
  PRIMARY KEY(idaction),
  INDEX index_type_name (type, hash)

Note: this is explained in the book "high performance mysql 2nd edition" page 104;

Basically instead of

SELECT WHERE name = '/path1/path2/path3' 

you do

SELECT WHERE hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3'

you need the AND name = '/path1/path2/path3' in case two urls have the same hash (which happens with 1% probability after 100k results, cf birthday paradox).

CRC32 doc is on http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_crc32.

The hash field should be set when INSERTing new page names/ urls in this table, using the mysql CRC32 function; we don't want to use the php crc at any time.

Currently Actions->getIdAction returns the idaction for the given name. Instead, the function would now set actionIdName and actionIdUrl in the object. For downloads and outlinks, actionIdName would be empty;

In the case of pages, we would like to select these two IDs in only one SQL select for efficiency. For example, it would look like:

SELECT idaction 
FROM piwik_log_action 
WHERE 

    (hash = CRC32('/path1/path2/path3') AND name = '/path1/path2/path3' AND type = $this->getActionType())
  OR 
    (hash = CRC32('Welcome to URL') AND name = 'Welcome to URL' AND type = Piwik_Tracker_Action_Interface::TYPE_ACTION_NAME)

For downloads/outlinks there wouldn't be the part after the OR, as by default we only track URLs for downloads/outlinks;

Eg.


SELECT idaction 
FROM piwik_log_action 
WHERE (hash = CRC32('http://piwik.org/latest.zip') AND name = 'http://piwik.org/latest.zip' AND type = $this->getActionType()

Schema update for piwik_log_link_visit_action

  `idlink_va` int(11) NOT NULL auto_increment,
  `idvisit` int(10) unsigned NOT NULL,
  `idaction` ---> renamed in idaction_url
  NEW idaction_name 
  `idaction_ref` --> renamed in idaction_url_ref
  `time_spent_ref_action` int(10) unsigned NOT NULL,
  PRIMARY KEY  (`idlink_va`),
  KEY `visit` (`idvisit`)

For pages, we would record both idaction_name and idaction_url fetched from piwik_log_action as specified above.
For download and outlinks, we would only set idaction_url, and idaction_name would be null. Plugins could set this value if necessary

Action->record() will have to be updated accordingly to now record these two idaction_*;

Note: Action object already has a getActionName and getActionUrl method which matches nicely; we would maybe add a getActionNameId and getActionUrlId that would be set by the current getActionId (that could be rename loadActionNameAndUrl)?

Schema update for piwik_log_visit

visit_exit_idaction is renamed in visit_exit_idaction_url
visit_entry_idaction is renamed in visit_entry_idaction_url

Several lines have to be updated in the code to reflect this rename but there are no logic changes.

Schema update for piwik_log_conversion

idaction is renamed in idaction_url

GoalManager->recordGoals() query must be updated accordingly.

As shown, all existing idaction reference the entry in piwik_log_action for the URL of the concerned pageview; the URL is now officially the main information about a pageview, and the "name" (eg. the html document title) is one more piece of information about each page view. However conversions, and visits entry/exit pages, only care about the pageview URL, mostly for simplicity.

Note: I would expect to have more small modifications not explained here, in the php code of Tracker/*, especially Tracker/Action.php

Implications on Archiving (plugins/Actions/Actions.php)

Changes in the schema should only affect archiving for plugins/Actions/Actions.php.

To do #530, the first query at line 124 would have to ran twice, once joining on idaction_url for processing reports by page title (existing query), and one more by joining on idaction_name to process the new report "top pages by title".

Other queries would have to be slightly updated to reflect the field's name change.

User Interface

To do #530 we would add a new submenu: Actions> Page titles to show best pages by title, similar to existing Actions>Pages.
There would be a new API method that would return the top pages by title.

Migration

There are several required schema updates which may take minutes to execute on the piwik_log_* tables; we might want to update the message before the update to make clear that this will take a while and user should be patient.

The schema updates must be done in the CREATE TABLE statements in core/Piwik.php, in all queries querying these tables (mentionned above) and the update script can be written in core/Updates/X.php

Other

  • The data generator in core/Tracker/Generator and misc/generateVisits.php will have to be updated to reflect the new DB URL structure and ensure that the generator generates all the title + URL data as expected.
  • Also this new schema change will invalidate the TrackerSecondaryDb which will have to be slightly updated to reflect the new structure.
  • Some of this code is covered by unit tests which will have to be updated (tests/core/Tracker/Action) with new logic

Conclusion

This is not a small change, but we are blocked on several fronts because of the current DB schema; implementing this change will make a lot of other tickets possible and add new features in Piwik in the near future. Also, it makes possible for plugins to custom name outlinks and downloads.

Any questions, feedback, ideas, please let me know: this is a proposal!

@anonymous-matomo-user
Copy link

Seems a good proposal to me! I have a note on saving the page Title.

I've been playing around with Piwik to make it log both Title and url. In PIWIK 0.4 it is possible to log the Title using the following function in the trackingcode:

piwikTracker.setDocumentTitle(document.title);

But in the (utf-8) database this results in problems with both utf-8 encoding and html_entity encoding. I had to put in this code to resolve this problem:

if($actionType == 1){
   $actionName = utf8_decode($actionName);   
   $actionName = html_entity_decode($actionName);
}

I think this is something that needs to be addressed when implementing this proposal.

@mattab
Copy link
Member Author

mattab commented Jun 19, 2009

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

@anonymous-matomo-user
Copy link

Replying to matt:

rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings?

I don't know if the problem relates to utf7 strings.

As far as I remember these steps:

  1. add
piwikTracker.setDocumentTitle(document.title);

in trackingcode.

  1. Look in de database (I use phpMyAdmin) in the table "log_action" and I see a lot of messed up data in the "name" column. But the data only gets messed up for pages with special characters. Characters like "", "" etc. But also special characters in HTML like "'", "&" etc.

does this help?

@mattab
Copy link
Member Author

mattab commented Jun 23, 2009

do you then have wrong data in the "Actions" report as well?

@robocoder
Copy link
Contributor

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);

@mattab
Copy link
Member Author

mattab commented Jun 24, 2009

Replying to vipsoft:

In this forum post, r.guggeis suggests changing the index on the log_action table:

ALTER TABLE `log_action` DROP INDEX `index_type_name`
ALTER TABLE `log_action` ADD INDEX `name` (`name`);

yes this would make sense for the current schema, but the proposal here is an even better solution (we index the hash of the name as opposed to indexing the name which has an unknown length - this could be very costly for websites having very long names)

@zawadzinski
Copy link
Contributor

new version of patch (compatible with current release of Piwik 0.4.3) added.

@zawadzinski
Copy link
Contributor

Attachment:
708.patch

@zawadzinski
Copy link
Contributor

Current patch in the attachment. One important note:

Page Titles are tracked only when user sets in Javascript action_name. If action_name is not given, given page view is not tracked in Page Titles view. As long as user does not set action_name Piwik behaves the same way as before (view Actions->Pages display action URL's).

Backward incompability occurs only when user used action_name's because they will appear in Actions->Pages, and after the update they will go into Actions->Page titles (data in the database won't be updated because we do not have information about which Action was URL and which was not). This is important to note that users have to be aware of before updating to 0.5.

@robocoder
Copy link
Contributor

Attachment: revised update script
0.5.php

@mattab
Copy link
Member Author

mattab commented Oct 23, 2009

see commit in [1530]

TODO before closing the ticket:

  • Anthon: apply small modifs to the JS tracker (setCustomUrl() and action_name defaulting to document.title)
  • Matt: improve/fix unit tests around the expected behavior of Action naming
  • Maciej/Matt: load testing to ensure this is working as fast as we expect under heavy load, with a huge log_action table

@robocoder
Copy link
Contributor

In [1531]

@mattab
Copy link
Member Author

mattab commented Nov 2, 2009

marking as fixed for now - thanks all for your work

@robocoder
Copy link
Contributor

what about the two unit tests failing in tests/core/Tracker/Action.test.php (empty request) and plugins/Actions/tests/Actions.test.php ?

@robocoder
Copy link
Contributor

[1550], refs #708

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Critical Indicates the severity of an issue is very critical and the issue has a very high priority. Enhancement For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
Projects
None yet
Development

No branches or pull requests

4 participants