New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal spec for schema and code updates for logging pageviews, downloads, outlinks #708
Comments
Seems a good proposal to me! I have a note on saving the page Title. I've been playing around with Piwik to make it log both Title and url. In PIWIK 0.4 it is possible to log the Title using the following function in the trackingcode:
But in the (utf-8) database this results in problems with both utf-8 encoding and html_entity encoding. I had to put in this code to resolve this problem:
I think this is something that needs to be addressed when implementing this proposal. |
rc_amsterdam, can you please submit steps to reproduce the error? what exactly happens if you submit utf7 strings? |
Replying to matt:
I don't know if the problem relates to utf7 strings. As far as I remember these steps:
in trackingcode.
does this help? |
do you then have wrong data in the "Actions" report as well? |
In this forum post, r.guggeis suggests changing the index on the log_action table:
|
Replying to vipsoft:
yes this would make sense for the current schema, but the proposal here is an even better solution (we index the hash of the name as opposed to indexing the name which has an unknown length - this could be very costly for websites having very long names) |
new version of patch (compatible with current release of Piwik 0.4.3) added. |
Attachment: |
Current patch in the attachment. One important note: Page Titles are tracked only when user sets in Javascript action_name. If action_name is not given, given page view is not tracked in Page Titles view. As long as user does not set action_name Piwik behaves the same way as before (view Actions->Pages display action URL's). Backward incompability occurs only when user used action_name's because they will appear in Actions->Pages, and after the update they will go into Actions->Page titles (data in the database won't be updated because we do not have information about which Action was URL and which was not). This is important to note that users have to be aware of before updating to 0.5. |
Attachment: revised update script |
see commit in [1530] TODO before closing the ticket:
|
In [1531]
|
marking as fixed for now - thanks all for your work |
what about the two unit tests failing in tests/core/Tracker/Action.test.php (empty request) and plugins/Actions/tests/Actions.test.php ? |
[1550], refs #708 |
Proposal spec for the new Actions schema
There are various tickets related to improvements in the "Actions" reporting in Piwik (pages, outlinks, downloads). Some of these require database schema updates to accodomate the new funtionnality. Here is a proposal spec. This spec was written with the following tickets in mind: #306, #530, #556, #707
Requirements
We want to report:
In the top pages by title, titles don't link to the URLs by default, because it is a lot of computation to keep the relationship page title<> (URL1, URL2, ..) as a same title can have many URLs.
Also, we don't report best entry page, top exit page, or time per page for each title. These analysis are only done for the URLs, to minimize overhead during archiving. (see Report actions by html page title as well as reporting by URL (or custom page name) #530)
However we now need two different information about a pageview: URL, and a Name.
We do not consider an event tracking feature (see #472) as it is out of scope for Piwik 1.0, and would anyway require a different more complicated and modular data structure.
This change is also welcome at this time considering the new Javascript API that has the following related methods:
Users can customize the document title for a given pageview. This would only customize the "Top pages by title report" and would not affect the "top pages URL" and "top pages flattened URLs" reports.
Implications on Tracking (piwik.php, core/Tracker/*)
There is a currently a performance issue when querying piwik_log_action that we hope to resolve with this schema update. Indeed, when querying this table for a given URL, eg.
/test/test2/test3/
if most entries in this table have a similar structure, because the index on name is limited to a few characters (15) the index lookup badly fails and we end up doing a full table scan at each page view.To fix this issue we are going to add a new column in this table, "hash", that will be the hash of the "name" column, using the fast CRC32 algorithm. This will speed up the SELECT idaction in piwik.php.
We also now need to save two information per page view: the name of the page, and the URL.
For clicks and downloads, we will by default only save the URL, but plugins could override this and also save a name, to allow advanced reporting on downloads and outlinks.
We want to keep things simple and will save all names and all URLs in this same table, piwik_log_action.
Schema updates for piwik_log_action
Current schema:
Proposed schema:
Note: this is explained in the book "high performance mysql 2nd edition" page 104;
Basically instead of
you do
you need the
AND name = '/path1/path2/path3'
in case two urls have the same hash (which happens with 1% probability after 100k results, cf birthday paradox).CRC32 doc is on http://dev.mysql.com/doc/refman/5.1/en/mathematical-functions.html#function_crc32.
The hash field should be set when INSERTing new page names/ urls in this table, using the mysql CRC32 function; we don't want to use the php crc at any time.
Currently Actions->getIdAction returns the idaction for the given name. Instead, the function would now set actionIdName and actionIdUrl in the object. For downloads and outlinks, actionIdName would be empty;
In the case of pages, we would like to select these two IDs in only one SQL select for efficiency. For example, it would look like:
For downloads/outlinks there wouldn't be the part after the OR, as by default we only track URLs for downloads/outlinks;
Eg.
Schema update for piwik_log_link_visit_action
For pages, we would record both idaction_name and idaction_url fetched from piwik_log_action as specified above.
For download and outlinks, we would only set idaction_url, and idaction_name would be null. Plugins could set this value if necessary
Action->record() will have to be updated accordingly to now record these two idaction_*;
Note: Action object already has a getActionName and getActionUrl method which matches nicely; we would maybe add a getActionNameId and getActionUrlId that would be set by the current getActionId (that could be rename loadActionNameAndUrl)?
Schema update for piwik_log_visit
visit_exit_idaction is renamed in visit_exit_idaction_url
visit_entry_idaction is renamed in visit_entry_idaction_url
Several lines have to be updated in the code to reflect this rename but there are no logic changes.
Schema update for piwik_log_conversion
idaction is renamed in idaction_url
GoalManager->recordGoals() query must be updated accordingly.
As shown, all existing idaction reference the entry in piwik_log_action for the URL of the concerned pageview; the URL is now officially the main information about a pageview, and the "name" (eg. the html document title) is one more piece of information about each page view. However conversions, and visits entry/exit pages, only care about the pageview URL, mostly for simplicity.
Note: I would expect to have more small modifications not explained here, in the php code of Tracker/*, especially Tracker/Action.php
Implications on Archiving (plugins/Actions/Actions.php)
Changes in the schema should only affect archiving for plugins/Actions/Actions.php.
To do #530, the first query at line 124 would have to ran twice, once joining on idaction_url for processing reports by page title (existing query), and one more by joining on idaction_name to process the new report "top pages by title".
Other queries would have to be slightly updated to reflect the field's name change.
User Interface
To do #530 we would add a new submenu: Actions> Page titles to show best pages by title, similar to existing Actions>Pages.
There would be a new API method that would return the top pages by title.
Migration
There are several required schema updates which may take minutes to execute on the piwik_log_* tables; we might want to update the message before the update to make clear that this will take a while and user should be patient.
The schema updates must be done in the CREATE TABLE statements in core/Piwik.php, in all queries querying these tables (mentionned above) and the update script can be written in core/Updates/X.php
Other
Conclusion
This is not a small change, but we are blocked on several fronts because of the current DB schema; implementing this change will make a lot of other tickets possible and add new features in Piwik in the near future. Also, it makes possible for plugins to custom name outlinks and downloads.
Any questions, feedback, ideas, please let me know: this is a proposal!
The text was updated successfully, but these errors were encountered: