Currently Piwik uses several third-party cookies. We want Piwik to create, by default, 1st party cookies only. This is mainly for privacy reasons, but also for better accuracy in counting unique visitors (1st party cookies are more often accepted and less often deleted by users).
This ticket is a requirement for #134 and #1984
Keywords: scalability, cookie, 1st party cookie
+1 for this
Any news? We have Piwik deployed to track widget views (LOTS of hits from different domains) and we are forced to increase the header size in Apache...
same issue here. I already had to increase allowed header size in nginx 2 times with just a couple thousand sites.
This is planned to be fixed before Piwik 1.0, which means in the next 2 months. If you can help with implementation or testing, please let us know. This is def a high priority issue.
We should do the quick-fix solution for 1.0, ensuring we store the data for the most recent websites, up to a reasonable limit (1kb?). If a cookie averages 200b, we could still store 5 sites without failing as it does now.
We could then do the scalable long term solution post 1.0.
The goal would be to slightly update the Cookie mechanism in Tracker to have it store a total max of 1kb, discarding older tracking cookies.
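The quick fix described above can be sketched roughly as follows. This is only an illustration in JavaScript, not the Tracker's actual PHP code: the trimToBudget helper and the { idsite, value } entry shape are hypothetical, and sizes assume one byte per character.

```javascript
// Hypothetical helper: keep the newest per-site entries that fit in maxBytes.
// `entries` is ordered oldest first; each entry is { idsite, value }.
function trimToBudget(entries, maxBytes) {
    var kept = [];
    var used = 0;
    // Walk from newest to oldest so the most recent sites survive.
    for (var i = entries.length - 1; i >= 0; i--) {
        var size = entries[i].value.length; // assumes ASCII values (1 byte per char)
        if (used + size > maxBytes) {
            break; // this entry and everything older is discarded
        }
        used += size;
        kept.unshift(entries[i]);
    }
    return kept;
}

// Six sites at 200 bytes each against a 1kb budget.
var entries = [];
for (var site = 1; site <= 6; site++) {
    entries.push({ idsite: site, value: new Array(201).join('x') });
}
var kept = trimToBudget(entries, 1024);
// kept holds sites 2..6; the oldest entry (site 1) was dropped
```

With 200b entries the budget holds five sites, which matches the estimate above.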
(In [2777]) Refs #409
Any news on when Piwik is going to support 1st party cookies?
3rd party cookies are less well accepted, not only by browsers but also by people.
I think it'll be good for stats, for Piwik PR and Piwik acceptance to switch over.
thanks!
When implemented, we should also have the PiwikTracker api class set the 1st party cookie forwarded from the piwik server response.
In [3544], I added core/Tracker/Cookie.php to encapsulate the ignore_cookie. But it too suffers from the third-party cookie issue.
Replying to matt:
When implemented, we should also have the PiwikTracker api class set the 1st party cookie forwarded from the piwik server response.
The first-party "cookie" will actually be a UUID (not necessarily rfc4122 compliant) generated by piwik.js and passed to piwik.php via a new parameter. Any allowed third-party cookies will continue to be signed and sent via the Cookie: header.
The tracker session table will map first and third party visitor id_cookies (plus idsite to act as indices) to rows that contain the former cookie store.
Use cases for this feature:
Requirements piwik.js
Math.round(new Date().getTime() / 1000)
This will be used to process 'Days to conversion' for goal conversions.
IF the visit is new (i.e. there was no _pk_ses cookie when track* was called initially)
AND there is a referer URL which domain is not the current domain, or any subdomain set in setDomainNames
AND (_pk_ref is empty // if the _pk_ref cookie is not set, we always set it
OR setConversionAttributionFirstReferer == false // the _pk_ref cookie is already set, but we overwrite the value since we want to attribute the last known referer
OR _pk_ref is set AND hostname of _pk_ref URL is the current domain, OR any subdomain // the _pk_ref was set to a referer, but as we evaluate this URL again now, it no longer fits the spec. This could happen if a _pk_ref URL was set earlier, and then the user updated the website to call setDomainNames(..). We want to improve visitors' cookie data in this case.
)
THEN update _pk_ref with the current referer URL, truncated to 1k
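The rule above could be expressed as a pure function along these lines. This is a sketch under assumptions, not the piwik.js implementation: the function names, the shape of the state object, and the simple hostname regex are all hypothetical.

```javascript
// Extract the hostname from a URL without relying on a browser API.
function getHostName(url) {
    var m = /^(?:[a-z][a-z0-9+.-]*:)?\/\/([^\/?#:]+)/i.exec(url || '');
    return m ? m[1].toLowerCase() : '';
}

// True if host equals one of the site's domains or is a subdomain of one.
function isSiteHostName(host, domains) {
    for (var i = 0; i < domains.length; i++) {
        var d = domains[i].toLowerCase();
        if (host === d || host.slice(-(d.length + 1)) === '.' + d) {
            return true;
        }
    }
    return false;
}

// state: { isNewVisit, referrerUrl, storedRefUrl, attributeFirstReferrer, domains }
function shouldUpdateRef(state) {
    if (!state.isNewVisit) {
        return false;
    }
    var refHost = getHostName(state.referrerUrl);
    if (!refHost || isSiteHostName(refHost, state.domains)) {
        return false; // no referrer, or an internal one
    }
    return !state.storedRefUrl                       // _pk_ref is empty
        || !state.attributeFirstReferrer             // attribute the last known referrer
        || isSiteHostName(getHostName(state.storedRefUrl), state.domains); // stale internal value
}

// Value to store when the rule fires, truncated to 1k characters.
function refCookieValue(referrerUrl) {
    return referrerUrl.slice(0, 1024);
}
```

For example, a new visit arriving from a search engine with no stored _pk_ref would update the cookie, while a new visit whose referrer is a subdomain listed in setDomainNames would not.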
Requirements piwik.php
Documentation:
Ideas for V2
Also I think the piwik_ignore cookie should stay 3rd party (and signed), to avoid abuse.
(In [3634]) Fixes #1916
Now always checking in the DB if we saw the visitor earlier. The cookie also becomes much smaller.
Renamed the setting enable_detect_unique_visitor_using_settings to trust_visitors_cookies, as the logic is now different; it should only be enabled on an intranet where the IP is the same for all users.
This will also help getting 1st party cookie implemented Refs #409
Also we need to think about subdomains tracking and first party cookies. How does GA handle this for example? see for reference: http://www.roirevolution.com/blog/2011/01/google_analytics_subdomain_tracking.php
and http://www.dannytalk.com/how-to-track-sub-domains-cross-domains-in-google-analytics/
matt: do you still want this one? It doesn't appear in the request. To manage this on the client, requires also keeping track of the timestamp for the most recent page view of the current visit.
Because of this condition:
AND there is a referer URL which domain is not the current domain, or any subdomain set in setDomainNames
_pk_ref will never contain a referer for the current domain or subdomain; so, this expression will never be true:
OR _pk_ref is set AND hostname of _pk_ref URL is the current domain, OR any subdomain
re: comment:38 - oops, I didn't scroll all the way to the right to read your comment; got it
The timestamp in comment:37 is still an open question.
Also, you mention that _pk_ref "Contains time at which ref URL was set", but this timestamp doesn't appear in the request either. (If I store this in the cookie, I need to change the delimiter, as the referrer may contain '.')
To manage this on the client, requires also keeping track of the timestamp for the most recent page view of the current visit.
OK that's right, this timestamp can also be saved in the cookie (ie. _pk_ses cookie?)
"Contains time at which ref URL was set", but this timestamp doesn't appear in the request either. (If I store this in the cookie, I need to change the delimiter, as the referrer may contain '.')
What do you mean by "it doesn't appear in the request"? I mean, the _pk_ref must contain the URL as well as the client timestamp when the cookie was last updated with a ref URL.
Thx
I mean your specification doesn't show any parameters in the request to piwik.php for these timestamps.
&_viewts=TIMESTAMP_OF_LAST_PAGE_VIEW_OF_LAST_VISIT
&_refts=TIMESTAMP_OF_REFERRAL
If I understand _pk_ses correctly, the timestamp of the most recent page view (cvts) would have to instead be stored in _pk_id.
Indeed, I have now updated the request to add these 2 timestamps.
Also I'm not sure what I meant by: &_ses=1 in the URL... ? maybe this is not useful.
Maybe if _ses=0, the server should use third-party cookies?
(In [3783]) refs #409 - first party cookies
var visitorId;
_paq.push(function () {
    visitorId = this.getVisitorId();
});
removed internal dropCookie() method as it was never used
@todo Missing unit tests and cross browser testing
refs #739 - piwik.js improvements
refs #752 - track middle mouse button clicks (via mousedown+mouseup pseudo-click handler); defaults to tracking true "clicks"
refs #1984 - custom variables vs custom data
@todo These are just stubs.
tracker.setCustomData(null);
(In [3784]) refs #409 - use getCookieName() in hasCookies() test
Mark as fixed. Future commits to #1984.
I still have to do some work :)
Also,
JS code review
Questions/feedback
_ref undefined
_refts undefined
_viewts undefined
I think these should be set only when they have a value
PREFIXid.1fffd42e=fb6f5c3ec259b00e.1295573291.1.1295573291.undefined;
Pending more items, as well as docs.
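One way to set _ref, _refts and _viewts only when they have a value is to skip empty parameters while building the request string. A minimal sketch; buildQuery is a hypothetical helper, not a piwik.js function:

```javascript
// Build a query string, leaving out parameters that have no value.
function buildQuery(params) {
    var parts = [];
    for (var name in params) {
        if (Object.prototype.hasOwnProperty.call(params, name)) {
            var value = params[name];
            if (value !== undefined && value !== null && value !== '') {
                parts.push(name + '=' + encodeURIComponent(value));
            }
        }
    }
    return parts.join('&');
}

var query = buildQuery({
    idsite: 1,
    _ref: undefined,       // no referrer stored yet
    _refts: undefined,
    _viewts: 1295573291
});
// query === 'idsite=1&_viewts=1295573291'
```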
Replying to matt:
- Are ref URLs encoded by default? in the cases where: it comes from the browser itself, OR when it was set via setReferrerUrl ?
Browser-dependent. We have to encode it in case it isn't.
- If ref URL can contain a space (ie. sometimes not encoded), it will record a bogus cookie - should ref.split(' '); be ref.split(' ', limit = 1) ?
Good point. I've changed it to use limit=1 and '.' as a separator (consistent with id).
- Referrer url doesn't seem to be truncated at 1k, important for keeping cookie space in control
No, it isn't. The spec is 4K. The actual limit is browser dependent, and also subject to server configuration limits.
- Running the new JS for the first time, I see in the http request:
I'll fix that.
- can all cookie timeout methods take seconds as input? this is less risky (if they enter the timeout in seconds but expects milliseconds, things will break), but also more consistent/user friendly
This is for consistency with G.
- getVisitorId() returns undefined (visitorId not set)
I'll fix that.
- I looked at the cookie after some testing, and noticed the last field of 'id' cookie is undefined: PREFIXid.1fffd42e=fb6f5c3ec259b00e.1295573291.1.1295573291.undefined;
Same bug as running JS for the first time.
- I don't think we need enableServerCookies(): enabling 3rd party cookies will be done in server side via config setting, will the client side have a use?
Another analytics offers a thirdParty setting via JS. Removed for now.
Replying to matt:
- For compatibility with https pages, the cookie secure flag should be set automatically based on the current URL protocol (in setCookie())
Ok.
(In [3789]) refs #409 - remove enableServerCookies(); fix bugs found in matt's review
(In [3794]) refs #409 - set secure flag in cookies per comment:51
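For illustration, setting the secure flag from the page protocol might look like the sketch below. The buildCookieString helper and its parameters are hypothetical; in a browser the result would be assigned to document.cookie with document.location.protocol passed in.

```javascript
// Assemble a cookie string; add the secure flag only on https pages.
function buildCookieString(name, value, options, protocol) {
    var parts = [name + '=' + encodeURIComponent(value)];
    parts.push('path=' + (options.path || '/'));
    if (options.domain) {
        parts.push('domain=' + options.domain);
    }
    if (protocol === 'https:') {
        parts.push('secure');
    }
    return parts.join('; ');
}

var onHttps = buildCookieString('_pk_ses', '*', { path: '/' }, 'https:');
// onHttps === '_pk_ses=*; path=/; secure'
var onHttp = buildCookieString('_pk_ses', '*', { path: '/' }, 'http:');
// onHttp === '_pk_ses=*; path=/'
```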
(In [3797]) refs #409 - rename setConversionAttributionFirstReferer to setConversionAttributionFirstReferrer for correctness/consistency, i.e., referrer/referral
(In [3814]) refs #409 - reorg js unit tests
(In [3817]) refs #409 - added setDoNotTrack(bool); updated jslint to 2011-01-26
(In [3818]) refs #409 - small optimization to r3817
_ref is showing up undefined in my logs; I'll fix this and add some more unit tests (tomorrow?)
Replying to vipsoft:
Replying to matt:
- Are ref URLs encoded by default? in the cases where: it comes from the browser itself, OR when it was set via setReferrerUrl ?
Browser-dependent. We have to encode it in case it isn't.
OK, should JS ensure all URLs are encoded before working on them?
- Referrer url doesn't seem to be truncated at 1k, important for keeping cookie space in control
No, it isn't. The spec is 4K. The actual limit is browser dependent, and also subject to server configuration limits.
A cookie that is too big is not desirable, as it will show up in all http requests and slow the page load, plus it could cause other problems with cookie space.
We must truncate at some length, maybe 2k?
- can all cookie timeout methods take seconds as input? this is less risky (if they enter the timeout in seconds but expects milliseconds, things will break), but also more consistent/user friendly
This is for consistency with G.
OK, I vote for using seconds as ms doesn't make sense in this case. Let's not follow GA API since it will cause user errors (and we have already a few differences anyway)
OK for the other modifications, good stuff. Is there anything still open apart from the points above?
We already assume URLs are decoded when working on them. Values are decoded by getCookie; conversely, values are encoded by setCookie and sendRequest. I don't see any need to change this.
This isn't a problem that we need to solve. Users may want to be aware of potential limits, but they shouldn't be artificially constrained. Tracking requests are sent asynchronously, and shouldn't affect page load time. Loading piwik.js (minified at 14K), when it isn't in the cache, has more impact on page load times.
I'll change the API methods to expect seconds, but we should do so for all methods. For setLinkTrackingTimer() this will be a compat-buster.
As an observation, when Piwik is on the same domain as the site being tracked, first party cookies will be sent in the Cookie: header, in addition to being in the tracking request. Some ideas would be to (a) leave this as is, (b) add a method to disable first party cookies, or (c) detect when the site being tracked and tracker are on the same domain and in this case, shorten the request string by excluding the cookie values.
(In [3846]) refs #409 - fix _ref=undefined bug caused by split('.', 1); also external API methods now expect seconds, and convert to milliseconds internally
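The _ref=undefined bug follows from how JavaScript's split() treats its limit argument: it caps the number of returned pieces and discards the rest, rather than leaving the remainder in the last piece. A minimal sketch, assuming a cookie value laid out as a timestamp, a '.', then the referrer URL (the layout and the splitOnce helper are illustrative, not taken from piwik.js):

```javascript
var cookieValue = '1295573291.http://www.example.org/a.b.html';

// split() with a limit keeps only the first piece; the URL is lost.
var naive = cookieValue.split('.', 1);
// naive is ['1295573291'], so naive[1] is undefined

// Split at the first separator only, keeping the remainder intact.
function splitOnce(value, separator) {
    var i = value.indexOf(separator);
    if (i === -1) {
        return [value];
    }
    return [value.slice(0, i), value.slice(i + separator.length)];
}

var parts = splitOnce(cookieValue, '.');
// parts is ['1295573291', 'http://www.example.org/a.b.html']
```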
Replying to vipsoft:
I'll change the API methods to expect seconds, but we should do so for all methods. For setLinkTrackingTimer() this will be a compat-buster.
Done.
As an observation, when Piwik is on the same domain as the site being tracked, first party cookies will be sent in the Cookie: header, in addition to being in the tracking request. Some ideas would be to (a) leave this as is, (b) add a method to disable first party cookies,
or (c) detect when the site being tracked and tracker are on the same domain and in this case, shorten the request string by excluding the cookie values.
The problem with (c) is that the cookies are unsigned, so the server discards the value.
(d) detect when the site being tracked and tracker are on the same domain, and in this case, automatically disable first party cookies
fwiw I think the redundancy in the Cookie: header is a low priority -- it isn't a problem we need to solve now.
setLinkTrackingTimer is fine in milliseconds, since it requires this precision (which is not needed/desired for cookie timeouts). We can clarify what parameter we expect in the documentation and in the parameter names. I vote for revert as introducing an API change in the documented method at this stage is not possible - thoughts?
My concern with cookie sizes was purely around slowing down the whole website experience, since 1st party cookies are in the cookie headers. So with a 2k cookie, fetching 10 images and 5 other resources will cause an overhead of 2k * 15 = 30k of data transmitted over http, which could result in a worse user experience. I still think we must truncate to 1 or 2k, but agreed that this should be documented and maybe could be changed via a new setConversionReferrerUrlTruncation() or something similar.
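The overhead estimate above is simple to reproduce, and to compare against a 1k truncation. The helper name is hypothetical, and the figures cover cookie bytes only, ignoring other headers:

```javascript
// Extra bytes sent when every same-domain request carries the cookie.
function cookieOverheadBytes(cookieBytes, requestCount) {
    return cookieBytes * requestCount;
}

var requests = 10 + 5; // 10 images and 5 other resources
var at2k = cookieOverheadBytes(2 * 1024, requests); // 30720 bytes, i.e. 30k
var at1k = cookieOverheadBytes(1 * 1024, requests); // 15360 bytes, i.e. 15k
```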
(In [3852]) refs #409 - revert API change to setLinkTrackingTimer()
Since the conversion referral URL is set (if needed) at the beginning of a new session and used (currently) at most once per visit, one idea would be to store this server side. This would minimize the cookie size and transmission overhead; the tradeoff is executing some extra (albeit infrequent) SQL on the server.
There's also a small privacy/security issue with storing the referral URL in a cookie.
vipsoft, I updated my comment about the visitor log table new feature, see #1434 - I think it would be best to go this way in the future indeed. Just more overhead for more features :)
Ok. Hopefully it won't take as long as it did this ticket... ;)
(The space/transmission overhead gets worse when there are multiple trackers on the same page, using different cookie name prefixes.)
(In [3868]) refs #409 - add back legacy tracking; update jslint
(In [3888]) Refs #409
; if set to 0, any goal conversion will be credited to the last more recent non empty referer.
; when set to 1, the first ever referer used to reach the website will be used
use_first_referer_to_determine_goal_referer = 0
; Piwik uses first party cookies by default. If set to 1,
; the visit ID cookie will be set on the Piwik server domain as well
; this is useful when you want to do cross websites analysis
use_third_party_cookies = 0
(In [3892]) Refs #409
(In [3893]) Refs #409 Disabling getVisitorId() for now as it doesn't work when called before track* (the object should init the uuid member before getRequest())
Would be nice to have though, to make it trivial to get the visitorId from piwik into other systems (Salesforce, Form fill), and then also allow querying the Live! API to fetch data about this visitor.
I think all outstanding points, apart from JS tests and JS Doc, are in trunk and working?
We can't reliably retrieve an existing uuid until the cookie domain, path, and prefix are definite. If we pre-initialize it and then re-read the cookie each time domain, path, or prefix is changed, then the side effect is that the uuid may differ depending on when getVisitorId() is called.
Vote to either re-enable the as-implemented behaviour or remove this feature entirely.
My idea was to have getVisitorId() call a loadIdCookie or similar, that would only pre-load this cookie so we can read it. The user should call getVisitorId() once all setCookie* methods have been called, but shouldn't have to call it after track*, since they might need it before the request is sent (e.g. when submitting a form in the page and wanting to attach the Piwik ID).
(In [3939]) refs #409:
refs #1984:
refs #2078 Webkit bug ("Failed to load resource") when link target is the current window/tab
if ((new RegExp('WebKit')).test(navigatorAlias.userAgent)
        && (!sourceElement.target.length || sourceElement.target === '_self')
        && linkType === 'link') {
    // open outlink in a new window
    sourceElement.target = '_blank';
}
(In [3960]) refs #409 - add site ID to cookie name; shorten domain hash to 16 bits (4 hexit characters)
This is a hybrid between the previous implementation and what I proposed.
Decided not to auto-set www.example.com's cookie domain=.example.com -- as the convenience introduces side-effects, and I have a feeling it will be more trouble than it is worth. Will continue to leave it to the user to explicitly set the cookie domain. Users should be advised to redirect example.com to www.example.com (or vice-versa):
a) to avoid separate cookies between the two domains, and
b) to improve SEO. (Google for "seo www vs no www".)
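To make the naming scheme from [3960] concrete, here is a rough sketch of a cookie name built from a prefix, the site ID and a 16-bit hash of the cookie domain and path. Everything here is an assumption for illustration: the ticket does not say which hash function piwik.js uses or the exact order of the name parts, so FNV-1a stands in and the helper names are hypothetical.

```javascript
// Stand-in 32-bit FNV-1a hash (an assumption; not necessarily what piwik.js uses).
function fnv1a(str) {
    var h = 0x811c9dc5;
    for (var i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
}

// 16-bit hash of cookie domain + path, as 4 hex characters.
function domainHash(domain, path) {
    var hex = ('0000000' + fnv1a(domain + path).toString(16)).slice(-8);
    return hex.slice(0, 4);
}

// Hypothetical layout: <prefix><base>.<siteId>.<hash>
function cookieName(prefix, base, siteId, domain, path) {
    return prefix + base + '.' + siteId + '.' + domainHash(domain, path);
}

var nameSite1 = cookieName('_pk_', 'id', 1, 'www.example.com', '/');
var nameSite2 = cookieName('_pk_', 'id', 2, 'www.example.com', '/');
// Same domain and path give the same hash, but the names differ by site ID,
// so each site ID tracked on a page gets its own set of cookies.
```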
Revision [3960] is causing a lot of breakage on our sites. We track multiple site IDs using the same domain name. On each request we call trackPageView twice, once for each site ID. The new mechanism of adding the site ID to the cookie name is causing the headers to overflow the server buffers, leading to numerous errors on our server.
I believe the goal of this change was to be able to track different site IDs for the subdomains. But if that is a requirement of the application, then the application should do so by calling trackPageView twice or three times.
The current implementation would lead to an endless increase in the number of cookies as the user moves from one site to the next, which is what is happening on our side.
Sam: it depends how you use piwik.js. (fyi the reason for the hash is mentioned in comment:44)
Are you using two sites ids across the entire site?
Or many more? eg one site-wide, and another that varies/depends on some area of the website? In this scenario, you should use setCookiePath()
Can you see if the TrackSiteByUrl plugin can be adapted for your environment?
I looked at comment:44 the only thing I can see regarding the relevance would be the social network example. Unfortunately I couldn't find the description of this use case.
In our setup, we are using the same URL for all the various sites so cookie paths are not going to work.
Regarding the site IDs, we have one siteId that is site-wide and another one based on the client. We have thousands of clients that we track. We developed our own plugin that creates the site during our own account creation process, retrieves the site ID and embeds it into the db for tracking.
One thing to note here is if you think about it abstractly, you have one user one browser. Can that user really have multiple identities, referral URLs ..etc based on the siteId????!!! I think this fix is trying to do something Piwik shouldn't be responsible for.
This is a side note, but referral URLs can be monstrous in size. Adding them to cookies can be a real pain on the server. If you already have them in the db based on the visitor id/siteId, should they really be in the cookie as well?
The hash addresses the subdomain cookie leak problem in Firefox.
Each tracker instance can point to a different Piwik server. If you're using cookie domains and/or paths, then it is possible for the cookie contents to be different.
But if that is the case then you can create the hash based on the Piwik domain URL instead of on the siteId.
I still don't understand how adding the site Ids will fix the FF issue. Wouldn't the cookies still leak to the subdomains?
Regarding different siteIds values for subdomains. I might be wrong here but if there are two cookies with the same name if the subdomain is set for one of them and you visited the subdomain, wouldn't that return the one that has the subdomain set?
the hash is only on cookie domain and path
in any case, I think you're focussing too much on the hash
the bigger picture is that your visitors are amassing many, large cookies. What you expect/want is one client side cookie with server-side storage for the bulk of the cookie contents, that can somehow be mapped to one or more tracking site IDs. This wasn't part of the scope of this ticket, so it isn't something Piwik does right now. I'll create a new ticket for this feature request and we'll figure it out from there.
See the new ticket at #2680
See also #2211 piwik.js: Cross domain tracking