Kinda similar to replaying the piwik.php logs
Tracking would become really fast if we only stored the piwik.php requests in a queue (like Redis, or an actual message queue such as RabbitMQ), and a tracker node would then always pick, say, 100 requests out of the queue when there are resources and process them via bulk tracking.
So redirects via piwik.php would be very fast, and there should be no problem when there is a traffic peak.
I started working on this, but it's not that easy to make sure it works under all circumstances.
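To illustrate the pattern described above, here is a minimal sketch (not the actual implementation; `TrackingQueue` and its method names are hypothetical, and a plain Python list stands in for the Redis list):

```python
import json


class TrackingQueue:
    """Sketch of the queue idea: piwik.php requests go into a list
    (with Redis this would be LPUSH on a list key; a plain Python list
    stands in here) and a tracker node later pops them in batches to
    process via bulk tracking."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.items = []

    def enqueue(self, request_params):
        # piwik.php only serializes the raw request and returns immediately,
        # so responses stay fast even during a traffic peak.
        self.items.append(json.dumps(request_params))

    def dequeue_batch(self):
        # A tracker node picks up to batch_size requests when there are
        # resources and would hand them to Piwik's bulk tracking.
        batch, self.items = (self.items[:self.batch_size],
                             self.items[self.batch_size:])
        return [json.loads(item) for item in batch]
```

With Redis, the two operations would roughly map to LPUSH on insert and a pipelined LRANGE/LTRIM (or repeated RPOP) on pickup, so requests are replayed in the order they were recorded.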
I guess we could have one queue per idsite and process them in parallel? Meaning the same user will have a different visitorId etc. in different sites, right?
"user will have a different visitorid etc in different sites right?"
That's already the case: Piwik does not do any cross-website reporting, so +1.
(FYI: the only cross-website reporting is done by our plugin InterSites, which uses the config_id to figure out visitors across websites.)
Is there a use case where someone might want to enable the queue only for some websites but not for all? E.g. for sites having only a few visits, still track them directly? Note: this makes it more complicated, but still asking ^^ Maybe it can be supported in V2 otherwise. Maybe we will have to develop it like this anyway... we'll see.
"Is there a use case where someone might want to enable the queue only for some websites but not for all? E.g. for sites having only a few visits, still track them directly?"
I don't think we need this for now; it's easier to consider that the whole of Piwik will use only one queue for all websites.
Note: in this issue we will work on Redis support for the queue
I implemented this functionality 2 years ago for my company. It can handle up to 30 million pageviews a day with a queue.
Matt asked me to publish the script, but that was forbidden by company rules: http://forum.piwik.org/read.php?6,73486,page=1#msg-89108
But that policy expired after I left, so I can now help you make this functionality public :).
I only initialized the git repo yesterday, but then I found the bulk upload in track.php, and then I found this issue.
Do I need to join this thread, or can I just implement it in my own repo and then make a pull request?
Cookies for user ID: I wrote an API for cookie sync; each time piwik.js calls it, it syncs the cookies with the other sites.
The last version I worked on was 1.8, so I need to read the Piwik 2.0 code before contributing :P
Hi @afterlastangel thanks for the note. @tsteur is really actively working on it for days now, and we could definitely do with some testing especially performance testing. You can see his work in this branch: https://github.com/piwik/piwik/tree/6075_tracker_queue_ondemand
Some ideas for https://github.com/piwik/plugin-QueuedTracking
"There should be only one Redis server to make sure the data will be replayed in the same order as they were recorded"
Solution: use the visitor ID as the hash for the Redis key, so that we can have multiple shards spread across servers.
We do not have the visitorId at this stage. Another possibility would be to shard by siteId later, but that doesn't always help either if there is one big site and many small ones. We will have a look at this kind of thing later. Optimizing upfront is no good ;)
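For reference, the sharding idea could be sketched like this (a hypothetical illustration; the function name and the use of MD5 are my assumptions, not part of the plugin):

```python
import hashlib


def shard_for(key, num_shards):
    """Map an ID (visitorId, siteId, ...) to one of num_shards Redis servers.

    A stable hash keeps all requests for the same key on the same shard,
    so replay order is preserved per visitor (or per site). Note the caveat
    above: sharding by siteId still overloads one shard when a single big
    site dominates the traffic.
    """
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_shards
```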
@tsteur: I'm very interested in Piwik development and can contribute up to 10 hours a week. How can we discuss what I can help with?
Re visitorId: we could maybe use the IP address, although it would probably have to be the anonymized one. It is planned to anonymize/randomize the IP even further, so it would not really work I guess, see https://github.com/piwik/piwik/issues/5907#issuecomment-66404178. There are surely some possibilities here, but I would prefer to think about it once we have a concrete case for that. Also, on mobile devices the IP can change very often I think (which currently generates a new visit, I think).
```python
for param_name_to_use in ['uid', 'cid', '_id', 'cip']:
    if param_name_to_use in self.args:
        visitor_id = self.args[param_name_to_use]
        break
```
So the parameters that should be used to assign visitors to a given queue are checked in this order of importance (first is most important).
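As a self-contained version of that snippet (the function name is my invention; to my understanding `uid` is the user ID, `cid`/`_id` are visitor IDs, and `cip` is the client IP override):

```python
def visitor_id_for_queue(args):
    """Return the first ID parameter present in the tracking request,
    in order of importance: uid, cid, _id, cip."""
    for param in ('uid', 'cid', '_id', 'cip'):
        if param in args:
            return args[param]
    return None  # no usable ID in this request
```

For example, a request carrying both `cid` and `cip` would be assigned by its `cid`, since it comes earlier in the list.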
Please note that we won't implement this in V1 in case anyone is wondering. For the first we need to make sure the current solution works and is stable. When needed we can add further complexity later.
Sure, makes sense. I wrote it down because you said "we could maybe use the IP address although it would probably have to use the anonymized", which I don't think is a good idea ;-)
I didn't think it was a good idea either, that's why there is the "although" ;) No, it is already planned; I asked for it... ;) I just wanted to make sure it is clear that we won't implement it right now, as a few users asked for it.
Not really, for now. I only wanted to await the test, which seems to work now. I guess we'll see how it goes after a try in a bigger environment on a cloud test instance.