@tsteur opened this Issue on August 27th 2014 Member

Kinda similar to replaying the piwik.php logs

It would make tracking really fast if we only stored the piwik.php requests in a queue (like Redis or an actual queue such as RabbitMQ). A tracker node would then pick, say, 100 requests out of the queue whenever it has resources and process them via bulk tracking.

So redirects via piwik.php would be very fast and there should be no problem when there is a peak.
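
A minimal sketch of that idea, not the actual implementation: the `TrackingQueue` class below is a hypothetical in-memory stand-in for a Redis list or a message queue, and `bulk_track` is a placeholder for whatever hands a batch to bulk tracking. The batch size of 100 is taken from the description above.

```python
from collections import deque

BATCH_SIZE = 100  # "pick like 100 out of the queue"


class TrackingQueue:
    """In-memory stand-in for a Redis list or message queue (assumption)."""

    def __init__(self):
        self._items = deque()

    def push(self, raw_request):
        # What the tracker endpoint would do per hit: append and return.
        self._items.append(raw_request)

    def pop_batch(self, size=BATCH_SIZE):
        # Take up to `size` requests in the order they were recorded.
        batch = []
        while self._items and len(batch) < size:
            batch.append(self._items.popleft())
        return batch

    def __len__(self):
        return len(self._items)


def drain(queue, bulk_track, size=BATCH_SIZE):
    """Process the queue batch by batch; returns the number of batches."""
    batches = 0
    while len(queue) > 0:
        bulk_track(queue.pop_batch(size))
        batches += 1
    return batches
```

With 250 queued requests, `drain` would call `bulk_track` three times, with batches of 100, 100 and 50 requests, in recording order.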

@tsteur commented on November 14th 2014 Member

Started working on this, but it is not that easy to make sure it works under all circumstances.

I guess we can have one queue per idsite? and process them in parallel? Meaning the same user will have a different visitorid etc in different sites right?

@mattab commented on November 14th 2014 Member

user will have a different visitorid etc in different sites right?

that's already the case that Piwik does not do any cross website reporting so +1

(FYI: the only cross-website reporting is done by our plugin InterSites, which uses the config_id to figure out visitors across websites.)

@tsteur commented on November 16th 2014 Member

Is there a use case that someone might want to enable the queue only for some websites but not for all? Eg for sites having only a few visits still track it directly? Note: Makes it more complicated but still asking ^^ Maybe supported in V2 otherwise. Maybe we will have to develop it like this anyway... will see

@mattab commented on November 19th 2014 Member

Is there a use case that someone might want to enable the queue only for some websites but not for all? Eg for sites having only a few visits still track it directly?

I don't think we need this for now, easier to consider that the whole of Piwik will use only one queue for all websites

Note: in this issue we will work on Redis support for the queue

@afterlastangel commented on December 1st 2014

I implemented this functionality 2 years ago for my company. With the queue it can handle up to 30 million pageviews a day.
Matt asked me to publish the script but it was forbidden by company rules. http://forum.piwik.org/read.php?6,73486,page=1#msg-89108

But that policy no longer applies now that I have left, so I can help you make this functionality public :).
I only initialized the git repo yesterday, but then I found the bulk upload in the track.php, and then I found this issue.
https://github.com/afterlastangel/piwik-tracking-queue/tree/master

Do I need to join this thread, or can I just implement it in my own repo and then make a pull request?

There are 2 problems that I solved:

  • Cookies for user ID: I wrote an API for cookie sync, so each time Piwik.js makes a call it syncs the cookies with the other sites.

  • An additional Piwik Redis queue for session storage, so that we have session data in real time and can bulk process it later with piwik.php.

The last version I worked on was 1.8, so I need to read the Piwik 2.0 code before contributing :P

@mattab commented on December 2nd 2014 Member

Hi @afterlastangel thanks for the note. @tsteur is really actively working on it for days now, and we could definitely do with some testing especially performance testing. You can see his work in this branch: https://github.com/piwik/piwik/tree/6075_tracker_queue_ondemand

@trucleavinetworks commented on December 8th 2014

Some ideas for https://github.com/piwik/plugin-QueuedTracking
"There should be only one Redis server to make sure the data will be replayed in the same order as they were recorded"
Solution: use a hash of the Visitor ID for the Redis key, so that we can have multiple shards that are spread across servers.
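
A rough sketch of what such a mapping could look like, assuming a fixed number of shards. The function names and the `trackingQueue` key prefix are made up for illustration and are not the plugin's actual Redis keys.

```python
import zlib


def shard_for(visitor_id, num_shards):
    """Map a visitor ID to a shard index that is stable across processes."""
    # crc32 is deterministic, unlike Python's built-in hash() for strings,
    # which is randomized per process by default.
    return zlib.crc32(visitor_id.encode("utf-8")) % num_shards


def queue_key(visitor_id, num_shards, prefix="trackingQueue"):
    # Hypothetical key naming, for illustration only.
    return "%s_%d" % (prefix, shard_for(visitor_id, num_shards))
```

All requests of one visitor land in the same shard, so their order is preserved within that shard, while different visitors can be processed in parallel.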

@tsteur commented on December 8th 2014 Member

We do not have the visitorId at this stage. Another possibility would be to shard by siteId later but also this doesn't always help if there is one big site and many small ones. We will have a look at this kinda stuff later. Optimizing upfront is no good ;)

@afterlastangel commented on December 9th 2014

@tsteur : I'm very interested in Piwik development and can contribute up to 10 hrs a week. How can we discuss what I can help with?

As I remember, the Piwik JavaScript sends the _id for the visitor ID. We could even use the IP address for hashing.

@tsteur commented on December 11th 2014 Member

ping @mattab

Re visitorId: we could maybe use the IP address although it would probably have to use the anonymized. It is planned to anonymize/randomize the IP even further, so it would not really work I guess, see https://github.com/piwik/piwik/issues/5907#issuecomment-66404178. There are surely some possibilities here but I would prefer to think about it once we have a concrete case for that. Also, on mobile devices the IP can change very often I think (which currently generates a new visit, I think).

@mattab commented on December 15th 2014 Member

Hi @tsteur here is the exact algorithm we should use, which we use currently in import_logs.py in #6664 :

            for param_name_to_use in ['uid', 'cid', '_id', 'cip']:
                if param_name_to_use in self.args:
                    visitor_id = self.args[param_name_to_use]
                    break

so the parameters that should be used to assign visitors to a given queue are in this order of importance (first is most important)

  • uid
  • cid
  • _id
  • cip
  • actual IP address in headers
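
As a standalone sketch of that order (the function name and the `header_ip` argument are assumptions for illustration, `args` is a plain dict of tracking request parameters):

```python
# Parameter names in order of importance, first is most important.
ID_PARAMS = ("uid", "cid", "_id", "cip")


def visitor_key(args, header_ip=None):
    """Return the first identifier present in the request parameters.

    Falls back to the IP address taken from the request headers
    (passed in as `header_ip`) when none of the parameters is set.
    """
    for name in ID_PARAMS:
        if name in args:
            return args[name]
    return header_ip
```

For example, a request carrying both `_id` and `cip` would be assigned by its `_id`, and a request with none of the four parameters would be assigned by the header IP.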

@tsteur commented on December 15th 2014 Member

Please note that we won't implement this in V1, in case anyone is wondering. First we need to make sure the current solution works and is stable. When needed we can add further complexity later.

@mattab commented on December 15th 2014 Member

Sure, makes sense. I wrote it down because you said "we could maybe use the IP address although it would probably have to use the anonymized", which I don't think is a good idea ;-)

@tsteur commented on December 15th 2014 Member

Didn't think it is a good idea that's why there is the "although" ;) No it is already I asked for it... ;) I just wanted to make sure it is clear that we won't implement it right now as a few users asked for it.

@mattab commented on December 16th 2014 Member

Hi @tsteur is there any work left for this issue?

@tsteur commented on December 16th 2014 Member

Not really for now. I only wanted to wait for the test, which seems to work now. Guess we will see how it goes after a try on a bigger environment on a cloud test instance.

This Issue was closed on December 16th 2014