Write tracking requests into a queue and don't process them immediately #6075

Closed
tsteur opened this issue Aug 27, 2014 · 16 comments
Assignees
Labels
  • c: Performance: For when we could improve the performance / speed of Matomo.
  • c: Platform: For Matomo platform changes that aren't impacting any of our APIs but improve the core itself.
  • Enhancement: For new feature suggestions that enhance Matomo's capabilities or add a new report, new API etc.
  • Major: Indicates the severity or impact or benefit of an issue is much higher than normal but not critical.
Milestone

Comments

@tsteur
Member

tsteur commented Aug 27, 2014

Kinda similar to replaying the piwik.php logs

Tracking would be really fast if we only stored the piwik.php requests in a queue (such as Redis, or an actual message queue such as RabbitMQ). A tracker node would then pick around 100 requests out of the queue whenever it has resources and process them via bulk tracking.

So redirects via piwik.php would be very fast, and a peak in traffic should not be a problem.
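
A minimal sketch of the idea, in Python for brevity: the tracking endpoint only appends the raw request to a queue, and a separate worker drains up to 100 requests at a time and hands them to a bulk processor. The `collections.deque` merely stands in for Redis or RabbitMQ, and `enqueue_request`, `drain_batch` and `process_bulk` are hypothetical names for illustration, not Piwik APIs.

```python
from collections import deque

BATCH_SIZE = 100  # "pick like 100 out of the queue"

# Stand-in for Redis/RabbitMQ; a real setup would use an external queue.
queue = deque()

def enqueue_request(raw_query):
    """Called by the tracking endpoint: store the request and return at once."""
    queue.append(raw_query)

def drain_batch(q, size=BATCH_SIZE):
    """Remove and return up to `size` requests, oldest first."""
    batch = []
    while q and len(batch) < size:
        batch.append(q.popleft())
    return batch

def process_bulk(batch):
    """Hypothetical bulk tracker: here it only counts the requests."""
    return len(batch)

# Simulate a peak of 250 tracking requests arriving at once.
for i in range(250):
    enqueue_request("idsite=1&rec=1&url=http://example.org/page%d" % i)

# The worker catches up later, one batch at a time.
processed = []
while queue:
    processed.append(process_bulk(drain_batch(queue)))
print(processed)
```

The point of the split is that the cost of `enqueue_request` does not depend on how expensive the actual tracking is, so a peak only makes the queue longer rather than the responses slower.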

@mattab mattab modified the milestones: Piwik 2.7.0, Mid term Aug 27, 2014
@mattab mattab added the Major, Enhancement and c: Platform labels Sep 3, 2014
@tsteur tsteur self-assigned this Nov 14, 2014
@tsteur
Member Author

tsteur commented Nov 14, 2014

Started working on this, but it is not that easy to make sure it works under all circumstances.

I guess we can have one queue per idsite and process them in parallel? Meaning the same user will have a different visitorid etc in different sites right?
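
To make the question concrete, here is a rough sketch of splitting incoming requests into one queue per `idsite`, assuming each queued entry is the raw query string of a tracking request. `split_by_site` is a hypothetical helper, and the in-memory deques stand in for whatever queue backend is used.

```python
from collections import defaultdict, deque
from urllib.parse import parse_qs

def split_by_site(raw_requests):
    """Group raw tracking query strings into one FIFO queue per idsite."""
    queues = defaultdict(deque)
    for raw in raw_requests:
        idsite = parse_qs(raw)["idsite"][0]
        queues[idsite].append(raw)
    return queues

requests = [
    "idsite=1&url=/a",
    "idsite=2&url=/x",
    "idsite=1&url=/b",
]
site_queues = split_by_site(requests)

# Each per-site queue keeps its own arrival order, so the queues could be
# handed to separate workers without reordering requests within a site.
for idsite, q in sorted(site_queues.items()):
    print(idsite, list(q))
```

This only works if nothing in the processing depends on the order of requests across different sites, which is exactly what the question about the visitorid is checking.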

@mattab
Member

mattab commented Nov 14, 2014

user will have a different visitorid etc in different sites right?

That's already the case: Piwik does not do any cross-website reporting, so +1.

(FYI: the only cross-website reporting is done by our InterSites plugin, which uses the config_id to figure out visitors across websites.)

@tsteur
Member Author

tsteur commented Nov 16, 2014

Is there a use case where someone might want to enable the queue only for some websites but not for all? E.g. for sites having only a few visits, still track them directly? Note: it makes things more complicated, but still asking ^^ Otherwise maybe supported in V2. Maybe we will have to develop it like this anyway... we'll see.

@mattab
Member

mattab commented Nov 19, 2014

Is there a use case that someone might want to enable the queue only for some websites but not for all? Eg for sites having only a few visits still track it directly?

I don't think we need this for now; it is easier to assume that the whole of Piwik will use only one queue for all websites.

Note: in this issue we will work on Redis support for the queue

@truclk

truclk commented Dec 1, 2014

I implemented this functionality 2 years ago for my company. With the queue it can handle up to 30 million pageviews a day.
Matt asked me to publish the script, but company rules forbade it. http://forum.piwik.org/read.php?6,73486,page=1#msg-89108

That policy no longer applies now that I have left, so I can help make this functionality public :).
I initialized the git repo just yesterday, then found the bulk upload in track.php, and then I found this issue.
https://github.com/afterlastangel/piwik-tracking-queue/tree/master

Do I need to join this thread, or can I just implement it in my own repo and then make a pull request?

There are 2 problems that I solved:

  • Cookies for user ID: I wrote an API for cookie sync, so each time piwik.js is called it syncs the cookies with the other sites.

  • An additional Piwik Redis queue for session storage, so that we have session data in real time and can bulk process it later with piwik.php.

The last version I worked on was 1.8, so I need to read the Piwik 2.0 code before contributing :P

@mattab
Member

mattab commented Dec 2, 2014

Hi @afterlastangel, thanks for the note. @tsteur has been actively working on this for days now, and we could definitely do with some testing, especially performance testing. You can see his work in this branch: https://github.com/piwik/piwik/tree/6075_tracker_queue_ondemand

@trucleavinetworks

Some ideas for https://github.com/piwik/plugin-QueuedTracking
"There should be only one Redis server to make sure the data will be replayed in the same order as they were recorded"
Solution: use the visitor ID as the hash for the Redis key, so that we can have multiple shards spread across servers.
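
A rough sketch of what hashing by visitor ID could look like, assuming a fixed number of shards and a visitor ID that is available as a string when the request is queued. `shard_for` and `NUM_SHARDS` are hypothetical names, and plain lists stand in for the Redis lists on each server. A digest from `hashlib` is used instead of Python's built-in `hash()`, because the latter is salted per process for strings and would not give the same shard on every server.

```python
import hashlib

def shard_for(visitor_id, num_shards):
    """Map a visitor ID to a shard index that is stable across processes."""
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 4  # assumed number of Redis servers
shards = [[] for _ in range(NUM_SHARDS)]  # lists stand in for Redis lists

events = [("visitor-a", "pageview 1"), ("visitor-b", "pageview 1"),
          ("visitor-a", "pageview 2"), ("visitor-a", "goal")]
for visitor_id, action in events:
    shards[shard_for(visitor_id, NUM_SHARDS)].append((visitor_id, action))

# All requests of one visitor land in the same shard, in recorded order.
a_shard = shards[shard_for("visitor-a", NUM_SHARDS)]
a_actions = [action for vid, action in a_shard if vid == "visitor-a"]
print(a_actions)
```

Ordering is then only guaranteed per visitor, not globally, which is enough for replaying a visit but not for anything that compares timestamps across visitors.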

@tsteur
Member Author

tsteur commented Dec 8, 2014

We do not have the visitorId at this stage. Another possibility would be to shard by siteId later, but that doesn't always help either if there is one big site and many small ones. We will have a look at this kind of stuff later. Optimizing upfront is no good ;)

@truclk

truclk commented Dec 9, 2014

@tsteur: I'm very interested in Piwik development and can contribute up to 10 hrs a week. How can we discuss what I can help with?

As far as I remember, the Piwik JavaScript sends the _id parameter for the visitor ID; we could even use the IP address for hashing.

@tsteur
Member Author

tsteur commented Dec 11, 2014

ping @mattab

Re visitorId: we could maybe use the IP address although it would probably have to use the anonymized one. It is planned to anonymize/randomize the IP even further, so I guess it would not really work, see #5907 (comment). There are surely some possibilities here, but I would prefer to think about it once we have a concrete case for it. Also, on mobile devices the IP can change very often, I think (which currently generates a new visit, I think).

@mattab
Member

mattab commented Dec 15, 2014

Hi @tsteur, here is the exact algorithm we should use, which we currently use in import_logs.py in #6664:

            for param_name_to_use in ['uid', 'cid', '_id', 'cip']:
                if param_name_to_use in self.args:
                    visitor_id = self.args[param_name_to_use]
                    break

So the parameters that should be used to assign visitors to a given queue are, in order of importance (most important first):

  • uid
  • cid
  • _id
  • cip
  • actual IP address in headers
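
As a standalone illustration of that precedence, including the final fallback to the request's IP address (which the four-parameter loop above does not cover), something like the following could work. `visitor_key` and its `remote_ip` argument are hypothetical names, and `args` is assumed to be the already-parsed tracking parameters as a dict.

```python
# Tracking parameters in order of importance, most important first.
PRIORITY = ["uid", "cid", "_id", "cip"]

def visitor_key(args, remote_ip):
    """Return the value that identifies the visitor for queue assignment.

    Falls back to the IP address taken from the request headers when none
    of the tracking parameters is present.
    """
    for name in PRIORITY:
        if name in args:
            return args[name]
    return remote_ip

print(visitor_key({"cid": "c1", "_id": "v1"}, "203.0.113.7"))
print(visitor_key({"idsite": "1"}, "203.0.113.7"))
```

The first call picks `cid` because it ranks above `_id`; the second has none of the four parameters and so falls back to the IP address.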

@tsteur
Member Author

tsteur commented Dec 15, 2014

Please note that we won't implement this in V1, in case anyone is wondering. First we need to make sure the current solution works and is stable. We can add further complexity later when needed.

@mattab
Member

mattab commented Dec 15, 2014

Sure, makes sense. I wrote it down because you said "we could maybe use the IP address although it would probably have to use the anonymized", which I don't think is a good idea ;-)

@tsteur
Member Author

tsteur commented Dec 15, 2014

I didn't think it is a good idea, that's why there is the "although" ;) No it is already I asked for it... ;) I just wanted to make sure it is clear that we won't implement it right now, as a few users asked for it.

@mattab
Member

mattab commented Dec 16, 2014

Hi @tsteur is there any work left for this issue?

@tsteur
Member Author

tsteur commented Dec 16, 2014

Not really for now. I only wanted to wait for the test, which seems to work now. I guess we will see how it goes after a try in a bigger environment on a cloud test instance.

@tsteur tsteur closed this as completed Dec 16, 2014
4 participants