@MichaelHeerklotz opened this Issue on May 2nd 2019 Contributor

This is a pretty nasty denial of service bug:

Some actions, for example:

  • adding a new user
  • updating a user or user rights
  • deleting a user
  • changing global site settings
  • and more

will call core\Tracker\Cache->deleteTrackerCache(), which deletes the whole cache.
This causes core\Plugin\Manager->getLicenseValidInfo() to get called for every activated premium plugin on the next tracking request or admin panel request.
getLicenseValidInfo() will send a http request to the matomo marketplace server.

On big setups like mine (>8000 tracking requests per minute) there are 350+ php workers running at the same time, many of them will call getLicenseValidInfo() roughly at the same time after a deleteTrackerCache() cache.

This means: About every 4th time I edit a users permissions or add a new site, the API request doing this admin operation will take over 60 seconds to complete, if it completes at all.
And during and after this, all 350+ php workers will get stuck, all trying to send a http request to the matomo marketplace. Bascially thousands of http request at the same time.

Our load balancer will then report for about 60s with "502 Bad Gateway" because all web servers are stuck on sending thousands of http requests to the matomo marketplace.

We have about 12 premium plugins activated, and about 700 php workers, this means up to 700*12 = 8400 http requests at the same time caused by a single admin action. These will possibly add up if editing multiple users in a short time period.

During this 60+ seconds we lose thousands of tracking requests.

I think an easy fix for now would be to introduce a new "plugin cache", that is independent from the normal "cache" and not affected by deleteTrackerCache(),

Alternative fix: do not call getLicenseValidInfo() during tracking requests.

I can try to create a PR for this, but I would like to know how you want to proceed.

@tsteur commented on May 2nd 2019 Member

Quickly looking at the tracker cache the problem looks indeed that deleteTrackerCache deletes all lazy caches in https://github.com/matomo-org/matomo/blob/3.10.0-b1/core/Tracker/Cache.php#L216 might not be too easy to change though as it could break something maybe.

We could in tracker mode only access a cached result. Eventually an archiver or something would create the cache key again. Or if you currently delete a user, the cache would be quickly created again through the UI while you are logged in. Of course if someone uses eg the API to delete a user, and then doesn't issue another API request, and doesn't have any archiving active then it may take a while for the cache to be written again but it should all be fine. In the worst case for a little while a bit more information is tracked.

What do you think about https://github.com/matomo-org/matomo/pull/14405 @MichaelHeerklotz ?

@MichaelHeerklotz commented on May 3rd 2019 Contributor

Thank you for your patch @tsteur.
https://github.com/matomo-org/matomo/pull/14405 looks good, I have tested it on our production setup.

Changing user permissions takes now

  • about 100ms if no license checks are done
  • about 900ms if license checks are done (after a previous user permission change)
  • one time it spiked to about 5.5s but only once in about 20 tries, so this is acceptable I would say

I was not able to make it overload again, so the denial of service / overload problem seems to be gone.

While the problem is gone, maybe it would still be a good idea to separate the plugin license cache in the future, maybe with Matomo 4? Most of the time when I edit users or sites, I edit a bunch of them, and it seems a bit wasteful to check all plugin licenses again for every single action.

But this issue is resolved with your patch, thanks a lot :)

@tsteur commented on May 5th 2019 Member

image

The cache is kind of separate. The problem is really that Tracker\Cache should here instead only deleteCacheGeneral() and have some feature like self::getCache()->flushLike('cacheWebsiteAttributePrefix_').

This Issue was closed on May 5th 2019
Powered by GitHub Issue Mirror