[Challenging (impossible?) scenario] Multi tenant queue architecture for rate limiting

Hello everyone,

We have got a challenging scenario on our hands that we have not been able to find advice for anywhere.

Our system is multi-tenant, and we use Hangfire to process a large number of background jobs.

[Dashboard screenshot]

As you can see we have 5 queues to deal with different job types (so we can have different worker threads per queue) and those queues contain jobs from all customers.

This has worked well so far, but now we would like to prevent one customer from starving all the others. For example, one customer running many layers should not prevent other customers from running their layers.
To achieve this we looked at the Hangfire.Throttling package, but it only works per job type, and the job types are shared across customers.

Every job carries its customer code, so it knows which customer it belongs to.

Does anybody have any advice on how to limit the number of concurrent executions per customer in this scenario?

Thank you in advance for taking on the challenge.

I don’t think there is an out-of-the-box solution for this.

You could create something like this attribute in your code, which throttles concurrent execution of a job: https://github.com/alastairtree/Hangfire.MaximumConcurrentExecutions/blob/master/Hangfire.MaximumConcurrentExecutions/MaximumConcurrentExecutionsAttribute.cs

But you want to throttle concurrent execution not just per job, but per “Tenant” as well. For that you would need to modify the GetResource method, perhaps to look at a TenantId that you pass as an argument to your job.

Check out the GetFingerprint method in this gist, which uses the job's arguments to “unique-ify” the lock: https://gist.github.com/sbosell/3831f5bb893b20e82c72467baf8aefea
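To make the idea concrete, here is a minimal sketch of how the lock names could be built so that they are unique per tenant. It is not taken from either linked project: the class TenantLockKeys, the method ForTenant and the "type.method:tenant:slot" format are all made up for illustration. It only builds strings, so it runs without Hangfire.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Hangfire or the linked projects.
public static class TenantLockKeys
{
    // Returns one lock name per allowed concurrent execution ("slot")
    // for the given job method and tenant.
    public static IReadOnlyList<string> ForTenant(
        Type jobType, string methodName, string tenantId, int slots)
    {
        if (string.IsNullOrWhiteSpace(tenantId))
            throw new ArgumentException("A tenant id is required.", nameof(tenantId));
        if (slots < 1)
            throw new ArgumentOutOfRangeException(nameof(slots));

        // Assumed format: the tenant id and slot index are appended to the
        // job name, so two tenants never compete for the same lock.
        return Enumerable.Range(0, slots)
            .Select(slot => $"{jobType.FullName}.{methodName}:{tenantId}:{slot}")
            .ToList();
    }
}
```

A filter modelled on the attribute linked above could then try to lock each of these names in turn and run the job as soon as one of them is free. Where the tenant id comes from (for example, a specific job argument) is up to you.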


Hi Dlongnecker, and thank you for the suggestion. I think I understand what you mean, and it sounds like it could work. However, I’m slightly terrified at the thought of using distributed locks, specifically because I don’t know what happens if a server dies while holding a lock.

If, for example, a job type for a specific tenant can have 2 instances running concurrently and a server dies while holding the lock, wouldn’t that mean that now only one job can run for that job type and that tenant? And if another server dies holding the same type of lock wouldn’t that mean that no jobs can run?

Thus ending up with the Obi Wan problem?

Hey ninoalloy - I’m just a user, not a developer of Hangfire, so I can’t speak to all scenarios.

I can’t speak to the exact mechanism, but I have used Hangfire in production for quite some time with SQL Server for storage. In a cloudy/ephemeral server environment with frequent server crashes and shutdowns, distributed locks have been reliably released after a server crash.

I would encourage some testing on your end to build confidence.

Thank you for the explanation and for sharing your experience. It would be hard to believe that Hangfire did not take server crashes into account when designing distributed locks, so we will try to implement the proposed solution and do some testing on our side before merging into dev.

Thank you again for the help, hopefully this post can help someone else in the future.



I just realized the above may not work as you would hope. It will throttle concurrency on whatever basis you choose, but queued jobs are not guaranteed to execute in the order they are received. That may or may not be an issue in your case.

If it is, you could do the following. I’ve done it before with great effect. There are probably “nicer” ways to do it, but this one is simple enough.

What I’ve done in the past is create a job that simply runs every minute, picks up all the queues, and fires off one background task for each queue.

Each instance of that background task processes queue items with whatever parallelism is appropriate, and exits when the queue runs dry (or after some fixed amount of time, or after a fixed number of items has been completed).

Here is some rough pseudo code demonstrating what I mean:

    // Schedule this as a recurring job that runs every minute. It only enqueues work, so it exits quickly.
    [DisableConcurrentExecution(timeoutInSeconds: 0)]
    public static void StartAllQueues(PerformContext context) {
        using (var sql = new SqlConnection(ApplicationSettings.ConnectionString())) {
            // GetAllQueues is a placeholder for however you look up your queue names.
            IEnumerable<string> queues = GetAllQueues(sql);
            foreach (var queueName in queues) {
                // Just fire off the job - DisableConcurrentExecutionWithParameters will ensure only one instance per queue runs at a time.
                BackgroundJob.Enqueue(() => QueueProcessor.Process(queueName, null));
            }
        }
    }

    // Don't schedule this job - StartAllQueues enqueues it. It lives in the QueueProcessor class.
    [AutomaticRetry(Attempts = 0)]
    [System.ComponentModel.DisplayName("QueueProcessor.Process {0}")]
    // from https://gist.github.com/sbosell/3831f5bb893b20e82c72467baf8aefea
    [DisableConcurrentExecutionWithParameters(timeoutInSeconds: 0)]
    public static async Task Process(string queueName, PerformContext context) {
        // Dequeue, Execute and QueueConcurrency are placeholders for your own queue logic.
        Action worker = () => {
            while (true) {
                var workItem = Dequeue(queueName);

                if (workItem == null)
                    break; // the queue ran dry, so this worker exits
                Execute(workItem);
            }
        };
        // Run QueueConcurrency(queueName) workers in parallel and wait for all of them to finish.
        await Task.WhenAll(Enumerable.Range(1, QueueConcurrency(queueName)).Select(i => Task.Run(worker)));
    }

I have come to believe our application manages to be affected by every edge case, but luckily this is not one of them, since we do not rely on the order of the tasks. From reading your code, I gathered that it relies on requeuing and so cannot guarantee that the original order is kept.
However, from what I can gather, the reordering only affects tasks that need to be “pushed back”.

Thank you a lot for the code you shared anyway!


Hi, I’ve been looking at this one on nino's team here, thanks for the suggestion :slight_smile:

Got it working, but there is one small Liskov problem: I had to use a non-zero timeout when acquiring the lock. We’re using Hangfire.Pro.Redis as our storage, and it never acquires an available lock with a timeout of zero.

It does work with TimeSpan.FromTicks(1), but I’m not sure whether that will hold up under load, so it’s a bit worrying. Does anyone know?

Edit: By timeout - I’m referring to the one passed to filterContext.Connection.AcquireDistributedLock