Throttling semaphore is not released when a job is moved to the Failed state due to an internal Hangfire error

If a job fails because of an internal Hangfire error, for example “Failed to change state after X attempts”, the job is moved to the Failed state but continues to hold any semaphores associated with it.

This causes problems once the semaphore reaches its limit: other jobs keep attempting to acquire it and never succeed, so they queue up and poll for the semaphore over and over, causing excessive CPU usage on the storage backend.

There also appears to be no workaround for this at the moment: if a job hits an internal Hangfire error, it is moved to the Failed state even if you have configured it to move to the Deleted state on errors (see the sketch below).
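For reference, this is roughly the kind of filter I mean by “configured to move to a deleted state on errors” (a minimal sketch; the filter name is just a placeholder, not our actual code). On internal Hangfire errors the job still ends up in the Failed state, as if this election filter were never called:

    using Hangfire.Common;
    using Hangfire.States;

    // Election filter that rewrites a candidate Failed state to Deleted.
    // On internal Hangfire errors this filter appears to be skipped,
    // which is the behaviour reported in this topic.
    public class DeleteOnFailureAttribute : JobFilterAttribute, IElectStateFilter
    {
        public void OnStateElection(ElectStateContext context)
        {
            if (context.CandidateState is FailedState)
            {
                context.CandidateState = new DeletedState();
            }
        }
    }

We register a filter like this globally via GlobalJobFilters.Filters.Add(...), so it should apply to every job.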

I also have something related to semaphores. I'm curious to know whether each Hangfire worker creates its own application instance. I have a method where I use SemaphoreSlim, and for some reason it looks like more than one worker accessed that method concurrently. I'm wondering if that is possible. The SemaphoreSlim instance is a static field of the class, as shown below.

using System.Threading;
using System.Threading.Tasks;

class MyService
{
    // Static so every MyService instance created in this process
    // shares the same semaphore (at most one job in the section at a time).
    private static readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);

    public async Task Retry()
    {
        // Acquire outside the try block so we only release what was acquired.
        await Semaphore.WaitAsync();
        try
        {
            // do my stuff here
        }
        finally
        {
            Semaphore.Release();
        }
    }
}
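For context, the job is enqueued roughly like this (the call site below is just an illustration, not our exact code):

    using Hangfire;

    // Every enqueued job gets a fresh MyService instance, but the static
    // SemaphoreSlim above is shared by all jobs running in the same process.
    BackgroundJob.Enqueue<MyService>(x => x.Retry());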

By default, throttlers are released after the job moves out of the Processing state, but in the case of internal errors it is likely that any further manipulations by custom filters would lead to another error. Internal errors in Hangfire should be treated as exceptional, and there are built-in retries to recover from transient errors. I believe in this case the error itself should be investigated and fixed. Could you tell me what exception is mentioned in that failed job?

The exceptions we observed were worker limits in our test clusters, i.e. “The request limit for the database is x and has been reached”. My main concern is that other transient SQL exceptions from the Azure SQL gateways could trigger it too, for example “The database … is currently unavailable”, which can happen during maintenance/health events inside Azure.

Hm, with both the request limit and an unavailable database there's a high chance that Hangfire will not even be able to move a background job to the Failed state, even without calling the filters. That lightweight Failed state transition was actually designed as a last chance to recover from a poisoned message, to avoid processing a job with a failing filter again and again.

Hello Odinserj. I would really appreciate it if my question could be answered. Is it possible for multiple workers to process a single recurring job concurrently?

A SemaphoreSlim instance can provide synchronisation only if all of the affected background jobs are processed in the same process, and you'll need to use a static field in this case, because by default the class that contains the background job method is instantiated for each background job.
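If the affected jobs can run in more than one process or on more than one server, an in-process SemaphoreSlim won't help, and a storage-level lock is needed instead. A minimal sketch using the built-in DisableConcurrentExecution filter (the 60-second timeout is just an example value):

    using System.Threading.Tasks;
    using Hangfire;

    class MyService
    {
        // DisableConcurrentExecution takes a distributed lock in job storage,
        // so only one worker across all processes runs Retry() at a time.
        // Other workers wait for up to the given timeout (in seconds)
        // before the attempt fails.
        [DisableConcurrentExecution(60)]
        public async Task Retry()
        {
            // do my stuff here
            await Task.CompletedTask;
        }
    }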
