Throttling semaphore is not released when a job is moved to the Failed state due to an internal Hangfire error

If a job fails because of an internal Hangfire error, for example “Failed to change state after X attempts”, the job is moved to the Failed state but continues to hold any semaphores associated with it.

This causes problems once the semaphore reaches its limit: other jobs keep attempting to acquire it and never succeed, so they queue up and poll for the semaphore over and over, causing excessive CPU usage on the storage backend.

There also appears to be no workaround for this at the moment: if a job hits an internal Hangfire error, it is moved to the Failed state even if you have configured it to move to the Deleted state on errors (see the sketch below).
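For reference, this is roughly the kind of filter I mean by “configured to move to a deleted state on errors” (a minimal sketch; the filter name is just a placeholder, not our actual code). On internal Hangfire errors the job still ends up in the Failed state, as if this election filter were never called:

    using Hangfire.Common;
    using Hangfire.States;

    // Election filter that rewrites a candidate Failed state to Deleted.
    // On internal Hangfire errors this filter appears to be skipped,
    // which is the behaviour reported in this topic.
    public class DeleteOnFailureAttribute : JobFilterAttribute, IElectStateFilter
    {
        public void OnStateElection(ElectStateContext context)
        {
            if (context.CandidateState is FailedState)
            {
                context.CandidateState = new DeletedState();
            }
        }
    }

We register a filter like this globally via GlobalJobFilters.Filters.Add(...), so it should apply to every job.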

I also have something related to semaphores. I'm curious to know whether each Hangfire worker creates its own application instance. I have a method where I use SemaphoreSlim, and for some reason it looks like more than one worker accessed that method concurrently. I'm wondering if that is possible. The SemaphoreSlim instance is a static field of the class, as shown below.

using System.Threading;
using System.Threading.Tasks;

class MyService
{
    // Static so every MyService instance created in this process
    // shares the same semaphore (at most one job in the section at a time).
    private static readonly SemaphoreSlim Semaphore = new SemaphoreSlim(1, 1);

    public async Task Retry()
    {
        // Acquire outside the try block so we only release what was acquired.
        await Semaphore.WaitAsync();
        try
        {
            // do my stuff here
        }
        finally
        {
            Semaphore.Release();
        }
    }
}
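For context, the job is enqueued roughly like this (the call site below is just an illustration, not our exact code):

    using Hangfire;

    // Every enqueued job gets a fresh MyService instance, but the static
    // SemaphoreSlim above is shared by all jobs running in the same process.
    BackgroundJob.Enqueue<MyService>(x => x.Retry());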

By default, throttlers are released after the job moves out of the Processing state, but in the case of internal errors it is likely that any further manipulations by custom filters would lead to another error. Internal errors in Hangfire should be treated as exceptional, and there are built-in retries to recover from transient errors. I believe in this case the error itself should be investigated and fixed. Could you tell me what exception is mentioned in that failed job?

The exceptions we observed were worker limits in our test clusters, i.e. “The request limit for the database is x and has been reached”. My main concern is that other transient SQL exceptions from the Azure SQL gateways could trigger it too, for example “The database … is currently unavailable”, which can happen during maintenance/health events inside Azure.

Hm, with both the request limit and an unavailable database there's a high chance that Hangfire will not even be able to move a background job to the Failed state, even without calling the filters. That lightweight Failed state transition was actually designed as a last chance to recover from a poisoned message, to avoid processing a job with a failing filter again and again.

Hello Odinserj. I would really appreciate it if my question could be answered. Is it possible for multiple workers to process a single recurring job concurrently?

A SemaphoreSlim instance can provide synchronisation only if all of the affected background jobs are processed in the same process, and you'll need to use a static field in this case, because by default the class that contains the background job method is instantiated for each background job.
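If the affected jobs can run in more than one process or on more than one server, an in-process SemaphoreSlim won't help, and a storage-level lock is needed instead. A minimal sketch using the built-in DisableConcurrentExecution filter (the 60-second timeout is just an example value):

    using System.Threading.Tasks;
    using Hangfire;

    class MyService
    {
        // DisableConcurrentExecution takes a distributed lock in job storage,
        // so only one worker across all processes runs Retry() at a time.
        // Other workers wait for up to the given timeout (in seconds)
        // before the attempt fails.
        [DisableConcurrentExecution(60)]
        public async Task Retry()
        {
            // do my stuff here
            await Task.CompletedTask;
        }
    }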
