DistributedLock TimeoutException with batches, but batch operations get executed

esn303 · September 22, 2015, 11:55am

When using batches with Hangfire 1.5.0-beta3 and Pro 1.4-beta1 (with SQL Express 2014), queued jobs will get executed, but with an error logged after a couple of seconds:

Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:batch:12c77919-af37-4779-8729-b0ddb65dd4c9:lock' resource.

They are thus repeated over and over again, so the job runs forever and the batch job never finishes. The method called for the batch items only contains a simple line log.Debug (log4net) to see if the atomic operations in the batch are executed. They are, but there seems to be a problem with the distributed lock that I don’t understand well enough, it seems.

I couldn’t find anything helpful (at least not for me) or related in the source or on the forum.

Maybe it’s of relevance that the BackgroundJob that is executed to start the BatchJob is executed from within another BackgroundJob? The “outer most” BackgroundJob has the DisableConcurrentExecution attribute set; the others not. Also they are in different queues (via attributes).

Any ideas? Thanks!

odinserj · September 22, 2015, 12:23pm

@esn303, thank you for this report! There was a too small timeout for batch state change operations that may exceed when there are a lot of jobs in a batch, or slow queries to a storage for other reasons. I’ll fix this and release both Pro 1.3.2 and 1.4.0-beta2.

esn303 · September 22, 2015, 12:28pm

Alright, cool. Great we found that. Could you merge these fixes into Pro 1.4-beta as well? Otherwise I would be stuck … Happily using 1.5 for development already. Thanks!

odinserj · September 22, 2015, 1:13pm

Hangfire Pro 1.3.2 and Hangfire Pro 1.4.0-beta2 just released.

esn303 · September 22, 2015, 8:44pm

Wow. Thanks. I upgraded via proget and double checked everything. Cleaned DB tables (removed all of them), et al. Still I get

2015-09-22 22:40:45,164 [Worker #10d07956] DEBUG Hangfire.Server.Worker - Error occurred during execution of 'Worker #1
0d07956' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:01 seconds.
Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed
 lock on the 'HangFire:batch:a689e1f0-2da9-4262-b4a2-64bb76eb8499:lock' resource.
   bei Hangfire.SqlServer.SqlServerDistributedLock.Acquire(IDbConnection connection, String resource, TimeSpan timeout)
   bei Hangfire.SqlServer.SqlServerDistributedLock..ctor(SqlServerStorage storage, String resource, TimeSpan timeout)
   bei Hangfire.SqlServer.SqlServerConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   bei Hangfire.Batches.BatchJobStorageExtensions.AcquireDistributedBatchLock(IStorageConnection connection, String bat

Do you have any recommendation on how to debug? What information would help you isolate the issue? Le me know if/how I can be of assistance?

odinserj · September 26, 2015, 1:37pm

Are you using continuations? If yes, how much background jobs do you have in a batch and in each continuation (approximately)?

Please ensure also that your Hangfire.Pro.dll file under the bin folder is the same as the corresponding file in the packages/Hangfire.Pro.1.4.0-beta2\net45 folder. You can also clean up the solution (Build -> Clean menu item) or delete the entire bin folder manually.

emmekappa · March 14, 2017, 6:06pm

I have the same issue with Hangfire.Pro 1.4.8.0

odinserj · March 15, 2017, 4:32pm

@emmekappa, how many jobs do you have in batches, and what storage are you using, SQL Azure or on-premise SQL Server? Also, could you post here the full stack trace of your exception? If you are using SQL Azure, and have a lot of batched jobs completed almost at the same time, it is possible to get such an exception under normal circumstances. In this case I’ll consider setting a higher value for timeout, or catch it to not to confuse anyone.

emmekappa · March 16, 2017, 8:53am

@odinserj we’re using Redis Cache on Azure.
Our batches are up to 6k jobs.

here’s the stack trace:

    Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'recurring-jobs:lock' resource.
   at Hangfire.Pro.Redis.RedisDistributedLock.RetryUntilTrue(Func`1 action, Nullable`1 timeOut)
   at Hangfire.Pro.Redis.RedisDistributedLock..ctor(IDatabase database, String resource, TimeSpan timeout)
   at Hangfire.Pro.Redis.RedisConnection.DistributedLockWrapper..ctor(RedisConnection connection, String resource, TimeSpan timeout)
   at Hangfire.Pro.Redis.RedisConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

emmekappa · March 20, 2017, 1:41pm

@odinserj news about this?

Topic		Replies	Views
Issue with Hangfire lock timeout bug? sql-server , distributed-locks , aspnetcore	0	197	December 6, 2024
Distributed Lock Timeout Exception - Timeout expired question recurring , aspnetcore	2	16361	February 10, 2021
Long running jobs cause "Distributed Lock Timeout Exception", and multiple instance of the same job bug? sql-server , aspnetcore	0	3676	August 6, 2021
Timeout for DisableConcurrentExecution question sql-server	0	3704	February 14, 2016
Why do I often get a TimeoutException? question	0	1213	February 7, 2017

DistributedLock TimeoutException with batches, but batch operations get executed

Related topics