DistributedLock TimeoutException with batches, but batch operations get executed

sql-server
hangfire-pro
Tags: #<Tag:0x00007fa53bf03588> #<Tag:0x00007fa53bf03448>

#1

When using batches with Hangfire 1.5.0-beta3 and Pro 1.4-beta1 (with SQL Express 2014), queued jobs will get executed, but with an error logged after a couple of seconds:

Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'HangFire:batch:12c77919-af37-4779-8729-b0ddb65dd4c9:lock' resource.

They are thus repeated over and over again, so the job runs forever and the batch job never finishes. The method called for the batch items only contains a simple line log.Debug (log4net) to see if the atomic operations in the batch are executed. They are, but there seems to be a problem with the distributed lock that I don’t understand well enough, it seems.

I couldn’t find anything helpful (at least not for me) or related in the source or on the forum.

Maybe it’s of relevance that the BackgroundJob that is executed to start the BatchJob is executed from within another BackgroundJob? The “outer most” BackgroundJob has the DisableConcurrentExecution attribute set; the others not. Also they are in different queues (via attributes).

Any ideas? Thanks!


#2

@esn303, thank you for this report! There was a too small timeout for batch state change operations that may exceed when there are a lot of jobs in a batch, or slow queries to a storage for other reasons. I’ll fix this and release both Pro 1.3.2 and 1.4.0-beta2.


#3

Alright, cool. Great we found that. Could you merge these fixes into Pro 1.4-beta as well? Otherwise I would be stuck … :wink: Happily using 1.5 for development already. Thanks!


#4

Hangfire Pro 1.3.2 and Hangfire Pro 1.4.0-beta2 just released.


#5

Wow. Thanks. I upgraded via proget and double checked everything. Cleaned DB tables (removed all of them), et al. Still I get

2015-09-22 22:40:45,164 [Worker #10d07956] DEBUG Hangfire.Server.Worker - Error occurred during execution of 'Worker #1
0d07956' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:01 seconds.
Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed
 lock on the 'HangFire:batch:a689e1f0-2da9-4262-b4a2-64bb76eb8499:lock' resource.
   bei Hangfire.SqlServer.SqlServerDistributedLock.Acquire(IDbConnection connection, String resource, TimeSpan timeout)
   bei Hangfire.SqlServer.SqlServerDistributedLock..ctor(SqlServerStorage storage, String resource, TimeSpan timeout)
   bei Hangfire.SqlServer.SqlServerConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   bei Hangfire.Batches.BatchJobStorageExtensions.AcquireDistributedBatchLock(IStorageConnection connection, String bat

Do you have any recommendation on how to debug? What information would help you isolate the issue? Le me know if/how I can be of assistance? :slight_smile:


#6

Are you using continuations? If yes, how much background jobs do you have in a batch and in each continuation (approximately)?

Please ensure also that your Hangfire.Pro.dll file under the bin folder is the same as the corresponding file in the packages/Hangfire.Pro.1.4.0-beta2\net45 folder. You can also clean up the solution (Build -> Clean menu item) or delete the entire bin folder manually.


#7

I have the same issue with Hangfire.Pro 1.4.8.0


#8

@emmekappa, how many jobs do you have in batches, and what storage are you using, SQL Azure or on-premise SQL Server? Also, could you post here the full stack trace of your exception? If you are using SQL Azure, and have a lot of batched jobs completed almost at the same time, it is possible to get such an exception under normal circumstances. In this case I’ll consider setting a higher value for timeout, or catch it to not to confuse anyone.


#9

@odinserj we’re using Redis Cache on Azure.
Our batches are up to 6k jobs.

here’s the stack trace:

    Hangfire.Storage.DistributedLockTimeoutException: Timeout expired. The timeout elapsed prior to obtaining a distributed lock on the 'recurring-jobs:lock' resource.
   at Hangfire.Pro.Redis.RedisDistributedLock.RetryUntilTrue(Func`1 action, Nullable`1 timeOut)
   at Hangfire.Pro.Redis.RedisDistributedLock..ctor(IDatabase database, String resource, TimeSpan timeout)
   at Hangfire.Pro.Redis.RedisConnection.DistributedLockWrapper..ctor(RedisConnection connection, String resource, TimeSpan timeout)
   at Hangfire.Pro.Redis.RedisConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

#10

@odinserj news about this?