Batching hundreds of jobs causes connection pool timeout

Tags: #<Tag:0x00007fb0053c09f8> #<Tag:0x00007fb0053c0930>

I have a case where I want to enqueue hundreds of jobs in a batch, and the enqueued jobs contain database operations on the same database server as the one Hangfire runs on.

The problem is that after starting the batch, the Hangfire dashboard becomes unresponsive, and I start getting lots of errors like this:

Error occurred during execution of 'Worker #d726f5b8' process. Execution will be retried (attempt 1 of 2147483647) in 00:00:00 seconds.

Stack trace:

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
   at Hangfire.SqlServer.SqlServerStorage.UseTransaction[T](Func`3 func, Nullable`1 isolationLevel)
   at Hangfire.SqlServer.SqlServerStorage.UseTransaction(Action`2 action)
   at Hangfire.SqlServer.SqlServerWriteOnlyTransaction.Commit()
   at Hangfire.Batches.BatchJobSupportAttribute.OnStateApplied(ApplyStateContext context, IWriteOnlyTransaction outerTransaction)
   at Hangfire.States.StateMachine.ApplyState(ApplyStateContext initialContext)
   at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context, BackgroundJob backgroundJob, IState toState, String oldStateName)
   at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

It would seem the batching of the jobs causes some kind of connection pool shortage.

These errors do not appear if I queue the jobs up normally.

Using Hangfire 1.6.3 with Hangfire Pro 1.4.7.

Any idea how to fix this?

There was a problem related to connection leaks, when nested distributed locks acquired for the same resource. The problem is related to Hangfire.SqlServer, and introduced in version 1.5.0, the latest Pro version helped to reveal it. Both batches and continuations are affected.

I’ve just released Hangfire 1.6.4 with the fix, thanks for reporting this!

1 Like

Hi,

It looks the problem is exists in version 1.6.4…

              foreach ( i=0-316)
                {                   
                    foreach (x=0-2) 
                    {
                        parentId = BackgroundJob.Enqueue<Task1Wapper>(task => task.Execute(i,x));
                    }                 
                }

and the code task wrapper is:

public class Task1Wrapper
{

    public void Execute(int i, int x)
    {
        using (MyDbContext context = new MyDbContext())
        {
              // do sth     
        }
    }
} 

The following photo shows my Performance counter, the blue line refers to Number of Connection Pooled. At the end of diagram, the system throws a timeout exception.

The NumberOfPooledConnections performance counter does not explain the timeout exception. This is how the connection pool works: it creates a lot of connections in advance, to decrease the delays caused by opening a connection. The correct performance counter that may show connections aren’t closed is NumberOfActiveConnections, but it requires the following code in web.config or app.config:

<system.diagnostics>
  <switches>
    <add name="ConnectionPoolPerformanceCounterDetail"
         value="4"/>
  </switches>
</system.diagnostics>

Can you post here the result of NumberOfActiveConnections counter, as well as full information about exception, including message and stack trace?