Timeout/Deadlock occurring in infrastructure

Hi Folks!

We’re seeing some weird timeouts in one of our Production environments that seem to indicate there might be a deadlock going on (although I’m not 100% certain it’s a deadlock).

Our environment:

  • Four IIS Webservers (Windows Server 2008 R2)
  • On each of the IIS Webservers we have our Hangfire-enabled Website (as well as several other Websites)
  • 3 BackgroundJobServers per Hangfire Website: 2 running Jobs off different queues, and one that runs 3 instances of the same IBackgroundProcess with different parameters (see the sketch just after this list)
  • SQL + MSMQ (DTC) for Storage
  • Hangfire.Core, Hangfire.SqlServer, Hangfire.SqlServer.MSMQ are all on version 1.6.6
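
To make the third bullet a bit more concrete, here is a rough sketch of how the three BackgroundJobServers are wired up. This is illustrative only: the queue names and the ReportSweeperProcess class are placeholders standing in for our real code.

// Illustrative only: roughly how the three BackgroundJobServers per Website are created.
// Queue names and ReportSweeperProcess are placeholders, not our real code.
var storage = JobStorage.Current;

// Two servers pulling Jobs off different queues.
var criticalServer = new BackgroundJobServer(
    new BackgroundJobServerOptions { Queues = new[] { "critical" } }, storage);
var defaultServer = new BackgroundJobServer(
    new BackgroundJobServerOptions { Queues = new[] { "default" } }, storage);

// Third server hosts three instances of the same IBackgroundProcess,
// each constructed with different parameters.
var processServer = new BackgroundJobServer(
    new BackgroundJobServerOptions(), storage,
    new IBackgroundProcess[]
    {
        new ReportSweeperProcess("region-a"),
        new ReportSweeperProcess("region-b"),
        new ReportSweeperProcess("region-c")
    });

// Hypothetical stand-in for the real process (implements Hangfire.Server.IBackgroundProcess).
class ReportSweeperProcess : IBackgroundProcess
{
    private readonly string _region;
    public ReportSweeperProcess(string region) { _region = region; }
    public void Execute(BackgroundProcessContext context) { /* real work elided */ }
}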

Somewhere between 0 and 5 times per day, we get a sequence of error messages from one of our Webservers (not always the same server) that looks like the following (I’ll put the actual stacks in at the bottom of this post):

  1. System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached - The bottom of the stacktrace on this exception is RecurringJob.AddOrUpdate

  2. System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) —> System.ComponentModel.Win32Exception (0x80004005): The network path was not found - This seems to come from IBackgroundProcess.Execute

  3. Same as 2.

  4. Same as 2.

We’ve done some log correlation and it looks like on the server we get the error from, the following sequence is happening:

  1. IIS App Pool Recycle

  2. Background Job Server configuration/initialization

  3. RecurringJob.AddOrUpdate - the timeout on this call occurs 15 seconds later

  4. Three errors from the IBackgroundProcesses a couple of seconds later

In one case today we saw a BackgroundJob.Enqueue occur at the same time on a different server, but that isn’t the case every time.

So far the server corrects itself and continues processing messages fine, and we don’t seem to be impacted at the application level. That said, we’ve only been seeing these errors for a couple of weeks, so we’re not sure whether they could have an impact beyond what they already have.

We have confirmed that there is no outage or infrastructure issue between that server and the SQL/MSMQ server. We also get no other errors from any of our other websites that use the same SQL Server and MSMQ instances at the same time (and we’re busy enough that we would see them)… so we don’t think it’s an actual max pool size problem.

We hypothesized that maybe there’s a deadlock occurring on the RecurringJob.AddOrUpdate call, or that somehow MSDTC isn’t getting cleaned up properly after the IIS App Pool recycle and so isn’t accepting new connections.

Here are the two stack traces I mentioned above.

The first:

2017-02-09 17:16:04.3789 | ERROR Error while starting HangFire Background Job Servers.

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
   at Hangfire.RecurringJobManager.AddOrUpdate(String recurringJobId, Job job, String cronExpression, RecurringJobOptions options)
   at Ipsos.Assr.HangfireBootstrapper.ScheduleJobs(IAssrWebConfiguration config) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 141
   at Ipsos.Assr.HangfireBootstrapper.Start(IAssrWebConfiguration config, IContainer container, Boolean isManuallyStarted) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 109

The second, third and fourth (pretty much all the same):

2017-02-09 17:16:12.8477 | WARN Error occurred during execution of 'BackgroundProcessingServer' process. Execution will be retried (attempt 3 of 2147483647) in 00:00:05 seconds.

    System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The network path was not found
       at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
       at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
       at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
       at System.Data.SqlClient.SqlConnection.Open()
       at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
       at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
       at Hangfire.Server.BackgroundProcessingServer.Hangfire.Server.IBackgroundProcess.Execute(BackgroundProcessContext context)
       at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
    ClientConnectionId:00000000-0000-0000-0000-000000000000
    Error Number:53,State:0,Class:20

Thanks in advance for any help!


We are experiencing the very same issue in our system.
It has only been happening since we updated to
“Hangfire.Core”: “1.6.8”

Can you please look into this?

This is not a deadlock; it looks like a SqlConnection pool contention problem. By default, the connection pool has a maximum of 100 connections, and if you are using the same connection string for both your application and Hangfire, they compete for connections from the same pool. Depending on your application, the rate at which background jobs are created, and the worker count, this can lead to a situation where all the connections are in use. Any attempt to get another connection at that point produces the exception above.

You may decrease the number of competing consumers, or increase the max number of connections in the pool.
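
For example, something like this (a rough sketch only; the “HangfireDb” connection string name and the numbers are placeholders, not recommendations):

using System;
using System.Configuration;
using System.Data.SqlClient;
using Hangfire;

// Option 1: raise the pool ceiling for the connection string Hangfire uses.
// "Max Pool Size" is a standard SqlClient connection string keyword; the default is 100.
var connectionString = new SqlConnectionStringBuilder(
    ConfigurationManager.ConnectionStrings["HangfireDb"].ConnectionString)
{
    MaxPoolSize = 200
}.ConnectionString;

GlobalConfiguration.Configuration.UseSqlServerStorage(connectionString);

// Option 2: reduce the number of competing consumers by lowering the worker count
// (the default is Environment.ProcessorCount * 5).
var server = new BackgroundJobServer(new BackgroundJobServerOptions
{
    WorkerCount = Environment.ProcessorCount * 2
});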

I guess I have a hard time understanding how this would be a connection pool problem related to my application code. I know I’m not leaking connections and this exception always happens seconds after an App Pool recycle.

The Webservice that’s running exists only to host Hangfire. So we do an App Pool recycle, start up some BackgroundJobServers, call RecurringJob.AddOrUpdate, and we’ve already used up all the connections in the pool? And yet this happens less than once a day, despite recycling every 15ish minutes?

There is no other code in this Webservice that would be running at the same time.

I am also getting this.

But all of mine are occurring in GetJobData.

In some cases, I see that I have 200 agents running. Should I scale that back to 100?

System.InvalidOperationException

Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
   at Hangfire.SqlServer.SqlServerConnection.GetJobData(String id)
   at Hangfire.Server.Worker.PerformJob(BackgroundProcessContext context, IStorageConnection connection, String jobId)

Please include some code on how to do this, as we are having the same problem. It only occurs when the application pool recycles, for some reason. In .NET Framework, we were using the bootstrapper class + the IIS ApplicationPreload settings to handle app restarts, which worked great. Switching to .NET Core has been a huge amount of trouble. I’ve tried setting the application pool to “AlwaysRunning”, setting the application pool timeout to 0, and keeping the integration setting instead of “No Managed Code”, and nothing has helped so far. If I set

services.AddHangfireServer(options =>
{
    options.WorkerCount = 1;
});

will that help?
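
For completeness, this is the full change I have in mind: the lower worker count plus a larger pool on the Hangfire connection string. The “Hangfire” connection string name and the 200 value below are just my own placeholders, and I don’t know yet whether this actually helps.

using System.Data.SqlClient;
using Hangfire;
using Microsoft.Extensions.Configuration;

// Sketch of what I'm considering, inside Startup.ConfigureServices; not a confirmed fix.
var hangfireConnectionString = new SqlConnectionStringBuilder(
    Configuration.GetConnectionString("Hangfire"))
{
    MaxPoolSize = 200 // default is 100
}.ConnectionString;

services.AddHangfire(config => config.UseSqlServerStorage(hangfireConnectionString));

services.AddHangfireServer(options =>
{
    options.WorkerCount = 1; // fewer competing consumers
});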

Quite some time has passed since this topic was created, but I still get errors like this in our system.

We have a project where we are using Hangfire v1.7.31 in a Linux Docker container, with an MSSQL database to store jobs. The setup is fairly minimal: there is one (1) daily recurring job, which sends one (1) summary email. Within the job, Entity Framework Core is used to query some data. The application and Hangfire share the same database, but there should not be 100+ connections (especially in the DEV system, where the error also appears). Maybe it is a bit over the top to use Hangfire for this, but there might be more jobs in future :).

During the weekly maintenance it sometimes happens that the SQL Server we use is restarted / not reachable. I noticed that the error above always happens after the SQL Server maintenance window.
Is there maybe some edge case where Hangfire keeps stale connections and does not reconnect correctly after the connection to the MSSQL Server is lost?

For PROD I scheduled a weekly restart of my service just after the maintenance window to get rid of this issue, but I don’t think this is the right way to fix this.
Is there any solution to this issue, besides restarting the service and increasing the number of connections (which might just delay the issue)?

Here is my Hangfire configuration. Do you see anything that could cause the issue?

// add auth policy for hangfire
services.AddAuthorization(o =>
{
    o.AddPolicy(HangfireAuthorizationFilter.AuthorizationPolicyName, b =>
    {
        b.AddAuthenticationSchemes(OpenIdConnectDefaults.AuthenticationScheme);
        b.AddRequirements().RequireAuthenticatedUser();
    });
});
// Add Hangfire services.
services.AddHangfire(hangfireConfig => hangfireConfig
    .UseSerilogLogProvider()
    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
    .UseSimpleAssemblyNameTypeSerializer()
    .UseRecommendedSerializerSettings()
    .UseSqlServerStorage(configuration.GetConnectionString("Db"), new SqlServerStorageOptions
    {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero,
        UseRecommendedIsolationLevel = true,
        DisableGlobalLocks = true
    }));

// add Hangfire jobs to DI
var hangfireJobTypes = System.Reflection.Assembly.GetExecutingAssembly()
    .GetTypes().Where(t => t.IsAssignableTo(typeof(IHangfireJob)));
var classes = hangfireJobTypes.Where(t => t.IsClass);
foreach (var hangfireJobClass in classes) {
    services.AddTransient(hangfireJobClass);
}

// restrict worker count => does not fix the issue even if e.g. only 5 workers are used
var maxWorkerCount = configuration.GetValue<int?>("Hangfire:MaxWorkerCount");
services.AddHangfireServer(o => {
    if(maxWorkerCount.HasValue)
    {
        o.WorkerCount = maxWorkerCount.Value;
    }
});
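
One idea I’m considering (just a sketch of something I might try, not a confirmed fix): enabling the SqlClient connection resiliency keywords on the connection string Hangfire uses, so that connections broken during the maintenance window are re-established instead of failing immediately. The values below are guesses:

// Sketch only: ConnectRetryCount / ConnectRetryInterval are standard SqlClient
// connection string keywords for idle-connection resiliency; the values are guesses.
// (SqlConnectionStringBuilder from Microsoft.Data.SqlClient or System.Data.SqlClient,
// whichever the project references.)
var hangfireConnectionString = new Microsoft.Data.SqlClient.SqlConnectionStringBuilder(
    configuration.GetConnectionString("Db"))
{
    ConnectRetryCount = 5,     // default is 1
    ConnectRetryInterval = 10  // seconds between retries; default is 10
}.ConnectionString;

// ...and pass hangfireConnectionString to UseSqlServerStorage(...) above
// instead of configuration.GetConnectionString("Db") directly.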