Timeout/Deadlock occurring in infrastructure

queues
sql-server
msdtc
recurring

#1

Hi Folks!

We’re seeing some weird timeouts in one of our Production environments that seem to indicate there might be a deadlock going on (although I’m not 100% certain it’s a deadlock).

Our environment:

  • Four IIS Webservers (Windows Server 2008 R2)
  • On each of the IIS Webservers we have our Hangfire-enabled Website (as well as several other Websites)
  • 3 BackgroundJobServers per Hangfire Website: 2 running Jobs off different queues, and 1 running 3 instances of the same IBackgroundProcess with different parameters
  • SQL + MSMQ (DTC) for Storage
  • Hangfire.Core, Hangfire.SqlServer, Hangfire.SqlServer.MSMQ are all on version 1.6.6

Somewhere between 0 and 5 times per day, we get a sequence of error messages from one of our Webservers (not always the same server) that looks like the following (I’ll put the actual stacks in at the bottom of this post):

  1. System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached - The bottom of the stacktrace on this exception is RecurringJob.AddOrUpdate

  2. System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) —> System.ComponentModel.Win32Exception (0x80004005): The network path was not found - This seems to come from IBackgroundProcess.Execute

  3. Same as 2.

  4. Same as 2.

We’ve done some log correlation and it looks like on the server we get the error from, the following sequence is happening:

  1. IIS App Pool Recycle

  2. Background Job Server configuration/initialization

  3. RecurringJob.AddOrUpdate - 15 seconds later the timeout on this call

  4. Three errors in IBackgroundProcesses a couple seconds later

In one case today we saw a BackgroundJob.Enqueue occur at the same time on a different server, but not for every case.

The server corrects itself and continues processing messages fine, and we don’t seem to be impacted at the application level. That said, we’ve only been seeing these errors for a couple of weeks, so we’re not sure whether they could have an impact beyond what we’ve already observed.

We have confirmed that there is no outage/infrastructure issue between that server and the SQL/MSMQ Server. We also get no other errors from any of our other websites that are using the same SQL Server and MSMQ instances at the same time (and we’re busy enough that we would)…so we don’t think it’s an actual max pool size problem.

We hypothesized that maybe there’s a deadlock occurring in RecurringJob.AddOrUpdate, or that somehow the MSDTC isn’t getting cleaned up properly after the IIS App Pool recycle and is refusing new connections?

Here are the two stack traces I mentioned above.

The first:

2017-02-09 17:16:04.3789 | ERROR Error while starting HangFire Background Job Servers.

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
   at Hangfire.RecurringJobManager.AddOrUpdate(String recurringJobId, Job job, String cronExpression, RecurringJobOptions options)
   at Ipsos.Assr.HangfireBootstrapper.ScheduleJobs(IAssrWebConfiguration config) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 141
   at Ipsos.Assr.HangfireBootstrapper.Start(IAssrWebConfiguration config, IContainer container, Boolean isManuallyStarted) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 109

The second, third and fourth (pretty much all the same):

2017-02-09 17:16:12.8477 | WARN Error occurred during execution of 'BackgroundProcessingServer' process. Execution will be retried (attempt 3 of 2147483647) in 00:00:05 seconds.

    System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The network path was not found
       at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
       at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
       at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
       at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
       at System.Data.SqlClient.SqlConnection.Open()
       at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
       at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
       at Hangfire.Server.BackgroundProcessingServer.Hangfire.Server.IBackgroundProcess.Execute(BackgroundProcessContext context)
       at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
    ClientConnectionId:00000000-0000-0000-0000-000000000000
    Error Number:53,State:0,Class:20

Thanks in advance for any help!


#2

We are experiencing the very same issue in our system.
This has only happened since the last update to
“Hangfire.Core”: “1.6.8”

Can you please look into this?


#3

This is not a deadlock; it looks like a SqlConnection pool contention problem. By default, the connection pool allows a maximum of 100 connections, and if you are using the same connection string for both your application and Hangfire, they compete for connections from the same pool. Depending on your application, the rate at which background jobs are created, and the worker count, all of the connections may end up in use at once. Try to get another connection in that state, and you’ll get the exception above.

You may decrease the number of competing consumers, or increase the maximum number of connections in the pool.
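For example, since ADO.NET keys its pools on the exact connection-string text, giving Hangfire its own connection string (even one that differs only in `Application Name`) gives it a separate pool, and `Max Pool Size` raises the cap. A sketch of what that could look like in web.config; the server, database, and pool sizes are placeholders, not values from this thread:

```xml
<connectionStrings>
  <!-- Distinct strings draw from distinct pools, so Hangfire and the
       application can no longer starve each other of connections. -->
  <add name="AppDb"
       connectionString="Server=SQLHOST;Database=AppDb;Integrated Security=SSPI;Max Pool Size=100" />
  <add name="HangfireDb"
       connectionString="Server=SQLHOST;Database=AppDb;Integrated Security=SSPI;Application Name=Hangfire;Max Pool Size=200" />
</connectionStrings>
```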


#4

I guess I have a hard time understanding how this would be a connection pool problem related to my application code. I know I’m not leaking connections and this exception always happens seconds after an App Pool recycle.

The Webservice that’s running only exists to host Hangfire. So we do an App Pool recycle, start up some BackgroundJobServers, call RecurringJob.AddOrUpdate, and we’ve already used up all the connections in the pool? And this only happens less than once a day, despite recycling every 15-ish minutes?

There is no other code in this Webservice that would be running at the same time.


#5

I am getting this as well, but all of mine are occurring in GetJobData.

In some cases I see that I have 200 agents running. Should I scale that back to 100?

System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
   at Hangfire.SqlServer.SqlServerConnection.GetJobData(String id)
   at Hangfire.Server.Worker.PerformJob(BackgroundProcessContext context, IStorageConnection connection, String jobId)
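If each worker can hold a connection, keeping the total worker count at or below the pool’s maximum avoids this kind of exhaustion. A minimal sketch of capping workers when configuring the server, assuming an OWIN-hosted Hangfire setup; the count shown is illustrative:

```csharp
var options = new BackgroundJobServerOptions
{
    // Cap concurrent workers so they cannot outnumber the
    // connection pool's default limit of 100 connections.
    WorkerCount = 50
};
app.UseHangfireServer(options);
```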