Hi Folks!
We’re seeing some weird timeouts in one of our Production environments that seem to indicate their might be a deadlock going on (although I’m not 100% certain it’s a deadlock).
Our environment:
- Four IIS Webservers (Windows Server 2008 R2)
- On each of the IIS Webservers we have our Hangfire-enabled Website (as well as several other Websites)
- 3 BackgroundJobServers per Hangfire Website, 2 for running Jobs off different queues and one that runs 3 instances of the same IBackgroundProcess with different parameters
- SQL + MSMQ (DTC) for Storage
- Hangfire.Core, Hangfire.SqlServer, Hangfire.SqlServer.MSMQ are all on version 1.6.6
Somewhere between 0 and 5 times per day, we get a sequence of error messages from one of our Webservers (not always the same server) that looks likes like the following (I’ll put the actual stacks in at the bottom of this post):
-
System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached - The bottom of the stacktrace on this exception is RecurringJob.AddOrUpdate
-
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) —> System.ComponentModel.Win32Exception (0x80004005): The network path was not found - This seems to come from IBackgroundProcess.Execute
-
Same as 2.
-
Same as 2.
We’ve done some log correlation and it looks like on the server we get the error from, the following sequence is happening:
-
IIS App Pool Recycle
-
Background Job Server configuration/initialization
-
RecurringJob.AddOrUpdate - 15 seconds later the timeout on this call
-
Three errors in IBackgroundProcesses a couple seconds later
In one case today we saw a BackgroundJob.Enqueue occur at the same time on a different server, but not for every case.
The server corrects itself and continues processing messages fine so far, and we don’t seem to be impacted at the application level, but we’ve also only been seeing these errors for a couple weeks now, so not sure if there’s an opportunity for them to impact beyond what they already have.
We have confirmed that there is no outage/infrastructure issue between that server and the SQL/MSMQ Server. We also get no other errors from any of our other websites that are using the same SQL Server and MSMQ instances at the same time (and we’re busy enough that we would)…so we don’t think it’s an actual max pool size problem.
We hypothesized that maybe there’s a deadlock occurring on the RecurringJob.AddOrUpdate or that somehow the MSDTC isn’t getting cleaned up properly after the IIS App Pool Recycle and it isn’t accepting new connections?
Here are the two stack traces I mentioned from above.
The first:
2017-02-09 17:16:04.3789 | ERROR Error while starting HangFire Background Job Servers.
System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.Open()
at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
at Hangfire.RecurringJobManager.AddOrUpdate(String recurringJobId, Job job, String cronExpression, RecurringJobOptions options)
at Ipsos.Assr.HangfireBootstrapper.ScheduleJobs(IAssrWebConfiguration config) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 141
at Ipsos.Assr.HangfireBootstrapper.Start(IAssrWebConfiguration config, IContainer container, Boolean isManuallyStarted) in D:\TeamCity-BuildAgent\work\d8f3e61fd489f501\MIS\Assr\Ipsos.Assr\App_Start\HangfireBootstrapper.cs:line 109
The second, third and fourth (pretty much all the same):
2017-02-09 17:16:12.8477 | WARN Error occurred during execution of 'BackgroundProcessingServer' process. Execution will be retried (attempt 3 of 2147483647) in 00:00:05 seconds.
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The network path was not found
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
at System.Data.SqlClient.SqlConnection.Open()
at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func`2 func)
at Hangfire.Server.BackgroundProcessingServer.Hangfire.Server.IBackgroundProcess.Execute(BackgroundProcessContext context)
at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
ClientConnectionId:00000000-0000-0000-0000-000000000000
Error Number:53,State:0,Class:20
Thanks in advance for any help!