Hangfire stops processing jobs in queue after database errors on Azure SQL

Tags: #<Tag:0x00007f1861447188> #<Tag:0x00007f1861446fd0> #<Tag:0x00007f1861446d78>

Env
Hangfire, Hangfire.Core, Hangfire.SqlServer : alll version 1.3
Hangfire.Dashboard.Authorization 2.1.0
MVC 5.2.2
Hosted on Microsoft Azure as a Website
Azure SQL Database

Issue
Hangfire works just fine but after several days (let’s say 4-8 days), the jobs no longer get processed.
In the dashboard, the Queues shows 1/1 instead on 0/0 and the DEFAULT queue contains several jobs but they never get processed.
To get the queue to process I need to restart the website from the Azure portal.
When I do so hangfire processes the queue and get’s back to normal but after a few days it happens again.
I also had the same issue with the previous version of hangfire (1.2…) but the version before that worked flawlessly for months. (not sure I think it was 1.1…)

Config
My config and code are pretty basic and hasn’t changed since a while. So I’m pretty sure it’s not a change in config/usage.

Startup config:

        app.UseHangfire(config =>
        {
            config.UseSqlServerStorage("DefaultConnection");
            config.UseAuthorizationFilters(new AuthorizationFilter
            {
                Roles = "Admin"
            });
            config.UseServer();
        });

The “DefaultConnection” is the same as used by EntityFramework (CodeFirst)

This is how I enqueue a job:

BackgroundJob.Enqueue(() => DispatchSavedEvent(id, name));

My method’s signature:

public static void DispatchSavedEvent(int eventId, string eventName)

I checked the application logs and could not find any errors.
It was working at midnight, then hangfire did some " Removing outdated records from table" at a given interval but no error was logged.
Then starting 6am events were enqueued but were never processed, still no errors.

Not sure where to look to resolve this issue… help would be appreciated.

I set my logging level to TRACE and I got the following when the issue happend (since it happens every few days):

 09:18:24.382 Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #2' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): The service has encountered an error processing your request.

09:18:28.102 Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #1' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. 

09:18:30.174 Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #4' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): The service has encountered an error processing your request. Please try again. Error code 40143.

09:18:30.368 Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #5' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) 

09:18:30.651 Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #3' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) 

Looks like Hangfire is stopping the worker when a database error occurs.
Unfortunately on Azure SQL errors can occurs and it is considered normal.
The application should implementing some sort of retry logic, Transient Fault Handling (Building Real-World Cloud Apps with Azure)

Is hangfire using such a logic ?
Also would I be better of using the Azure service bus to schedule the jobs ?
or if hangfire hasn’t some retry logic for the database I will have the same issue anyway ?!

Is there some way to restart the workers when they are stopped because of a database error ?

One full error:

09:18:28.102  Hangfire.Server.Worker |Fatal| Fatal error occurred during execution of 'Worker #1' component. It will be stopped. See the exception for details. System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) ---> System.ComponentModel.Win32Exception (0x80004005): An existing connection was forcibly closed by the remote host
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior)
   at Dapper.SqlMapper.<QueryImpl>d__11`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Dapper.SqlMapper.Query[T](IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Boolean buffered, Nullable`1 commandTimeout, Nullable`1 commandType)
   at Hangfire.SqlServer.SqlServerJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.SqlServer.SqlServerConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(CancellationToken cancellationToken)
   at Hangfire.Server.ServerSupervisor.ExecuteComponent()
   at Hangfire.Server.ServerSupervisor.RunComponent()
ClientConnectionId:b9fe70b0-43b8-4e0e-9cd3-b6a1f7b7609d
Error Number:10054,State:0,Class:20, System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) ---> System.ComponentModel.Win32Exception (0x80004005): An existing connection was forcibly closed by the remote host
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior)
   at Dapper.SqlMapper.<QueryImpl>d__11`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Dapper.SqlMapper.Query[T](IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Boolean buffered, Nullable`1 commandTimeout, Nullable`1 commandType)
   at Hangfire.SqlServer.SqlServerJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.SqlServer.SqlServerConnection.FetchNextJob(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(CancellationToken cancellationToken)
   at Hangfire.Server.ServerSupervisor.ExecuteComponent()
   at Hangfire.Server.ServerSupervisor.RunComponent()
ClientConnectionId:b9fe70b0-43b8-4e0e-9cd3-b6a1f7b7609d
Error Number:10054,State:0,Class:20
1 Like

@chris, thank you for the detailed problem description. This is a regression, I’ve created a bug for it:

Planned to Hangfire 1.3.1

@chris, Hangfire 1.3.1 just released with your bug fix – https://github.com/HangfireIO/Hangfire/releases/tag/v1.3.1

Fantastic ! Thanks a lot !

I’m afraid we’ve just experienced this bug again. And i’m using Hangfire 1.3.4.

“Could not allocate space for object ‘HangFire.Counter’.‘PK_HangFire_Counter’ in database ‘’ because the ‘PRIMARY’ filegroup is full.”

As you can see, the database got full. After increasing the database size, Hang fire did not continue processing the jobs.
I had to manually recycle the app pool to get it running again.

Could it be that this bug was re-introduced after 1.3.1?