Hangfire stuck after SQL exception

AndyHen · December 8, 2020, 2:36pm

Hi everyone,
I am new to Hangfire and recently deployed my first app using Hangfire to a production server. I am using Hangfire 1.7.18 with just MSSQL Server. The production machine is Ubuntu 16.04.
I am facing an intermittent issue: usually after a few hours of running (3 to 20), Hangfire gets stuck after a SqlException ("A transport-level error has occurred when receiving results from the server."). Even after the log shows “Execution DelayedJobScheduler recovered from the Failed state after 00:00:59.2297707 and is in the Running state now” about a minute later, my dotnet process uses 100% CPU and Hangfire stops processing jobs. SQL Server itself seems to be running normally in that state; there is no large number of connections, etc.
I reduced the worker count to 1 to see whether it is some kind of concurrency problem, but it is not.
I am using Dependency Injection to create my DbContext, so there shouldn’t be an unclosed connection leak there.
I understand that this is basically a SQL Server bottleneck problem, since the exception happens at moments when I try to enqueue several jobs at the same time. Normally we have approx. 1 job per minute, but it depends on incoming HTTP requests from clients, so it can happen that 5 requests arrive at the same time and overwhelm our server for a few seconds. My problem is that Hangfire seems to fail to recover from that state.
Am I missing something?
Please see the relevant logs attached.
Thank you very much for any input. I am pretty tired of restarting the server every few hours.

Log1.txt
Log2.txt
Log3.txt

Vincent_Blain · December 15, 2020, 6:23am

Could you share your connection string (with any sensitive values removed)?

Also, have you tried running Hangfire as a Windows service with multiple server instances to share the load?

Because the error is transient, I would also recommend changing the connection settings to enable transient-fault retry.

AndyHen · December 15, 2020, 2:10pm

Hello, thank you for your response.
During my “investigation” I found out that load is probably not the cause. Sometimes the issue I described above occurs even in an idle state, when for a minute or so there was no job to process and no jobs to enqueue. I read in the SQL Server log that just before the transient errors start to appear in my app, SQL Server restarts itself. Or at least I see SQL Server startup messages in its log. So this explains the exceptions, but not the cause of the server restart, nor why Hangfire loads my processor to 100% afterward and does not fully recover until I restart my dotnet service.
Here is my connection string for the Hangfire database:

Data Source=.;Initial Catalog=HangFireDb;User Id=username;Password=password;TrustServerCertificate=true;Connection Timeout=30;

Are you suggesting to set ConnectRetryCount to some higher value than the default one (1)?

Thank you very much for your help.

Vincent_Blain · December 15, 2020, 9:51pm

My suggestion would be to set up connection resiliency for the process, so that even if the SQL Server drops off you do not suffer data loss.

But your issue here seems to depend on a few factors (though it is hard to tell without some code to go with it). I would check the following as initial troubleshooting.
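
As a starting point, retry can be requested in the connection string itself. This is only a sketch based on the string you posted: it assumes your SqlClient version honors the `ConnectRetryCount` and `ConnectRetryInterval` keywords on Linux, and the values 5 and 10 (seconds) are arbitrary placeholders, not tested recommendations.

```
Data Source=.;Initial Catalog=HangFireDb;User Id=username;Password=password;TrustServerCertificate=true;Connection Timeout=30;ConnectRetryCount=5;ConnectRetryInterval=10;
```

As far as I know, these keywords only help with re-establishing a dropped connection. They do not re-run a command that fails mid-flight, so they may not remove the transport-level exception entirely.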

Based on your comments:

DI - The connection context should be loaded at the service level to prevent reloading the connection on every execution
DI - Scope - Lifetime
EF - To keep connection pool usage down, queries should use .AsNoTracking() / .ToList(). More often than not this is the cause of high connection pool usage with EF, as we developers tend to make joins on IQueryable objects, and for each join a connection happens.
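
To make the DI and EF points above concrete, here is a minimal sketch, not your actual code. `Order`, `AppDbContext` and `OrderJob` are hypothetical names. I am assuming EF Core with the context registered through `AddDbContext` (scoped lifetime by default), and that Hangfire.AspNetCore resolves the job class from a DI scope created for each job, which is my understanding of its default activator.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity, for illustration only.
public class Order
{
    public int Id { get; set; }
    public bool Processed { get; set; }
}

// Hypothetical context. Assumed to be registered in Startup with
// services.AddDbContext<AppDbContext>(o => o.UseSqlServer(connectionString));
// which gives it a scoped lifetime by default.
public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }

    public DbSet<Order> Orders => Set<Order>();
}

// Hypothetical job class. The context is injected, so the job never
// creates or caches a DbContext (or a connection) by itself.
public class OrderJob
{
    private readonly AppDbContext _db;

    public OrderJob(AppDbContext db) => _db = db;

    public List<Order> LoadPending()
    {
        return _db.Orders
            .AsNoTracking()            // read-only query: skip change tracking
            .Where(o => !o.Processed)  // filter runs in SQL, not in memory
            .ToList();                 // materialize now instead of passing IQueryable around
    }
}
```

The point of `.ToList()` here is that the query runs once, at a known place, rather than being re-enumerated later by whatever code receives the `IQueryable`.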

This is of course very high level. I’m decent at troubleshooting, but I can only do so much without seeing what is going on.

For Hangfire, I would not run it on the same server as SQL Server. In my environment, IIS, SQL Server and the Hangfire Windows services are all on separate servers. Hangfire itself uses very little CPU, so I would focus on the task that the worker is trying to run.
