Hangfire stuck after SQL exception

Hi everyone,
I am new to Hangfire and recently deployed my first Hangfire-based app to a production server. I am using Hangfire 1.7.18 with MSSQL Server only. The production machine runs Ubuntu 16.04.
I am facing an intermittent issue: usually after a few hours of running (3 to 20), Hangfire gets stuck after a SQL exception (A transport-level error has occurred when receiving results from the server.). Even after the log shows “Execution DelayedJobScheduler recovered from the Failed state after 00:00:59.2297707 and is in the Running state now” about a minute later, my dotnet process uses 100% CPU and Hangfire stops processing jobs. SQL Server seems to be running normally in that state; there is no large number of connections, etc.
I reduced the worker count to 1 to see whether it is some kind of concurrency problem, but it is not.
I am using dependency injection to create my DbContext, so there shouldn’t be a connection leak there.
I assume it is basically a SQL Server bottleneck, since this exception happens at moments when I try to enqueue several jobs at the same time. Normally we have approx. 1 job per minute, but it depends on incoming HTTP requests from clients, so it can happen that 5 requests arrive at the same time and overwhelm our server for a few seconds. My problem is that Hangfire seems to fail to recover from that state.
Am I missing something?
Please see the relevant logs attached.
Thank you very much for any input. I am pretty tired of restarting the server every few hours.

Log1.txt
Log2.txt
Log3.txt

Could you share your connection string, with any compromising values removed, that is?

Also, have you tried running Hangfire as a Windows service with multiple server instances to share the load?

Given the transient error, I would also recommend configuring transient retry in the connection settings.

Hello, thank you for your response.
During my “investigation” I found out that load is probably not the cause. Sometimes the issue I described above occurs even in an idle state, when for a minute or so there was no job to process and no jobs to enqueue. I read in the SQL Server log that just before the transient errors start to appear in my app, SQL Server restarts itself. Or at least I see SQL Server startup messages in its log. So this explains the exceptions, but not the cause of the server restart, or why Hangfire loads my processor to 100% afterward and there is no full recovery until I restart my dotnet service.
Here is my connection string to the Hangfire database:

Data Source=.;Initial Catalog=HangFireDb;User Id=username;Password=password;TrustServerCertificate=true;Connection Timeout=30;

Are you suggesting setting ConnectRetryCount to a higher value than the default (1)?

Thank you very much for your help.

My suggestion would be to set up connection resiliency in the process, so that even if SQL Server drops off you would not suffer data loss.

But your issue seems to come down to a few factors (though it is hard to tell without some code to go with it). I would check the following as initial troubleshooting.
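
As a rough sketch of what I mean, assuming your app uses SqlClient and that it honors the ConnectRetryCount and ConnectRetryInterval keywords on your platform, the connection string could look something like this (the values 3 and 10 are only illustrative, not tuned recommendations):

```
Data Source=.;Initial Catalog=HangFireDb;User Id=username;Password=password;TrustServerCertificate=true;Connection Timeout=30;ConnectRetryCount=3;ConnectRetryInterval=10;
```

As far as I know, these settings only help re-establish a broken idle connection; a command that fails mid-flight will still throw, so on their own they are unlikely to explain or fix the 100% CPU afterwards.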

Based on your comments:

  1. DI - The connection context should be loaded at the service level to prevent reloading the connection on every execution
  2. DI - Scope - Lifetime
  3. EF - Queries should use .AsNoTracking() / .ToList() to avoid heavy connection pool usage. This is, more often than not, the cause of high connection pool usage when using EF, as we developers tend to make joins on IQueryable objects, and for each join a connection happens.

This is of course very high level; I’m decent at troubleshooting but can only do so much without seeing what is going on :wink:

As for Hangfire, I would not run it on the same server as SQL Server. In my environment, IIS, SQL Server, and the Hangfire Windows services are all on separate servers. Hangfire itself uses very little CPU, so the question is really about the task that the worker is trying to run.