Workers seem to hang and the number of active workers slowly decreases to 0

We have an issue with Hangfire 1.4.3. We process jobs in a background Windows process; the jobs are created in web applications. Backend storage is SQL Server. We process around 5.000.0000 jobs per 24 hours.

We currently see in the log file that a worker sometimes stops processing new jobs. It stops halfway through our own code (no crash or anything else in the log). After a period of 1 all workers end up in this state and the service stops processing any jobs.

Is there any way to set a timeout on the maximum duration a single job can run within a worker?
I do see in the code that it's hard to reinitialize a worker; you basically have to dispose and reinitialize the BackgroundJobServer.

Below is part of the log where you can see worker 4 disappear. The last log statement is in our own code, but that worker never recovers. The PM log lines are generated by logging I added to the Hangfire.Server.Worker.Execute method.

2015-08-06 07:31:50,692 [Worker #4] DEBUG Hangfire.Server.Worker - PM: starting new job 6467407
2015-08-06 07:31:50,712 [Worker #4] INFO BackgroundJobService - AddToAccountLog -> Saving AuditDate=06/08/2015 05:31:46, AuditLogTypeId=Disable, AuditObjectId=Product, UserId=1908545, ClientId=2425734, ValueBefore=, ValueAfter={"clientproductStatus":false,"disableReason":"NON_PAYMENT"}
2015-08-06 07:31:50,738 [Worker #4] DEBUG Hangfire.Server.Worker - PM: finished job 6467407
2015-08-06 07:31:50,741 [Worker #4] DEBUG Hangfire.Server.Worker - PM: starting new job 6467408
2015-08-06 07:31:50,756 [Worker #4] INFO BackgroundJobService - Start SendPushTokens userid: 2550448
2015-08-06 07:31:50,758 [Worker #4] INFO BackgroundJobService - PushTokens for userid: 2550448, count: 3
2015-08-06 07:31:50,761 [Worker #4] INFO BackgroundJobService - User id found in Recipients: 2550448
2015-08-06 07:31:59,264 [Worker #5] DEBUG Hangfire.Server.Worker - PM: starting new job 6467409
2015-08-06 07:31:59,285 [Worker #5] INFO BackgroundJobService - Start SendPushTokens userid: 2474850
2015-08-06 07:31:59,289 [Worker #5] INFO BackgroundJobService - PushTokens for userid: 2474850, count: 1
2015-08-06 07:31:59,292 [Worker #5] INFO BackgroundJobService - User id found in Recipients: 2474850
2015-08-06 07:31:59,297 [Worker #1] DEBUG Hangfire.Server.Worker - PM: starting new job 6467410
2015-08-06 07:31:59,315 [Worker #1] INFO BackgroundJobService - Start SendPushTokens userid: 2447702
2015-08-06 07:31:59,317 [Worker #1] INFO BackgroundJobService - PushTokens for userid: 2447702, count: 1
2015-08-06 07:31:59,320 [Worker #1] INFO BackgroundJobService - User id found in Recipients: 2447702
2015-08-06 07:31:59,374 [Worker #2] DEBUG Hangfire.Server.Worker - PM: starting new job 6467411
2015-08-06 07:31:59,387 [Worker #2] INFO BackgroundJobService - Start SendPushTokens userid: 2464063
2015-08-06 07:31:59,389 [Worker #2] INFO BackgroundJobService - PushTokens for userid: 2464063, count: 1
2015-08-06 07:31:59,392 [Worker #2] INFO BackgroundJobService - User id found in Recipients: 2464063
2015-08-06 07:31:59,398 [Worker #3] DEBUG Hangfire.Server.Worker - PM: starting new job 6467412
2015-08-06 07:31:59,413 [Worker #3] INFO BackgroundJobService - Start SendPushTokens userid: 1863531
2015-08-06 07:31:59,415 [Worker #3] INFO BackgroundJobService - PushTokens for userid: 1863531, count: 1
2015-08-06 07:31:59,418 [Worker #3] INFO BackgroundJobService - User id found in Recipients: 1863531
2015-08-06 07:31:59,549 [Worker #5] INFO BackgroundJobService - End SendPushTokens userid: 2474850
2015-08-06 07:31:59,574 [Worker #5] DEBUG Hangfire.Server.Worker - PM: finished job 6467409

                // Checkpoint #4. The job was performed, and it is in the one
                // of the explicit states (Succeeded, Scheduled and so on).
                // It should not be re-queued, but we still need to remove its
                // processing information.

                fetchedJob.RemoveFromQueue();

                // Success point. No things must be done after previous command
                // was succeeded.
                Logger.Log(LogLevel.Debug, () => "PM: finished job " + jobid);
            }
            catch (JobAbortedException a)
            {
                Logger.ErrorException("PM: JobAbortedException exception", a);
                fetchedJob.RemoveFromQueue();
            }
            catch (Exception ex)
            {
                Logger.ErrorException("PM: Unhandled exception", ex);
                Logger.DebugException("An exception occurred while processing a job. It will be re-queued.", ex);

                fetchedJob.Requeue();
                throw;
            }

It looks like your code hangs without any timeout. Please investigate what goes wrong in your code and apply a timeout setting to it.
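As a rough illustration of what applying a timeout inside the job code could look like, here is a minimal sketch. It is not part of Hangfire: `JobTimeout` and `RunWithTimeoutAsync` are hypothetical names, and the sketch assumes the job's work can be expressed as a `Task` that observes a `CancellationToken`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not a Hangfire API: bounds how long a piece of
// job code may run before the caller gives up on it.
public static class JobTimeout
{
    public static async Task RunWithTimeoutAsync(
        Func<CancellationToken, Task> work, TimeSpan timeout)
    {
        if (work == null) throw new ArgumentNullException(nameof(work));

        using (var cts = new CancellationTokenSource())
        {
            Task workTask = work(cts.Token);
            Task delayTask = Task.Delay(timeout, cts.Token);

            // Whichever task completes first decides the outcome.
            Task finished = await Task.WhenAny(workTask, delayTask)
                .ConfigureAwait(false);

            // Signals the work to stop on timeout, or stops the timer on success.
            cts.Cancel();

            if (finished != workTask)
            {
                // The work is only *asked* to stop. If it ignores the token
                // it keeps running in the background, but the caller
                // (and therefore the worker thread) is released.
                throw new TimeoutException(
                    "Job code did not finish within " + timeout + ".");
            }

            // Re-await so that exceptions thrown by the work propagate.
            await workTask.ConfigureAwait(false);
        }
    }
}
```

A synchronous job method could then call something like `JobTimeout.RunWithTimeoutAsync(ct => SendAsync(userId, ct), TimeSpan.FromSeconds(30)).GetAwaiter().GetResult();`, where `SendAsync` stands in for your own code. The thrown `TimeoutException` would reach the `catch (Exception ex)` branch shown above, so the job is re-queued instead of blocking the worker forever. Setting a timeout directly on the underlying network or database call is usually the better fix, since this wrapper cannot forcibly stop code that ignores the token.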

Nope, this is impossible for now.

Thanks @odinserj

I’m convinced our own code is the problem and is causing the workers to lock up. I was hoping there might be some worker monitoring options to recover stuck workers.

This does make my other question more relevant: do you have any recommendations for the code that runs as an enqueued job? Any suggestions would be welcome.

Thanks in advance!
hjm