HangFire will only attempt to process the job around 60% of the time. Further research shows that the job is being successfully inserted into the HangFire.JobQueue table, but it is never processed.
Looking at the HangFire.JobQueue table, I noticed a pattern: All instances of jobs stuck with the StateName of Enqueued also have an ExpireAt value of NULL.
Any ideas on what’s going on here, or how to troubleshoot this further? It seems like basic test case that’s hitting a high failure rate. I’m running the most recent version of HangFire (v1.3.1) with MSMQ.
Sorry. That was a typo on my part. I meant to state that the table Hangfire.Job (not Hangfire.JobQueue) shows the job that was created, but its stuck with the Enqueued status.
Here’s my Hangfire dashboard after calling BackgroundJob.Enqueue(() => Console.WriteLine(“Test”)) numerous times:
You can see that 45 calls succeeded, but 27 are stuck in the queue that never clear out.
Using the ColouredConsoleLogProvider, here’s the only Hangfire output I see. Interestingly enough, I’m not seeing any Hangfire output when I’m enqueuing a task (regardless if it successfully completes or not):
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker Manager'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Schedule Poller'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Schedule Poller'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Watchdog'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Watchdog'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Recurring Job Scheduler'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'SQL Records Expiration Manager'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.ServerBootstrapper) Starting server components...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker Manager'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Schedule Poller'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Heartbeat' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Watchdog'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Schedule Poller' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Recurring Job Scheduler' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'SQL Records Expiration Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #1'...
2015-01-28 12:21:26 [INFO] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Watchdog' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #2'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #4'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #1' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #3'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #2' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #3' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #4' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Counter'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Job'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'List'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Set'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Hash'...
I just configured Serilog to see if I get any different output in the logs. Same result. The windows event log isn’t showing any interesting information as well.
Is there something I have to do to enable Hangfire to log additional details?
Different logging libraries will show the same result. Some magic happen there, lets demystify it. Workers take job identifiers from the MSMQ queue. Can you temporary disable Hangfire Server and look to the queue itself to ensure all the job identifiers are written here (you can see them at Computer Management/Services/MSMQ)?
Yeah. I expected the same results from the logging mechanism. Just wanted to make sure it wasn’t the logging mechanism itself that was neglecting to capture output.
Let’s see… I’ve disabled the Hangfire server and enqueued 10 tasks. They all appear in the MSMQ “Queue Messages” section:
Haha. Interestingly enough. When I start the Hangfire server, it successfully processed all of them.
I then shut the Hangfire server down, enqueued 20 tasks, verified they were in MSMQ, then started the Hangfire server and all 20 processed successfully as well.
So it seems to be the case that tasks only get “stuck” in the queue if they are enqueued while the Hangfire server is running.
Wow! I think I figured it out! Believe it or not, it appears to be the logging mechanism!
I can replicate it. If I do NOT provide a logging mechanism. Some of the jobs will get stuck with the “Enqueued”! However, if I provide a logging mechanism every job seems to process successfully!
I’ll continue investigating to see if this accurate.
“Always use logging” So, seems like there is a race condition. Looking for a worker possibility to take a job before it switched to the Enqueued state.
The problem was related to a very quick worker that picked up a background job that was not fully initialized for the enqueued state. I’ve fixed it by waiting for a full initialization (with timeout) before trying to change the state to Processing. If timeout expires (or when process is killed), the job simply moves back to the queue.