Jobs being created in database, stuck as Enqueued

While performing a simple Enqueue such as:

BackgroundJob.Enqueue(() => Console.WriteLine("Test!"));

HangFire will only attempt to process the job around 60% of the time. Further research shows that the job is being successfully inserted into the HangFire.JobQueue table, but it is never processed.

Looking at the HangFire.JobQueue table, I noticed a pattern: All instances of jobs stuck with the StateName of Enqueued also have an ExpireAt value of NULL.

This query at the results shows the correlation:

select Id, StateId, StateName, CreatedAt, ExpireAt
from HangFire.Job

Id          StateId     StateName            CreatedAt               ExpireAt
----------- ----------- -------------------- ----------------------- -----------------------
1           3           Succeeded            2015-01-22 19:27:27.047 2015-01-23 19:27:27.230
2           4           Enqueued             2015-01-22 19:27:32.553 NULL
3           7           Succeeded            2015-01-22 19:28:09.447 2015-01-23 19:28:09.503
4           10          Succeeded            2015-01-22 19:28:21.517 2015-01-23 19:28:21.553
5           13          Succeeded            2015-01-22 19:28:24.030 2015-01-23 19:28:24.077
6           16          Succeeded            2015-01-22 19:28:25.873 2015-01-23 19:28:25.947
7           17          Enqueued             2015-01-22 19:28:27.943 NULL
8           20          Succeeded            2015-01-22 19:29:59.967 2015-01-23 19:30:00.023
9           23          Succeeded            2015-01-22 19:30:04.363 2015-01-23 19:30:04.393
10          24          Enqueued             2015-01-22 19:30:05.690 NULL
11          27          Succeeded            2015-01-22 19:39:29.937 2015-01-23 19:39:30.067
12          30          Succeeded            2015-01-22 19:39:31.967 2015-01-23 19:39:32.077
13          31          Enqueued             2015-01-22 19:39:33.313 NULL

Any ideas on what’s going on here, or how to troubleshoot this further? It seems like basic test case that’s hitting a high failure rate. I’m running the most recent version of HangFire (v1.3.1) with MSMQ.

1 Like

The following two quotes contain a contradiction:

If you are using MSMQ and your JobQueue table is not empty, there is a misconfiguration. Can you show me your Hangfire configuration code?

Sorry. That was a typo on my part. I meant to state that the table Hangfire.Job (not Hangfire.JobQueue) shows the job that was created, but its stuck with the Enqueued status.

Expiration time is NULL for enqueued jobs, this is ok. I’ve just created 500000 jobs with MSMQ, and all is fine.

Can you show me the logs produced by Hangfire. Windows Event Log may also contain some interesting records.

Here’s my Hangfire dashboard after calling BackgroundJob.Enqueue(() => Console.WriteLine(“Test”)) numerous times:

You can see that 45 calls succeeded, but 27 are stuck in the queue that never clear out.

Using the ColouredConsoleLogProvider, here’s the only Hangfire output I see. Interestingly enough, I’m not seeing any Hangfire output when I’m enqueuing a task (regardless if it successfully completes or not):

    2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker Manager'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Schedule Poller'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Schedule Poller'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Watchdog'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Watchdog'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Recurring Job Scheduler'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'SQL Records Expiration Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.ServerBootstrapper) Starting server components...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Schedule Poller'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Heartbeat' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Watchdog'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Schedule Poller' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Recurring Job Scheduler' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'SQL Records Expiration Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #1'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Watchdog' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #2'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #4'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #1' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #3'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #2' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #3' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #4' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Counter'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Job'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'List'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Set'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Hash'...

I just configured Serilog to see if I get any different output in the logs. Same result. The windows event log isn’t showing any interesting information as well.

Is there something I have to do to enable Hangfire to log additional details?

Different logging libraries will show the same result. Some magic happen there, lets demystify it. Workers take job identifiers from the MSMQ queue. Can you temporary disable Hangfire Server and look to the queue itself to ensure all the job identifiers are written here (you can see them at Computer Management/Services/MSMQ)?

Yeah. I expected the same results from the logging mechanism. Just wanted to make sure it wasn’t the logging mechanism itself that was neglecting to capture output. :wink:

Let’s see… I’ve disabled the Hangfire server and enqueued 10 tasks. They all appear in the MSMQ “Queue Messages” section:

Ok, and after starting the Server… (feeling like we are pulling a rabbit out of a hat)

Haha. Interestingly enough. When I start the Hangfire server, it successfully processed all of them.

I then shut the Hangfire server down, enqueued 20 tasks, verified they were in MSMQ, then started the Hangfire server and all 20 processed successfully as well.

So it seems to be the case that tasks only get “stuck” in the queue if they are enqueued while the Hangfire server is running.

Wow! I think I figured it out! Believe it or not, it appears to be the logging mechanism!

I can replicate it. If I do NOT provide a logging mechanism. Some of the jobs will get stuck with the “Enqueued”! However, if I provide a logging mechanism every job seems to process successfully!

I’ll continue investigating to see if this accurate.

“Always use logging” :slight_smile: So, seems like there is a race condition. Looking for a worker possibility to take a job before it switched to the Enqueued state.

Hmmm… I think I spoke too soon. I can still get jobs stuck in the queue, even when providing a logging mechanism.

I have a modified Hangfire.Core assembly with some more logging: https://www.dropbox.com/s/ayu3kp3qz2cdeo9/HangfireTracing.zip?dl=0 I’ve added some messages with TRACE level to the suspicious places. Can you send me the resulting logs?

I’ve found the race condition, both MSMQ and RabbitMQ affected. Will fix it soon.

That’s great news! Can’t wait to get my hands on the fix. :smile:

Also, I’m using Hangfire in production as well. So I’ll be sure to mention that on the new pinned thread in the forum.

Shane, can you check the fix: https://www.dropbox.com/s/k91cqf6atmjbukh/HangfireFixedRace.zip?dl=0 ? I’m not able to reproduce the source bug neither in office, nor at home :frowning:

The problem was related to a very quick worker that picked up a background job that was not fully initialized for the enqueued state. I’ve fixed it by waiting for a full initialization (with timeout) before trying to change the state to Processing. If timeout expires (or when process is killed), the job simply moves back to the queue.

Thanks! I’m trying out the fix now.

Fantastic! It works perfectly now.

Okay, preparing version 1.3.3 (finally a serious bug I wasn’t aware of :smile:). Thank you for the debugging session!