Jobs being created in database, stuck as Enqueued

ssewell · January 22, 2015, 7:52pm

While performing a simple Enqueue such as:

BackgroundJob.Enqueue(() => Console.WriteLine("Test!"));

HangFire will only attempt to process the job around 60% of the time. Further research shows that the job is being successfully inserted into the HangFire.JobQueue table, but it is never processed.

Looking at the HangFire.JobQueue table, I noticed a pattern: All instances of jobs stuck with the StateName of Enqueued also have an ExpireAt value of NULL.

This query at the results shows the correlation:

select Id, StateId, StateName, CreatedAt, ExpireAt
from HangFire.Job

Id          StateId     StateName            CreatedAt               ExpireAt
----------- ----------- -------------------- ----------------------- -----------------------
1           3           Succeeded            2015-01-22 19:27:27.047 2015-01-23 19:27:27.230
2           4           Enqueued             2015-01-22 19:27:32.553 NULL
3           7           Succeeded            2015-01-22 19:28:09.447 2015-01-23 19:28:09.503
4           10          Succeeded            2015-01-22 19:28:21.517 2015-01-23 19:28:21.553
5           13          Succeeded            2015-01-22 19:28:24.030 2015-01-23 19:28:24.077
6           16          Succeeded            2015-01-22 19:28:25.873 2015-01-23 19:28:25.947
7           17          Enqueued             2015-01-22 19:28:27.943 NULL
8           20          Succeeded            2015-01-22 19:29:59.967 2015-01-23 19:30:00.023
9           23          Succeeded            2015-01-22 19:30:04.363 2015-01-23 19:30:04.393
10          24          Enqueued             2015-01-22 19:30:05.690 NULL
11          27          Succeeded            2015-01-22 19:39:29.937 2015-01-23 19:39:30.067
12          30          Succeeded            2015-01-22 19:39:31.967 2015-01-23 19:39:32.077
13          31          Enqueued             2015-01-22 19:39:33.313 NULL

Any ideas on what’s going on here, or how to troubleshoot this further? It seems like basic test case that’s hitting a high failure rate. I’m running the most recent version of HangFire (v1.3.1) with MSMQ.

odinserj · January 28, 2015, 2:48pm

The following two quotes contain a contradiction:

If you are using MSMQ and your JobQueue table is not empty, there is a misconfiguration. Can you show me your Hangfire configuration code?

ssewell · January 28, 2015, 3:33pm

Sorry. That was a typo on my part. I meant to state that the table Hangfire.Job (not Hangfire.JobQueue) shows the job that was created, but its stuck with the Enqueued status.

odinserj · January 28, 2015, 4:34pm

Expiration time is NULL for enqueued jobs, this is ok. I’ve just created 500000 jobs with MSMQ, and all is fine.

Can you show me the logs produced by Hangfire. Windows Event Log may also contain some interesting records.

ssewell · January 28, 2015, 6:25pm

Here’s my Hangfire dashboard after calling BackgroundJob.Enqueue(() => Console.WriteLine(“Test”)) numerous times:

You can see that 45 calls succeeded, but 27 are stuck in the queue that never clear out.

Using the ColouredConsoleLogProvider, here’s the only Hangfire output I see. Interestingly enough, I’m not seeing any Hangfire output when I’m enqueuing a task (regardless if it successfully completes or not):

    2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker Manager'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Schedule Poller'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Schedule Poller'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Server Watchdog'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Server Watchdog'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Recurring Job Scheduler'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'SQL Records Expiration Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.ServerBootstrapper) Starting server components...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Heartbeat'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Schedule Poller'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Heartbeat' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Server Watchdog'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Schedule Poller' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Recurring Job Scheduler
'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Recurring Job Scheduler' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'SQL Records Expiration
Manager'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'SQL Records Expiration Manager' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #1'...
2015-01-28 12:21:26 [INFO]  (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Server Watchdog' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #2'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting a new thread for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #3'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #1'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Starting server component 'Worker #4'...
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #2'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #1' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #3'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #2' started.
2015-01-28 12:21:26 [TRACE] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Sending start request for server component 'Worker #4'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #3' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.Server.AutomaticRetryServerComponentWrapper) Server component 'Worker #4' started.
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Counter'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Job'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'List'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Set'...
2015-01-28 12:21:26 [DEBUG] (Hangfire.SqlServer.ExpirationManager) Removing outdated records from table 'Hash'...

ssewell · January 28, 2015, 6:36pm

I just configured Serilog to see if I get any different output in the logs. Same result. The windows event log isn’t showing any interesting information as well.

Is there something I have to do to enable Hangfire to log additional details?

odinserj · January 28, 2015, 7:20pm

Different logging libraries will show the same result. Some magic happen there, lets demystify it. Workers take job identifiers from the MSMQ queue. Can you temporary disable Hangfire Server and look to the queue itself to ensure all the job identifiers are written here (you can see them at Computer Management/Services/MSMQ)?

ssewell · January 28, 2015, 7:33pm

Yeah. I expected the same results from the logging mechanism. Just wanted to make sure it wasn’t the logging mechanism itself that was neglecting to capture output.

Let’s see… I’ve disabled the Hangfire server and enqueued 10 tasks. They all appear in the MSMQ “Queue Messages” section:

odinserj · January 28, 2015, 7:35pm

Ok, and after starting the Server… (feeling like we are pulling a rabbit out of a hat)

ssewell · January 28, 2015, 7:38pm

Haha. Interestingly enough. When I start the Hangfire server, it successfully processed all of them.

I then shut the Hangfire server down, enqueued 20 tasks, verified they were in MSMQ, then started the Hangfire server and all 20 processed successfully as well.

So it seems to be the case that tasks only get “stuck” in the queue if they are enqueued while the Hangfire server is running.

ssewell · January 28, 2015, 7:43pm

Wow! I think I figured it out! Believe it or not, it appears to be the logging mechanism!

I can replicate it. If I do NOT provide a logging mechanism. Some of the jobs will get stuck with the “Enqueued”! However, if I provide a logging mechanism every job seems to process successfully!

I’ll continue investigating to see if this accurate.

odinserj · January 28, 2015, 7:46pm

“Always use logging” So, seems like there is a race condition. Looking for a worker possibility to take a job before it switched to the Enqueued state.

ssewell · January 28, 2015, 8:03pm

Hmmm… I think I spoke too soon. I can still get jobs stuck in the queue, even when providing a logging mechanism.

odinserj · January 28, 2015, 8:22pm

I have a modified Hangfire.Core assembly with some more logging: https://www.dropbox.com/s/ayu3kp3qz2cdeo9/HangfireTracing.zip?dl=0 I’ve added some messages with TRACE level to the suspicious places. Can you send me the resulting logs?

odinserj · January 28, 2015, 8:33pm

I’ve found the race condition, both MSMQ and RabbitMQ affected. Will fix it soon.

ssewell · January 28, 2015, 8:36pm

That’s great news! Can’t wait to get my hands on the fix.

Also, I’m using Hangfire in production as well. So I’ll be sure to mention that on the new pinned thread in the forum.

odinserj · January 28, 2015, 9:11pm

Shane, can you check the fix: https://www.dropbox.com/s/k91cqf6atmjbukh/HangfireFixedRace.zip?dl=0 ? I’m not able to reproduce the source bug neither in office, nor at home

The problem was related to a very quick worker that picked up a background job that was not fully initialized for the enqueued state. I’ve fixed it by waiting for a full initialization (with timeout) before trying to change the state to Processing. If timeout expires (or when process is killed), the job simply moves back to the queue.

ssewell · January 28, 2015, 9:12pm

Thanks! I’m trying out the fix now.

ssewell · January 28, 2015, 9:18pm

Fantastic! It works perfectly now.

odinserj · January 28, 2015, 9:21pm

Okay, preparing version 1.3.3 (finally a serious bug I wasn’t aware of ). Thank you for the debugging session!

Topic		Replies	Views
RecurringJob Stuck Enqueued using MSMQ but not in SQL Server bug? msmq , queues	14	5133	April 27, 2015
Jobs stuck in enqueued state bug? msmq , sql-server	9	29222	April 21, 2015
Hangfire job remain in enqueued bug? sql-server , queues	0	1328	January 22, 2017
Jobs stuck in enqued state bug? aspnetcore	0	971	September 6, 2021
Enqueued jobs won't start processing bug? msmq , queues	11	17060	February 7, 2020

Jobs being created in database, stuck as Enqueued

Related topics