"Expired" enqueued jobs are blocking new jobs


#1

Setup: Hangfire 1.6.17
We are using SQL Server and MSMQ.

Our Hangfire environment has gotten into a bad state. On the dashboard, we see this:

Note the “15/6” next to the “Enqueued” label, even though there are 659 items in the MEDIUMPRIORITY queue. When you drill into this queue, you see that all but 15 of them are marked “Job expired”:

If you look in the DB, there is no record in the Job table for any of the expired ones.

When we enqueue new items on the MEDIUMPRIORITY queue, they go to the back of the queue and are blocked. The “expired” (non-existent) job items are being cleared out of the queue very slowly, so they hold up the new job items for hours until all of the expired ones are gone.

So my questions are:
-How could this have happened?
-Where is Hangfire sourcing these 659 job items from, if they are not in the DB?
-How can we clear these “expired” items out of whatever storage they are in, so that newly enqueued jobs aren’t blocked and can be processed immediately?

Thanks!
Simon


#2

OK, I have been able to resolve this issue, although I still can’t explain how it happened.

I had to delete all of the MSMQ items in the queue that were referencing a job that did not exist in the hangfire.job table.
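
The check itself amounts to a set difference between the job ids referenced in the queue and the ids present in the job table. A minimal sketch of that logic in Python, not the actual cleanup that was run: `split_orphaned` is a hypothetical helper, and reading the ids out of MSMQ and SQL Server is assumed to happen elsewhere.

```python
from typing import Iterable, List, Set, Tuple


def split_orphaned(
    queued_ids: Iterable[str], existing_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition queued job ids into (live, orphaned), preserving queue order.

    queued_ids:   job ids referenced by messages in the queue (assumed input).
    existing_ids: job ids that currently have a row in the job table (assumed input).
    """
    live: List[str] = []
    orphaned: List[str] = []
    for job_id in queued_ids:
        if job_id in existing_ids:
            live.append(job_id)
        else:
            # No matching job row: this queue entry shows up as "Job expired".
            orphaned.append(job_id)
    return live, orphaned
```

Only the messages whose ids land in the second list would be candidates for deletion; the live ones stay in the queue in their original order.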


#3

That’s because each worker waits for some time before removing a background job identifier from a queue when the job itself can’t be found. This behavior was added because in some cases (especially when using MSMQ + SQL Azure) the enqueue operation can be performed before the transaction is fully committed.

If the job storage doesn’t support linearizable reads (i.e. its reads don’t block on a pending transaction), then a null value is returned when trying to fetch a background job. The problem is that we can’t distinguish between two cases: the job has already expired for some reason, or the corresponding transaction will be committed in a few moments.

Starting from 1.7.0, it’s possible to specify that the storage supports linearizable reads (READCOMMITTEDLOCK is used for SQL Server), and in this case workers will not wait on non-existing jobs.
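
To make the difference concrete, here is a rough sketch in Python of the decision described above. It is not Hangfire's code: `fetch_or_discard`, `read_job`, and the attempt and delay values are all hypothetical and only illustrate the idea. Without linearizable reads a missing job is re-read a few times before its queue entry is dropped. With them a single read is treated as authoritative.

```python
import time
from typing import Callable, Optional


def fetch_or_discard(
    job_id: str,
    read_job: Callable[[str], Optional[dict]],
    linearizable_reads: bool,
    max_attempts: int = 3,        # hypothetical value
    delay_seconds: float = 0.01,  # hypothetical value
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return the job record, or None if the queue entry should be discarded."""
    # With linearizable reads, one read is enough: a missing job is really gone.
    attempts = 1 if linearizable_reads else max_attempts
    for attempt in range(attempts):
        job = read_job(job_id)
        if job is not None:
            return job
        # The job may still be in an uncommitted transaction, so wait and retry.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

In this sketch a queue full of entries for jobs that no longer exist costs every worker the full wait per entry when `linearizable_reads` is false, which matches the slow draining described in the first post.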