Seeing a very serious issue with Hangfire over the last couple of days.
The only difference is that I've started making extensive use of BatchJob.Attach.
Basically, jobs that were supposed to run 5 hours ago aren't getting enqueued. I have ~800+ retries that haven't been enqueued after the first try, even though they should have been.
Basically it looks like the entire queuing system has gone dead.
Trying to find a repro case …
@tejasxp, what storage are you using and what Hangfire and Hangfire.Pro versions do you have? Could you also send some screenshots from your dashboard that may describe your problem?
Ok -
There is no bug, but I think there is a performance bottleneck. I followed the code, and it simply looks like DelayedJobScheduler isn't able to keep up with the rate of job creation, so jobs aren't transitioning from the Scheduled => Enqueued state very fast.
Using Hangfire 1.6.12 + SqlServer Storage (on Azure P11 SQL Server), with Hangfire Pro 2.0.1.
select count(*) from hangfire.[set] where [key] = 'schedule' and score < 1501247561
Shows around ~100K. However, after I killed some of my producers, it's now enqueuing these scheduled jobs really fast.
I’ve added ~200K jobs in the last 12 hours.
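(For anyone reading along: the score in that query appears to be just the job's due time as a Unix timestamp, so the cutoff can be computed with plain .NET. The snippet below is a minimal note to myself and assumes nothing beyond the BCL.)

```csharp
using System;

class ScheduleCutoff
{
    static void Main()
    {
        // The "schedule" set scores appear to be Unix timestamps (seconds since
        // 1970-01-01 UTC) of each job's due time, so a cutoff for "everything that
        // should already have been enqueued" is simply "now" as a Unix timestamp:
        long cutoff = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        Console.WriteLine(cutoff); // e.g. 1501247561 when I ran the query above
    }
}
```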
The strange thing here is that DelayedJobScheduler looks very simple – just an sp_getapplock + fetch the item with the lowest score from the scheduled set – so my guess is that with continuous production of jobs, there is some other hotspot that is slowing this down (potentially due to transactions?)
I did notice your comment in the other thread about TransactionScope using Serializable transactions by default. Maybe this is the reason.
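To paraphrase what I followed in the source, the scheduler's tick is conceptually something like the sketch below. This is my own simplified reading, not the actual implementation; the lock resource name is a placeholder and the storage/state APIs are written from memory, so treat every identifier as an assumption.

```csharp
using System;
using Hangfire;
using Hangfire.States;
using Hangfire.Storage;

static class DelayedSchedulerSketch
{
    // Rough paraphrase of one DelayedJobScheduler tick (not the real code).
    public static void EnqueueNextDueJob(JobStorage storage, IBackgroundJobStateChanger stateChanger)
    {
        using (var connection = storage.GetConnection())
        // A storage-wide lock (sp_getapplock on SQL Server), so only one scheduler polls at a time.
        using (connection.AcquireDistributedLock("schedule-poller", TimeSpan.FromSeconds(5)))
        {
            var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();

            // Take the single due entry with the lowest score from the "schedule" sorted set...
            var jobId = ((JobStorageConnection)connection)
                .GetFirstByLowestScoreFromSet("schedule", 0, now);
            if (jobId == null) return;

            // ...and move it Scheduled => Enqueued. One job per iteration is why a backlog
            // of ~100K scheduled jobs drains so slowly.
            stateChanger.ChangeState(new StateChangeContext(
                storage, connection, jobId, new EnqueuedState(), ScheduledState.StateName));
        }
    }
}
```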
A bit more investigation reveals that there is actually a bottleneck in the DelayedJobScheduler. This is the use-case:
- I schedule 100,000 jobs to be run off a LOW-PRIORITY queue at 9 AM
- I schedule 5 jobs off a HIGH-PRIORITY queue to run at 9:15 AM (they need to run within at most 15 minutes), AFTER I have scheduled the jobs above
The jobs in #2 are now only going to be processed after #1, since the DelayedJobScheduler dequeues the head of the scheduled jobs set one item at a time.
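For concreteness, the scheduling side looks roughly like this (queue names, job classes, and the date are illustrative placeholders, not my real code):

```csharp
using System;
using Hangfire;

public class LowPriorityJobs
{
    [Queue("low_priority")]
    public void Process(int item) { /* bulk work */ }
}

public class HighPriorityJobs
{
    [Queue("high_priority")]
    public void Notify(int id) { /* needs to run within ~15 minutes */ }
}

public static class Scenario
{
    public static void ScheduleAll()
    {
        var nineAm = new DateTimeOffset(2017, 7, 28, 9, 0, 0, TimeSpan.Zero); // example date

        // 1. 100,000 low-priority jobs, due at 9:00, created first.
        for (var i = 0; i < 100000; i++)
            BackgroundJob.Schedule<LowPriorityJobs>(x => x.Process(i), nineAm);

        // 2. 5 high-priority jobs, due at 9:15, created afterwards.
        for (var i = 0; i < 5; i++)
            BackgroundJob.Schedule<HighPriorityJobs>(x => x.Notify(i), nineAm.AddMinutes(15));

        // Because the scheduler drains the "schedule" set strictly by score, one job at a
        // time, the 5 jobs in #2 only become Enqueued after all 100,000 jobs in #1.
    }
}
```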
Any ideas on how I can achieve the intended effect?
So to summarize:
The basic problem is: if I have a million jobs of type A (which eventually go to the low-priority queue) and later 5 jobs of type B (which will run off the default queue), all scheduled for, say, 1 PM, the type B jobs will not get enqueued until all the type A jobs are enqueued.
Here are some initial ideas:
- Use 2 Hangfire instances – technically not possible from what I understand, as configuration is global. Not ideal.
(UNRELATED wish-list item: Hangfire configuration should not be global, but scoped to a single HangfireClient entity)
- Create another filter interface, say IDelayedJobPriorityFilter, that returns a byte priority (default 0), which can then be encoded into the score value alongside the timestamp, as (timestamp + (255 - priority)/1000); see the sketch after this list.
Of course, this would be independent of queues, since queues are only resolved when the job transitions to the Enqueued state. However, I could probably calculate this value by reading the Queue attribute within my application code, so maybe it's not a big deal.
- Run one DelayedJobScheduler per queue name the server listens to. This requires the queue name to be resolved before the job hits the Enqueued state, which seems like a much bigger change, but it also means the queues can drain independently.
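To make the second idea concrete, here's a rough sketch of the encoding I have in mind. IDelayedJobPriorityFilter is hypothetical (it does not exist in Hangfire today), and the snippet only shows the score arithmetic, not where it would hook into the pipeline:

```csharp
using System;

// Hypothetical interface -- not part of Hangfire. A filter would return a byte
// priority, where 255 is most urgent and 0 (the default) is least urgent.
public interface IDelayedJobPriorityFilter
{
    byte GetPriority(); // a real version would take some job-creation context
}

public static class ScheduleScore
{
    // Keep the integer part as the Unix timestamp (so ordering by due time is preserved)
    // and push the priority into the fractional part: higher priority => smaller fraction
    // => dequeued first among jobs that are due at the same second.
    public static double Encode(long unixTimestamp, byte priority)
        => unixTimestamp + (255 - priority) / 1000.0;
}

// Example: two jobs both due at 1 PM UTC (same timestamp); the priority-200 job sorts first.
//   ScheduleScore.Encode(1501246800, 200) == 1501246800.055
//   ScheduleScore.Encode(1501246800, 0)   == 1501246800.255
```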
Curious to hear thoughts.