We have a Hangfire instance running on Hangfire 1.7.9 (using an Ace license, as far as I’m aware). The instance is backed by a SQL Server database and has 7 servers hooked in (each with 32 workers, and set up for all queues).
When we have a large backlog on one of the queues (~130k), and no items in the other queues, I would expect to see our processing count sitting at close to 224 (7 servers * 32 workers). However, we rarely see the Processing section climb above 150, and it normally sits around 120.
I appreciate that Processing doesn’t necessarily paint the complete picture, as it probably doesn’t account for things like polling for the next job, etc.
Is there a better way to work out what the workers are actually doing, and whether we’re being bottlenecked somewhere?
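One way to get a rougher-grained view than the dashboard might be to query Hangfire's monitoring API directly. This is only a sketch: it assumes JobStorage.Current has already been configured in the process that runs it, and the class name, method name and queueName parameter are made up for illustration.

```csharp
using System;
using Hangfire;
using Hangfire.Storage;

public static class WorkerSnapshot
{
    // Prints a point-in-time view of queue depth and worker usage.
    // queueName is whichever queue holds the backlog (hypothetical parameter).
    public static void Print(string queueName)
    {
        IMonitoringApi monitor = JobStorage.Current.GetMonitoringApi();

        // Sum the worker counts reported by every registered server.
        long totalWorkers = 0;
        foreach (var server in monitor.Servers())
        {
            totalWorkers += server.WorkersCount;
            Console.WriteLine($"{server.Name}: {server.WorkersCount} workers, last heartbeat {server.Heartbeat}");
        }

        // Compare queue depth with the number of jobs actually in progress.
        Console.WriteLine($"Enqueued:   {monitor.EnqueuedCount(queueName)}");
        Console.WriteLine($"Fetched:    {monitor.FetchedCount(queueName)}");
        Console.WriteLine($"Processing: {monitor.ProcessingCount()} of {totalWorkers} workers");
    }
}
```

Sampling this every few seconds during a load test should show whether the Processing count stays well below the total worker count while the queue still has a backlog, which would point at job fetching rather than job execution as the bottleneck.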
First guess would be you’re hitting the default limit for threads per process in IIS.
We’re running the actual workers off a Windows service, rather than from within the Web API project. Is that likely to hit the same limit, or an equivalent limit in the Windows service?
No, there shouldn’t be a (relevant) thread limit in a Windows service like there would be for IIS. I doubt you’re hitting a problem where “workers are doing other things”, but you might be hitting database query limits for updating state and pulling new jobs.
How long are your Jobs processing for?
Looking at the logs, they seem to be sub 1 second, usually 1-200ms
So, the reason we get 100k jobs in the queue is because we’re running a peak load test through the system, which generates around 30 odd RPS. It starts to back up and just climbs.
My original thinking was that because the jobs table is high traffic (in and out), they’re just all stepping on each other. That does seem to be visible in the db, with many sessions getting locked by others.
However, after the peak load completes, there’s little to no traffic on the system, so Hangfire has its pick. I don’t see the processing count climb above 150, maybe 170 occasionally.
We managed to figure out what was causing the backlog. We’re using v1.7, but hadn’t seen the recommended changes in the config:
UseRecommendedIsolationLevel = true
UsePageLocksOnDequeue = true
DisableGlobalLocks = true
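For anyone landing here later, these are properties on SqlServerStorageOptions. A minimal sketch of how they might be wired up, assuming the Hangfire.SqlServer package; the HangfireSetup class and the connectionString parameter are placeholders, not our actual code:

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;

public static class HangfireSetup
{
    // connectionString is a placeholder; pass whatever your service already uses.
    public static void Configure(string connectionString)
    {
        GlobalConfiguration.Configuration.UseSqlServerStorage(
            connectionString,
            new SqlServerStorageOptions
            {
                // The three settings listed above.
                UseRecommendedIsolationLevel = true,
                UsePageLocksOnDequeue = true,
                DisableGlobalLocks = true
            });
    }
}
```

As far as I recall from the Hangfire docs, some of these options depend on the newer SQL schema version that ships with 1.7, so it is worth checking that the schema migration has run before enabling them.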
I ran a performance test on Friday with these new settings. Previously, in the test env, I had a 19k backlog after an hour. With these settings enabled, I had none.
Nice work, glad you came to a resolution!