We have a Hangfire job that enqueues many child jobs inside a single batch. Lately the batch exceeds 300K jobs in total.
We’re using SQL Server for storage, and have applied the additional settings detailed here to boost our Hangfire performance overall (which generally works pretty well).
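For reference, the tuning we applied is essentially the recommended `SqlServerStorageOptions` configuration. This is a sketch rather than our exact production values, and `connectionString` is a placeholder:

```csharp
// Sketch of the SqlServerStorageOptions tuning we applied
// (values are illustrative, not our exact production settings).
GlobalConfiguration.Configuration.UseSqlServerStorage(connectionString,
    new SqlServerStorageOptions
    {
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero,     // long polling instead of periodic polls
        UseRecommendedIsolationLevel = true,
        DisableGlobalLocks = true              // avoid DB-wide applocks where possible
    });
```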
Outside of a batch, I can see job throughput (with 20 servers × 18 workers per server) hit around 11.5K jobs per minute. Within the batch, we typically top out at 1K jobs per minute (sometimes less; only ~750 on the last few runs).
Our devs’ investigations have shown that we’re seeing a large number of LCK_M_X waits coming from Hangfire.
Further digging into the decompiled code indicated that every job in the batch executes sp_getapplock during the TryFinishBatch method. In our staging environment (where we’re testing with ~2.7M jobs), these locks take 1–4 seconds to acquire.
We’re looking into switching these jobs to Redis, but it’s going to take a while to get the code ready and deployed. In the meantime, we’re stuck with a job that doesn’t come close to maxing out our throughput and takes many hours to complete (around 8.5 hours at 1K JPM).
Is there anything we can do to unlock additional throughput without needing to deploy new code?
We’re using Hangfire 1.7.17.