I’ve inherited a system that uses Hangfire with sql server job storage. Usually when a job is scheduled to be run immediately we notice it takes a few seconds before it’s triggered.
Looking at SQL Profiler when running in my dev environment, the SQL run against Hangfire db looks like this:
exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800
-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then:
exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'
-- Then nothing is executed for about 8-10 seconds, then:
exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'
-- Then about 500ms-2s later I get
exec sp_executesql N'delete top (1) JQ ... -- i.e. Same as first query
The update LastHeartbeat query is only there every second time (from just a brief inspection, maybe that’s not exactly right).
It looks like there’s at least 3 threads running the DELETE query against JQ, since I can see several RPC:Starting before the RPC:Completed, suggesting they’re being executed in parallel instead of sequentially.
I don’t know if that’s normal but seems weird as I thought we had just one ‘consumer’ of the jobs.
I only have one Queue in my dev environment, although in live we’d have 20-50 I’d guess.
Any suggestions on where I should look for the configuration that’s causing:
a) the 8-10s pause between checking for jobs
b) the number of threads that are checking for jobs - it seems like I have too many
thanks,
Rory