Why does Hangfire pause checking for jobs from sqlserver for several seconds?

Tags: #<Tag:0x00007efd23d97a70>

I’ve inherited a system that uses Hangfire with sql server job storage. Usually when a job is scheduled to be run immediately we notice it takes a few seconds before it’s triggered.

Looking at SQL Profiler when running in my dev environment, the SQL run against Hangfire db looks like this:

exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800

-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then: 

exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'

-- Then nothing is executed for about 8-10 seconds, then: 

exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'

-- Then about 500ms-2s later I get 
exec sp_executesql N'delete top (1) JQ ... -- i.e. Same as first query

The update LastHeartbeat query is only there every second time (from just a brief inspection, maybe that’s not exactly right).

It looks like there’s at least 3 threads running the DELETE query against JQ, since I can see several RPC:Starting before the RPC:Completed, suggesting they’re being executed in parallel instead of sequentially.
I don’t know if that’s normal but seems weird as I thought we had just one ‘consumer’ of the jobs.

I only have one Queue in my dev environment, although in live we’d have 20-50 I’d guess.

Any suggestions on where I should look for the configuration that’s causing:
a) the 8-10s pause between checking for jobs
b) the number of threads that are checking for jobs - it seems like I have too many

thanks,

Rory

It turned out we were configuring the SqlServerStorageOptions after calling app.UseHangfireServer(), which unsurprisingly means it’s left at the default. Which for sql server is to poll every 15 seconds.

The number of threads was to do with calling app.UseHangfireServer() a bunch of times instead of just once.

PS: it’s really annoying on this forum that you can’t edit posts (too similar to existing post) or seemingly reply to your own post.

UPDATE (not solved):

I was running an old version of Hangfire, 1.5.x, so I’ve now updated to 1.7.12, upgraded my database, and changed to use this initialisation config:

        app.UseHangfireDashboard();

        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true, // Default value: true
                EnableHeavyMigrations = true     // Default value: false
            })
            .UseAutofacActivator(_container);
        JobActivator.Current = new AutofacJobActivator(_container);

The problem remains, or in fact seems to be worse: now I get 20 calls to the delete job... query but they all happen within about 1s. Then the other queries, then a pause for about 15s, then it all repeats. What’s it doing during that time!? Surely must be something weird with my configuration but I’m not sure where to look.