Why does Hangfire pause checking for jobs from sqlserver for several seconds?

roryk · July 29, 2020, 8:37pm

I’ve inherited a system that uses Hangfire with sql server job storage. Usually when a job is scheduled to be run immediately we notice it takes a few seconds before it’s triggered.

Looking at SQL Profiler when running in my dev environment, the SQL run against Hangfire db looks like this:

exec sp_executesql N'delete top (1) JQ
output DELETED.Id, DELETED.JobId, DELETED.Queue
from [HangFire].JobQueue JQ with (readpast, updlock, rowlock, forceseek)
where Queue in (@queues1) and (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))',N'@queues1 nvarchar(4000),@timeout float',@queues1=N'MYQUEUENAME_master',@timeout=-1800

-- Exactly the same SQL as above is executed about 6 times/second for about 3-4 seconds,
-- then nothing for about 2 seconds, then: 

exec sp_getapplock @Resource=N'HangFire:recurring-jobs:lock',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_getapplock @Resource=N'HangFire:locks:schedulepoller',@DbPrincipal=N'public',@LockMode=N'Exclusive',@LockOwner=N'Session',@LockTimeout=5000
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'recurring-jobs',@from=0,@to=1596053348
exec sp_executesql N'select top (@count) Value from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key and Score between @from and @to order by Score',N'@count int,@key nvarchar(4000),@from float,@to float',@count=1000,@key=N'schedule',@from=0,@to=1596053348
exec sp_releaseapplock @Resource=N'HangFire:recurring-jobs:lock',@LockOwner=N'Session'
exec sp_releaseapplock @Resource=N'HangFire:locks:schedulepoller',@LockOwner=N'Session'

-- Then nothing is executed for about 8-10 seconds, then: 

exec sp_executesql N'update [HangFire].Server set LastHeartbeat = @now where Id = @id',N'@now datetime,@id nvarchar(4000)',@now='2020-07-29 20:09:19.097',@id=N'ps12345:19764:fe362d1a-5ee4-4d97-b70d-134fdfab2b87'

-- Then about 500ms-2s later I get 
exec sp_executesql N'delete top (1) JQ ... -- i.e. Same as first query

The update LastHeartbeat query is only there every second time (from just a brief inspection, maybe that’s not exactly right).

It looks like there’s at least 3 threads running the DELETE query against JQ, since I can see several RPC:Starting before the RPC:Completed, suggesting they’re being executed in parallel instead of sequentially.
I don’t know if that’s normal but seems weird as I thought we had just one ‘consumer’ of the jobs.

I only have one Queue in my dev environment, although in live we’d have 20-50 I’d guess.

Any suggestions on where I should look for the configuration that’s causing:
a) the 8-10s pause between checking for jobs
b) the number of threads that are checking for jobs - it seems like I have too many

thanks,

Rory

roryCT · August 10, 2020, 8:57pm

It turned out we were configuring the SqlServerStorageOptions after calling app.UseHangfireServer(), which unsurprisingly means it’s left at the default. Which for sql server is to poll every 15 seconds.

The number of threads was to do with calling app.UseHangfireServer() a bunch of times instead of just once.

PS: it’s really annoying on this forum that you can’t edit posts (too similar to existing post) or seemingly reply to your own post.

roryk · August 21, 2020, 1:12pm

UPDATE (not solved):

I was running an old version of Hangfire, 1.5.x, so I’ve now updated to 1.7.12, upgraded my database, and changed to use this initialisation config:

        app.UseHangfireDashboard();

        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connstring, new SqlServerStorageOptions
            {
                CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
                QueuePollInterval = TimeSpan.Zero,
                SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                UseRecommendedIsolationLevel = true,
                PrepareSchemaIfNecessary = true, // Default value: true
                EnableHeavyMigrations = true     // Default value: false
            })
            .UseAutofacActivator(_container);
        JobActivator.Current = new AutofacJobActivator(_container);

The problem remains, or in fact seems to be worse: now I get 20 calls to the delete job... query but they all happen within about 1s. Then the other queries, then a pause for about 15s, then it all repeats. What’s it doing during that time!? Surely must be something weird with my configuration but I’m not sure where to look.

mellamokb · November 2, 2021, 9:12pm

Just ran into this exact same issue today. It seemed to result from very high job creation/enqueue throughput from another background job, which leads to long delays locking on that Delete query and timeouts. Ultimately it lead to a bunch of error messages downstream causing failures to enqueue new jobs.

We are currently on 1.7.19 experiencing the issue. Not sure if it has since been fixed?

roryk · November 2, 2021, 9:54pm

My problem was not a Hangfire bug, we were just using it wrong: configuring the SqlServerStorageOptions after calling app.UseHangfireServer() , which unsurprisingly means it’s left at the default. Which for sql server is to poll every 15 seconds.

Good luck with your problem. Perhaps the new defaults for how the new jobs are queried will help, if it’s a locking/concurrency problem you’re getting.

/Rory

saleh_ibrahim · November 29, 2021, 4:41pm

Dear @roryk ,
I have a same problem ,please if you find correct answer mentioned me or send me in my email salehibrahim453621@gmail.com

roryk · December 1, 2021, 2:03pm

In my case we were configuring the SqlServerStorageOptions after calling app.UseHangfireServer() , which unsurprisingly means it’s left at the default. Which for sql server is to poll every 15 seconds.

If you use the example configuration code from Hangfire it should work fine, i.e. respect the polling interval you set.

This is approximately what we have now in aspnet core solution, but we do some funny things with connectionString and queue names so not the best example to follow:

            services.AddHangfire(config => config
                                    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
                                    .UseSimpleAssemblyNameTypeSerializer()
                                    .UseSqlServerStorage(connectionString), new SqlServerStorageOptions
                                    {
                                        QueuePollInterval = new TimeSpan(0, 0, 0, 0, 500)
                                    })
            );

            services.AddHangfireServer(options =>
            {
                options.Queues = listOfQueueNames;
                options.WorkerCount = 20;
            });

Topic		Replies	Views
Hangfire timing out on insert into jobQueue table question sql-server	1	902	November 11, 2021
Hangfire slow jobs processing and JobQueue table seems locked up question sql-server	0	161	August 8, 2024
Potentially long running Job get cancelled after 30 minutes bug?	0	522	January 7, 2024
How to extend time out on recurring job? question recurring , sql-server	0	3045	October 7, 2021
Multiple workers on one job bug?	11	4983	January 30, 2020

Why does Hangfire pause checking for jobs from sqlserver for several seconds?

Related topics