SQL Server Waits - LCK_M_X on JobParameter table

Hey,

We’re seeing excessive SQL Server waits on LCK_M_X with the following query and were wondering if anyone else has seen anything similar. We see this when jobs are running in batches.

Our database runs in Azure SQL, so read committed snapshot isolation is turned on by default; however, this query essentially disables it with its hint.

This environment currently uses Hangfire 1.7.17.

@Andrew_Borland can you show me your configuration code related to Hangfire, especially the part that’s used when calling the UseSqlServerStorage method?

Hey, sure, we have it hooked up to our configuration management, but essentially it is:

                // connectionString comes from our configuration management
                .UseSqlServerStorage(() => new SqlConnection(connectionString), new SqlServerStorageOptions
                {
                    SlidingInvisibilityTimeout = TimeSpan.FromMilliseconds(300000),
                    QueuePollInterval = TimeSpan.Zero,
                    CommandBatchMaxTimeout = null,
                    DisableGlobalLocks = true,
                    UseRecommendedIsolationLevel = true,
                    UsePageLocksOnDequeue = false,
                    UseFineGrainedLocks = false
                })

We’ve also noticed issues in our staging environment when running batches: we observe poor performance on larger batches (this one, for example, has over 2 million jobs).

Looking into SQL Server, we see there’s a wait resource on an application lock while a single session is running the following query:

(@key nvarchar(4000))select count(*) from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key

Looking at the locking issue in our staging environment, this appears to be because TryFinishBatch in BatchJobSupportAttribute does its set counting with an application-level lock in place from the caller method (OnStateApplied). This query on the pending jobs key has a plan cost of 40 and can take from 1 to 3+ seconds to execute.
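Just to illustrate the pattern for anyone following along (this is not Hangfire’s actual code; the resource name, key and helper method are made up), counting a set while an exclusive application lock is held looks roughly like this from ADO.NET, and any other session requesting the same resource via sp_getapplock waits on LCK_M_X until the transaction commits:

    // Rough sketch only, not Hangfire's internals; resource and key names are hypothetical.
    using System.Data;
    using System.Data.SqlClient;

    static int CountSetUnderAppLock(string connectionString, string key)
    {
        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using var transaction = connection.BeginTransaction();

        // Take an exclusive application lock owned by the transaction; other sessions
        // asking for the same resource will wait on LCK_M_X until we commit.
        using (var getLock = new SqlCommand("sp_getapplock", connection, transaction)
            { CommandType = CommandType.StoredProcedure })
        {
            getLock.Parameters.AddWithValue("@Resource", "HangFire:batch:" + key); // hypothetical name
            getLock.Parameters.AddWithValue("@LockMode", "Exclusive");
            getLock.Parameters.AddWithValue("@LockTimeout", 30000);
            getLock.ExecuteNonQuery();
        }

        // The expensive part: the set is counted while the application lock is still held.
        using var count = new SqlCommand(
            "select count(*) from [HangFire].[Set] where [Key] = @key",
            connection, transaction);
        count.Parameters.AddWithValue("@key", key);
        var pending = (int)count.ExecuteScalar();

        transaction.Commit(); // committing releases the application lock
        return pending;
    }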

Just a note on this: we’ve also noticed Redgate SQL monitoring performs fairly poorly while Hangfire batches run, using ~35% CPU of an 8-core Azure SQL instance. Top queries point towards it gathering wait statistics, and I think it’s related to the LCK_M_X waits we see from Hangfire app locks.

Andrew, first of all thank you for such a comprehensive investigation!

Please try setting CommandBatchMaxTimeout to some value like TimeSpan.FromMinutes(5) to issue transactions in almost one roundtrip to the database server and avoid the increased transaction lifetime caused by the “chatty” transaction behaviour that ADO.NET uses by default. This will tell Hangfire.SqlServer to use the internal SqlCommandSet class, which knows how to execute multiple SQL commands as a single large command.
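For example, something like this, based on the options you posted above (connectionString is just a placeholder for your own value):

    // Minimal sketch: same options as above, but with CommandBatchMaxTimeout enabled
    // so Hangfire.SqlServer batches commands through SqlCommandSet.
    GlobalConfiguration.Configuration
        .UseSqlServerStorage(() => new SqlConnection(connectionString), new SqlServerStorageOptions
        {
            SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
            QueuePollInterval = TimeSpan.Zero,
            CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
            DisableGlobalLocks = true,
            UseRecommendedIsolationLevel = true
        });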

I will remove the unnecessary READCOMMITTEDLOCK query hints. There are two queries where this hint is important (GetJobData and GetStateData); for the other queries the hint was only added to reassure correct behaviour, and I’m now sure it’s unnecessary there. I will make the changes on Fri or Mon and release them as 1.7.24, likely next week.

Regarding the SELECT COUNT(*) FROM [Set] query, you are right, and in 1.8.0, which will be released this autumn, a new command will be implemented to avoid such expensive queries. Meanwhile you can split large batches into multiple sub-batches by using the Nested Batches feature to keep the size of each batch relatively low (1000-5000 jobs in a batch). SlidingInvisibilityTimeout will also help in this case.
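Something along these lines, for example (just a sketch: ProcessItem, IItemProcessor and itemIds are placeholders, and the nested StartNew call follows the shape shown in the Hangfire Pro nested batches documentation):

    // Split one huge batch into nested sub-batches of roughly 1,000 jobs each.
    var batchId = BatchJob.StartNew(parent =>
    {
        foreach (var chunk in itemIds.Chunk(1000)) // Enumerable.Chunk, .NET 6+
        {
            parent.StartNew(nested =>
            {
                foreach (var id in chunk)
                {
                    nested.Enqueue<IItemProcessor>(x => x.ProcessItem(id));
                }
            });
        }
    });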

My production server was very slow over the last few days; this LCK_M_X lock was consuming the whole server, and I had to restart the server to restore the memory. It looks like a lot of Hangfire jobs throwing errors and then retrying gave SQL a hard time.

So we created a hotfix version using nested batches on staging, and we do see performance improvements, but batches are still causing LCK_M_X waits and normal jobs are significantly faster.

The orange part of the graph is our batch, and it’s slower while creating the nested batches (around 500k jobs total for this test, with 1k jobs per nested batch); then you can see the performance improves once the creation of batches completes. But if we take the blue section of the graph afterwards, these are just normal jobs that are not part of a batch, and they are significantly faster (they ended up all firing at once, as their queue is lower priority than the queue processing the batch).

We also see that while the batches are running there are still issues with LCK_M_X, likely from distributed locks over the batch IDs (although this is historical data, so I can’t confirm that).



The graphs in our SQL panels have issues with time intervals, so they appear to be missing data, but it’s essentially a continuous graph. All the averages on the SQL labels are half of their correct value because of how the missing data points are interpreted.

@Andrew_Borland did you try to set the CommandBatchMaxTimeout option?

Hey,

Yes, this setting was already applied once we reviewed everything.

Thanks for your reply. I’ve also just committed changes to remove the “readcommittedlock” table hint from a lot of queries, and I will release them with 1.7.24, hopefully tomorrow. But this is unlikely to help you, since you already use the CommandBatchMaxTimeout option; the changes will not eliminate the use of those application locks by batches, because batches, unlike regular jobs, require additional synchronization to properly access the underlying data structures.

We did some tests removing this job from batches and just enqueueing it as a normal job, and now Hangfire processes jobs faster than we can insert them. While we were getting maybe 500-600 rpm in the previous test with nested batches, we are now seeing throughput of around 3000-3500 without batches.
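For reference, the non-batch path we tested is essentially just plain enqueueing (the job interface and method are placeholders):

    // No batch coordination and no per-batch application locks, just regular jobs.
    foreach (var id in itemIds)
    {
        BackgroundJob.Enqueue<IItemProcessor>(x => x.ProcessItem(id));
    }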

We’ve also been looking into the sliding windows that are part of the Ace library, and it looks like there may be issues there too: when we have a sliding window on these jobs we see LCK_M_X waits and throughput drops, even though the sliding window has been set to a high value that would never reach the point of throttling.

Unfortunately yes, the synchronization I described in the previous response and the overhead required for batches to work degrade the performance. I am slowly improving it, and there will be some changes after Hangfire 1.8 and Hangfire.Pro 3.0 are released, where I will tune some queries, but there will be no magical solution.

Also, this is how concurrent data structures work in general: by using locks, and by incurring big penalties when acquiring and releasing them. Even lock-free and wait-free algorithms don’t help when external storages are used in Hangfire, since they require even more queries. And better data structures for parallelism haven’t been devised yet, even on paper, so there’s no silver bullet here.

To avoid the LCK_M_X reports, you can only move Hangfire’s database to another server or consider using Redis instead. Those application locks are important for decreasing the load on SQL Server and can’t be avoided.

SQL Azure is also slow for writes, and higher plans can help you increase write throughput. I’ve also seen that SQL Server on a virtual machine gives better throughput, but unfortunately I don’t know the actual numbers. And Redis in the cloud works much better than an RDBMS.
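If you decide to try Redis, the storage switch itself is small on the configuration side; a sketch assuming the commercial Hangfire.Pro.Redis package (the connection string is a placeholder):

    // Sketch only: point Hangfire at Redis instead of SQL Server.
    GlobalConfiguration.Configuration
        .UseRedisStorage("your-redis-host:6379");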