Hey,
We’re seeing excessive SQL Server waits on LCK_M_X with the following query and were wondering if anyone else has seen anything similar. We see this when jobs are running in batches.
Our database runs on Azure SQL, so read committed snapshot isolation is turned on by default; however, this query is essentially disabling it with its hint.
This environment currently uses Hangfire 1.7.17.
@Andrew_Borland can you show me your configuration code related to Hangfire, especially the part that calls the UseSqlServerStorage method?
Hey, sure, we have it hooked up to our configuration management but essentially it is:
.UseSqlServerStorage(() => new SqlConnection(/* connection string */), new SqlServerStorageOptions
{
    SlidingInvisibilityTimeout = TimeSpan.FromMilliseconds(300000),
    QueuePollInterval = TimeSpan.Zero,
    CommandBatchMaxTimeout = null,
    DisableGlobalLocks = true,
    UseRecommendedIsolationLevel = true,
    UsePageLocksOnDequeue = false,
    UseFineGrainedLocks = false
})
We’ve also noticed issues in our staging environment when running batches: we observe poor performance on larger batches (this one, for example, has over 2 million jobs).
Looking into SQL Server, we see there’s a wait resource on an application lock while a single session is running the following query:
(@key nvarchar(4000))select count(*) from [HangFire].[Set] with (readcommittedlock, forceseek) where [Key] = @key
Looking at the locking issue in our staging environment, this appears to be because TryFinishBatch in BatchJobSupportAttribute does its set counting while an application-level lock is held by the caller method (OnStateApplied). This query on the pending-jobs key has a plan cost of 40 and can take from 1 to 3+ seconds to execute.
Just a note on this: we’ve also noticed Redgate SQL monitoring is performing fairly poorly while Hangfire batches run, using ~35% CPU of an 8-core Azure SQL instance. Top queries point towards it gathering wait statistics, and I think it’s related to the LCK_M_X waits we see from Hangfire application locks.
Andrew, first of all thank you for such a comprehensive investigation!
Please try setting the CommandBatchMaxTimeout option to some value like TimeSpan.FromMinutes(5) to issue transactions in almost a single round trip to the database server and avoid the increased transaction lifetime caused by the “chatty” transaction behaviour that ADO.NET uses by default. This will tell Hangfire.SqlServer to use the internal SqlCommandSet class, which knows how to execute multiple SQL commands as one large command.
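To illustrate, here is a minimal sketch based on the configuration you posted, with CommandBatchMaxTimeout set (the connection string is a placeholder):

GlobalConfiguration.Configuration.UseSqlServerStorage(
    () => new SqlConnection(/* your managed connection string */),
    new SqlServerStorageOptions
    {
        SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
        QueuePollInterval = TimeSpan.Zero,
        // A non-null value enables the SqlCommandSet-based batching described above.
        CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
        DisableGlobalLocks = true,
        UseRecommendedIsolationLevel = true,
        UsePageLocksOnDequeue = false
    });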
I will remove the unnecessary READCOMMITTEDLOCK query hints. There are two queries where this hint is important (GetJobData and GetStateData), but for the other queries the hint was added only to reassure the correct behaviour, and now I’m sure it’s unnecessary for them. I will make the changes on Fri or Mon and release them as 1.7.24, likely next week.
Regarding the SELECT COUNT(*) FROM [Set] query, you are right, and in 1.8.0, which will be released this autumn, a new command will be implemented to avoid such expensive queries. Meanwhile you can split large batches into multiple sub-batches by using the Nested Batches feature to keep the size of each batch relatively low (1,000–5,000 jobs in a batch). The SlidingInvisibilityTimeout option will also help in this case.
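As a rough illustration of the sub-batch idea only – the chunk size, the payloads collection and the ProcessJob method are placeholders, and the exact nested-batch call on the parent batch action should be checked against the Hangfire.Pro Batches documentation:

// Split a large job set into nested batches of ~1,000 jobs each,
// so the pending-jobs set of any single batch stays small.
const int chunkSize = 1000;

var parentId = BatchJob.StartNew(parent =>
{
    for (var i = 0; i < payloads.Count; i += chunkSize)
    {
        var chunk = payloads.Skip(i).Take(chunkSize).ToList();

        // Assumed nested-batch call: the parent batch action starts a child batch.
        parent.StartNew(child =>
        {
            foreach (var payload in chunk)
            {
                child.Enqueue(() => ProcessJob(payload));
            }
        });
    }
});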
My production server was very slow over the last few days; this LCK_M_X lock was consuming the whole server, and I had to restart it to restore the memory. It looks like a lot of Hangfire jobs throwing errors and then retrying gave SQL Server a hard time.
So we created a hotfix version using nested batches on staging, and we do see performance improvements, but batches are still causing LCK_M_X waits while normal jobs are significantly faster.
The orange part of the graph is our batch. It’s slower while creating the nested batches (around 500k jobs total for this test, with 1k jobs per nested batch), and then you can see the performance improve once the creation of batches completes. But if we take the blue section of the graph afterwards, these are just normal jobs that are not part of a batch, and they are significantly faster (they ended up all firing at once because their queue is lower priority than the queue processing the batch).
We also see that while the batches are running there are still issues with LCK_M_X, likely from distributed locks over the batch ids (although this is historical data, so I can’t confirm that).
The graphs in our SQL panels have issues with time intervals, so they appear to be missing data, but it’s essentially a continuous graph. All the averages on the SQL labels are half their correct value because of how the missing data points are interpreted.
@Andrew_Borland did you try to set the CommandBatchMaxTimeout option?
Hey,
Yes, this setting was already applied once we reviewed everything.
Thanks for your reply. I’ve also just committed changes to remove the “readcommittedlock” table hint from a lot of queries, and I will release the changes with 1.7.24, hopefully tomorrow. But it’s unlikely to help you, since you already use the CommandBatchMaxTimeout option – these changes will not eliminate the usage of those application locks by batches, because batches, unlike regular jobs, require additional synchronization to access the underlying data structures properly.
We did some tests removing these jobs from batches and just enqueuing them as normal jobs, and now Hangfire processes them faster than we can insert them. While we were getting maybe 500–600 rpm in the previous test with nested batches, we are seeing throughput of around 3,000–3,500 without batches.
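For reference, the no-batch variant we tested is essentially just plain enqueues (a sketch; ProcessJob stands in for our actual job method):

// Each job is enqueued individually, so no batch-level application locks
// or pending-set counting are involved.
foreach (var payload in payloads)
{
    BackgroundJob.Enqueue(() => ProcessJob(payload));
}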
We’ve also been looking into the sliding windows that are part of the ACE library, and it looks like there may be issues here too: when we put a sliding window on these jobs we see LCK_M_X waits and throughput drops, even though the sliding window has been set to a high value that would never hit the point of throttling.
Unfortunately yes, the synchronization I mentioned in the previous response and the overhead required for batches to work degrade performance. I am slowly improving it, and there will be some changes after Hangfire 1.8 and Hangfire.Pro 3.0 are released, where I will tune some queries, but there will be no magical solution.
Also, this is how concurrent data structures work in general – by using locks – and by incurring big penalties when acquiring and releasing them. Even lock-free and wait-free algorithms don’t help when external storages are used, as in Hangfire, because they require even more queries. And better data structures for parallelism haven’t been described yet even on paper, so there’s no silver bullet here.
To avoid LCK_M_X reports, you can only move Hangfire’s database to another server or consider using Redis instead. Those application locks are important to decrease the load on SQL Server and can’t be avoided.
SQL Azure is also slow for writes, and higher plans can help you increase throughput. I’ve also seen that SQL Server on a virtual machine gives better throughput, but unfortunately I don’t know the actual numbers. And Redis in the cloud works much better than an RDBMS here.
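If you decide to try Redis, the storage switch itself is small (a sketch, assuming the Hangfire.Pro.Redis package; the connection string is a placeholder):

// Point Hangfire at Redis instead of SQL Server.
GlobalConfiguration.Configuration
    .UseRedisStorage("your-redis-connection-string");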