Why does Hangfire.SqlServer use the Read Committed isolation level for reads?

I have been having some scalability issues with Hangfire.SqlServer. I’ve tracked this down to the following scenario:

  1. A SQL transaction that spans, say, 10–15 seconds creates a Hangfire job.
  2. This blocks many of Hangfire’s select statements, which explicitly use the READCOMMITTED isolation level, until the transaction completes. Note that I’m on SQL Azure, where READCOMMITTEDSNAPSHOT is turned on.

Of course, it’s not ideal for my transactions to last that long – I’m working on fixing that – but I’m curious what dynamic causes Hangfire to require READCOMMITTED.

What disastrous things would happen if we left this at the default of READCOMMITTEDSNAPSHOT?

Hangfire uses the READ COMMITTED isolation level to serialize access to background jobs, preventing it from fetching stale data. This is the correct behavior, and it is required for processing to work correctly, so outstanding transactions block all the queries related to background processing. Moreover, it uses the READCOMMITTEDLOCK table hint to always apply shared locks for the same purpose, even if the “read committed snapshot” setting is enabled. Background monitoring queries use the NOLOCK table hint, because it is safe to show stale data in the dashboard.
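For illustration, here is a minimal sketch of that locking dynamic. The connection string is a placeholder and the [HangFire].[Set] column list is my assumption about the default schema; only the table hints matter here:

using System;
using System.Data.SqlClient;

class LockDemo
{
    static void Main()
    {
        const string cs = "Server=.;Database=HangfireDemo;Integrated Security=true";

        using (var writer = new SqlConnection(cs))
        {
            writer.Open();
            var tx = writer.BeginTransaction();

            // The inserted row holds an exclusive lock until Commit/Rollback,
            // simulating a long-running transaction that creates a job.
            new SqlCommand(
                "insert into [HangFire].[Set] ([Key], [Score], [Value]) values ('demo', 0, '1')",
                writer, tx).ExecuteNonQuery();

            using (var reader = new SqlConnection(cs))
            {
                reader.Open();

                // Forces shared locks even when READ_COMMITTED_SNAPSHOT is ON,
                // so this waits for the writer's transaction (here: times out).
                var locked = new SqlCommand(
                    "select count([Key]) from [HangFire].[Set] with (readcommittedlock) where [Key] = 'demo'",
                    reader) { CommandTimeout = 5 };
                try { locked.ExecuteScalar(); }
                catch (SqlException) { Console.WriteLine("blocked, as expected"); }

                // A dirty read returns immediately, which is acceptable for the dashboard.
                var dirty = new SqlCommand(
                    "select count([Key]) from [HangFire].[Set] with (nolock) where [Key] = 'demo'",
                    reader);
                Console.WriteLine(dirty.ExecuteScalar()); // counts the uncommitted row
            }

            tx.Rollback();
        }
    }
}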

What scalability issues do you have?

Thanks @odinserj.

Ok, I think I have a repro case.

Let’s assume we have a simple job:

public static void NoopJob()
{
    // NO-OP
}

We then declare a job that creates a batch within a transaction; an artificial delay of 30 seconds is added to demonstrate the issue:

public async Task CreateJob1()
{
    using (var scope = new TransactionScope(TransactionScopeOption.Required, TimeSpan.FromMinutes(2), TransactionScopeAsyncFlowOption.Enabled))
    {
        var batchId = BatchJob.StartNew(c =>
        {
            for (var i = 0; i < 10; i++)
            {
                c.Enqueue(() => NoopJob());
            }
        });

        await Task.Delay(TimeSpan.FromSeconds(30));
        scope.Complete();
    }
}

And finally, we create a harmless method that simply creates a job:

public void CreateJob2()
{
    BackgroundJob.Enqueue(() => NoopJob());
}

Now, the expected behavior: when I call CreateJob1(), which blocks for 30 seconds, and then call CreateJob2() within those 30 seconds, there should be no blocking – and indeed things work OK.

However, now open the Dashboard, call CreateJob1() (which blocks), and then call CreateJob2(), in that order – this should block. If it doesn’t, call CreateJob2() a few times; it normally blocks within 5–10 tries. The query it appears to block on is (@key nvarchar(4000))select count([Key]) from [HangFire].[Set] with (readcommittedlock) where [Key] = @key.

You’ll also notice the Dashboard is completely blocked while CreateJob1() is running – I think I’ve narrowed that down to the batches:started metric, which the Dashboard tries to load.

If I remove the batch from CreateJob1, there is no blocking anymore. So this scalability issue appears when batches are used within transactions. All of this is running within an ASP.NET Web API application.

If you have any idea what’s going on, please let me know. If not, I’ll try to find a more deterministic repro case where CreateJob2 is blocked.

Repros are great, as always! The default isolation level for the TransactionScope class is Serializable:

The isolation level of a transaction is determined when the transaction is created. By default, the System.Transactions infrastructure creates Serializable transactions.
MSDN

This means that range locks are acquired to prevent phantom reads, and they are held until the transaction is committed. So your transaction prevents other transactions from running. Try passing IsolationLevel.ReadCommitted to your transaction scope instance.
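For example, here is a sketch of CreateJob1 with the isolation level set explicitly (everything else unchanged):

using System;
using System.Threading.Tasks;
using System.Transactions;

public async Task CreateJob1()
{
    var options = new TransactionOptions
    {
        // Override the Serializable default so no range locks are taken.
        IsolationLevel = IsolationLevel.ReadCommitted,
        Timeout = TimeSpan.FromMinutes(2)
    };

    using (var scope = new TransactionScope(TransactionScopeOption.Required, options,
        TransactionScopeAsyncFlowOption.Enabled))
    {
        // Transaction.Current.IsolationLevel now reports ReadCommitted.
        var batchId = BatchJob.StartNew(c =>
        {
            for (var i = 0; i < 10; i++)
            {
                c.Enqueue(() => NoopJob());
            }
        });

        await Task.Delay(TimeSpan.FromSeconds(30));
        scope.Complete();
    }
}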

The repeatable read isolation level does not apply range locks and blocks only those job records that were modified in a transaction. The message queue implementation uses the READPAST table hint to prevent waiting on blocked records in the queue. Workers will wait on job records while they are being modified by another transaction, but this is fine: it is by design and essential to ensure the correctness of processing.
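To illustrate the READPAST semantics, here is a sketch against a hypothetical [Demo].[JobQueue] table; this is not Hangfire’s actual dequeue statement, just the skip-locked-rows idea:

using System;
using System.Data.SqlClient;

class ReadPastDemo
{
    static void Main()
    {
        const string cs = "Server=.;Database=HangfireDemo;Integrated Security=true";

        using (var worker1 = new SqlConnection(cs))
        using (var worker2 = new SqlConnection(cs))
        {
            worker1.Open();
            worker2.Open();

            // Worker 1 takes the first available row and keeps it locked
            // until its transaction completes.
            var tx = worker1.BeginTransaction();
            new SqlCommand(
                "delete top (1) from [Demo].[JobQueue] with (readpast, updlock, rowlock) output deleted.JobId",
                worker1, tx).ExecuteScalar();

            // Worker 2 does not wait on that row: READPAST skips locked rows
            // and dequeues the next available one instead.
            var next = new SqlCommand(
                "delete top (1) from [Demo].[JobQueue] with (readpast, updlock, rowlock) output deleted.JobId",
                worker2).ExecuteScalar();
            Console.WriteLine(next);

            tx.Commit();
        }
    }
}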