SQL Server CPU and JobQueue Paging

I have encountered a problem where the database server CPU rises to 100%, and I believe the root cause has to do with how the JobQueue indexes are paged and how locks are held. This is the scenario:

  • A large number of workers (~300-600) process a heterogeneous mix of short-running and long-running jobs; job duration depends on the input. Jobs are processed at roughly 70/second.
  • The longer-running jobs hold locks on the JobQueue table. At the same time, the high rate of jobs increases the page count and fragmentation of the indexes on the JobQueue table.
  • When the workers try to dequeue jobs, they need to perform many more reads because of the page count and fragmentation of the indexes. This results in each dequeue using more of the CPU on the database server.
  • Each dequeue query also attempts to obtain a lock on the JobQueue table, effectively ensuring the index pages will never be consolidated. This creates a fast snowball effect on the database server CPU.

Unfortunately, my attempts to reproduce this outside of a production environment have, so far, failed. Even at production load, the problem only happens once in a while (every 1-2 days), and it is not directly caused by higher volumes or long-running jobs, although the correlation appears to hold.

At the moment, the only solution I have is to periodically rebuild the indexes on the JobQueue table while killing all SPIDs blocking the rebuild query. This is actually relatively safe, as Hangfire is resilient to this type of random noise. That said, it means I can’t use Hangfire out of the box without this modification.
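The mitigation I run looks roughly like this (an illustrative sketch, not my production script; the [HangFire] schema name is an assumption, and KILLing sessions this way will roll back in-flight dequeues):

```sql
-- Kill every session holding an object-level lock on JobQueue,
-- then rebuild all of its indexes.
DECLARE @spid INT;
DECLARE blockers CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT l.request_session_id
    FROM sys.dm_tran_locks AS l
    WHERE l.resource_type = 'OBJECT'
      AND l.resource_associated_entity_id = OBJECT_ID('[HangFire].[JobQueue]')
      AND l.request_session_id <> @@SPID;

OPEN blockers;
FETCH NEXT FROM blockers INTO @spid;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC ('KILL ' + CAST(@spid AS VARCHAR(10)));  -- KILL requires dynamic SQL
    FETCH NEXT FROM blockers INTO @spid;
END;
CLOSE blockers;
DEALLOCATE blockers;

ALTER INDEX ALL ON [HangFire].[JobQueue] REBUILD;
```

Hangfire's retry behavior is what makes the rollbacks caused by KILL tolerable, but having to schedule this externally is the part I'd like to avoid.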

Would it be possible to either include a similar mitigation as a background thread in Hangfire itself or to change the locking/indexing on the JobQueue table to prevent this from happening?

I haven’t tried this yet, but our DBA suggested the following:

This seems like it should work. I’ll update once I have evidence to that effect.

The index is probably irrelevant. FetchedAt is unused, and we only have one Queue. The issue most likely stems from a combination of index fragmentation and lock contention on the table, but we don’t have any definitive answers yet.
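To confirm the fragmentation side of that hypothesis when the CPU spikes recur, a standard DMV query along these lines should show page count and fragmentation per index (the [HangFire] schema name is again an assumption):

```sql
-- Report page count and logical fragmentation for each JobQueue index.
SELECT i.name AS index_name,
       ps.page_count,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('[HangFire].[JobQueue]'),
         NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;
```

Capturing this alongside the CPU graph every few minutes would show whether page count and fragmentation actually climb ahead of the 100% CPU episodes.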