We have Hangfire servers (version 1.8) running as Windows service from 4 servers (each with 60 workers). Hangfire uses its own dedicated database on SQL server 2016.
We started having an issue where all jobs get processed very slowly. Some of the observations are: We see numerous xp_userlock wait types in SQL server now. I tried to manually query Hangfire tables and I am unable to query even a single record from the JobQueue table. The following query takes forever and never returns: select top 1 Id from [HangFire].[JobQueue]. If I add nolock to the query, it immediately returns. In SQL server activity monitor, there are no blockers (i.e., blocked by, head blocker, etc) under Processes. We tried everything including deleting plan cache, server restart, etc., but nothing helps. We also see a good amount of ASYNC_NETWORK_IO wait with the below query:
WITH cte as
(
SELECT j.Id,
row_number() over (order by j.Id desc) as row_num
FROM [HangFire].Job j
WITH
(
nolock,
forceseek
)
WHERE j.StateName = @stateName
)
SELECT j.*,
s.Reason as StateReason,
s.Data as StateData,
s.CreatedAt as StateChanged
FROM [HangFire].Job j
WITH
(
nolock,
forceseek
)
INNER JOIN cte
ON cte.Id = j.Id
LEFT JOIN [HangFire].State s
WITH
(
nolock,
forceseek
)
ON j.StateId = s.Id
AND j.Id = s.JobId
WHERE cte.row_num BETWEEN @start AND @end
ORDER BY j.Id desc