I believe I've found a bug in Hangfire.Pro.
We use Hangfire.Pro very extensively within our stack, with plenty of batching and continuations.
What I did was create ~44 Batches B1 > B2 > .... > B44. Each batch awaits the previous batch to complete before starting.
Within each batch are ~5000 Jobs, J1 > J2 > J3 > .... > J5000. Again, each job waits for the previous job to complete before starting.
Now this has happened to us twice, where, 1 batch (out of the 44), will have each job created with 2 continuations, each continuation pointing to the same job id. So for example, J2 will have 2 continuations, both J3, and J3, to 2 J4s etc.
This is the relevant data from Hangfire.SqlServer storage backend:
"SELECT Value FROM hangfire.JobParameter WHERE JobId = 12382621 AND Name = 'Continuations'" yields:
Now this eventually leads to the following exception when the job itself executes
Hangfire.Server.Worker : Warn, Error occurred during execution of 'Worker #279fcaeb' process. Execution will be retried (attempt 3 of 2147483647) in 00:00:09 seconds., System.ArgumentException: An item with the same key has already been added. at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource) at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add) at Hangfire.ContinuationsSupportAttribute.**ExecuteContinuationsIfExist**(ElectStateContext context) at Hangfire.ContinuationsSupportAttribute.OnStateElection(ElectStateContext context) at Hangfire.States.StateMachine.ApplyState(ApplyStateContext initialContext) at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context, BackgroundJob backgroundJob, IState toState, String oldStateName) at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context) at Hangfire.Server.Worker.Execute(BackgroundProcessContext context) at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context) at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
This is because ContinuationsSupportAttribute:ExecuteContinuationsIfExist has a variable called nextStates of type Dictionary. And when both continuations are processed, it eventually leads to a crash.
Now when this crash happens, a LOT of BAD things happen:
- The worker keeps repeatedly retrying the same job endlessly (for e.g.: Execution will be retried (attempt 3 of 2147483647))
- The user code behind the job keeps executing for every try
- Any other Hangfire Job creation intermittently blips with Wait Timeout exceptions on SQL Server. This is another thing I found surprising -- the DB load was very low (~10%), so I wonder why there needs to be a Wait Timeout when other jobs insert. Is there some sp_applock that is being held here?
This issue has happened to me exactly twice, once few months back, and once yesterday, both causing our systems significant downtime. Luckily I still had the data lying around from that time, so I could correlate the bugs.
Yesterday's crash happened on version 1.6.5. I just upgraded to the new version, however I don't see any fixes made on this file.
I'm still investigating what could lead to a job having 2 continuations with the same ID. For now, I'm adding an emergency fix on my side to filter out multiple continuations with the same ID so I dont have a repeat of this.
Some questions I would appreciate some thoughts on
1. A crash in hangfire worker code leads to Int32.MaxValue retries. Is there a way we can control this? Shouldn't this also have the same value as # of Job Retries
2. Is a global SQL Server app lock being held during such state transitions?
3. Any idea what could have triggered 2 continuations? The rest of my 43 batches were fine, only 1 batch got affected, and all jobs within the batch are affected.