Multiple continuations are created with the same JobId for all jobs in a batch


I believe I’ve found a bug in Hangfire.Pro.

We use Hangfire.Pro very extensively within our stack, with plenty of batching and continuations.

What I did was create ~44 batches, B1 > B2 > … > B44. Each batch waits for the previous batch to complete before starting.

Within each batch are ~5000 Jobs, J1 > J2 > J3 > … > J5000. Again, each job waits for the previous job to complete before starting.

This has now happened to us twice: one batch (out of the 44) ends up with every job created with 2 continuations, both pointing to the same job id. For example, J2 has 2 continuations that both point to J3, J3 has 2 that both point to J4, and so on.

This is the relevant data from the Hangfire.SqlServer storage backend:
`SELECT Value FROM hangfire.JobParameter WHERE JobId = 12382621 AND Name = 'Continuations'` yields:

Now this eventually leads to the following exception when the job itself executes:

Hangfire.Server.Worker : Warn, Error occurred during execution of 'Worker #279fcaeb' process. Execution will be retried (attempt 3 of 2147483647) in 00:00:09 seconds.,
System.ArgumentException: An item with the same key has already been added.
   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at Hangfire.ContinuationsSupportAttribute.**ExecuteContinuationsIfExist**(ElectStateContext context)
   at Hangfire.ContinuationsSupportAttribute.OnStateElection(ElectStateContext context)
   at Hangfire.States.StateMachine.ApplyState(ApplyStateContext initialContext)
   at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context, BackgroundJob backgroundJob, IState toState, String oldStateName)
   at Hangfire.States.BackgroundJobStateChanger.ChangeState(StateChangeContext context)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.ServerProcessExtensions.Execute(IServerProcess process, BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

This is because ContinuationsSupportAttribute.ExecuteContinuationsIfExist has a variable called nextStates of type Dictionary<JobId, IState>. When the second continuation with the same job id is inserted into it, the insert throws the ArgumentException shown in the trace.
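To make the failure mode concrete, here is a minimal sketch in plain C# with no Hangfire types involved. The `BuildNextStates` helper, the string state name, and the `J3` id are stand-ins I made up for illustration; the only part taken from the trace is that a `Dictionary` insert is what throws.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateKeyDemo
{
    // Hypothetical helper: builds a jobId -> next state map with one
    // Dictionary.Add call per continuation record.
    public static Dictionary<string, string> BuildNextStates(IEnumerable<string> continuationIds)
    {
        var nextStates = new Dictionary<string, string>();
        foreach (var id in continuationIds)
        {
            // Add throws ArgumentException when the key is already present.
            nextStates.Add(id, "Enqueued");
        }
        return nextStates;
    }

    public static void Main()
    {
        try
        {
            // The same continuation id recorded twice, as in the affected batch.
            BuildNextStates(new[] { "J3", "J3" });
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine("Failed as in the trace: " + ex.GetType().Name);
        }
    }
}
```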

Now when this crash happens, a LOT of BAD things happen:

  1. The worker keeps retrying the same job endlessly (e.g. "Execution will be retried (attempt 3 of 2147483647)")
  2. The user code behind the job executes again on every try
  3. Any other Hangfire job creation intermittently fails with Wait Timeout exceptions on SQL Server. This is another thing I found surprising – the DB load was very low (~10%), so I wonder why there needs to be a Wait Timeout when other jobs insert. Is there some sp_getapplock lock being held here?

This issue has happened to me exactly twice, once a few months back and once yesterday, both times causing our systems significant downtime. Luckily I still had the data lying around from the first time, so I could correlate the two incidents.

Yesterday’s crash happened on version 1.6.5. I just upgraded to the new version, however I don’t see any fixes made to this file.

I’m still investigating what could lead to a job having 2 continuations with the same ID. For now, I’m adding an emergency fix on my side to filter out multiple continuations with the same ID so I don’t have a repeat of this.
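A minimal sketch of what such a filter could look like, assuming the continuation list has already been deserialized into simple records. `ContinuationRecord` and `RemoveDuplicates` are hypothetical names for illustration, not Hangfire types, and how the filter would be hooked into Hangfire is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one deserialized continuation entry.
public sealed class ContinuationRecord
{
    public string JobId { get; set; }
    public string Options { get; set; }
}

public static class ContinuationFilter
{
    // Keeps the first record seen for each job id and preserves the order.
    public static List<ContinuationRecord> RemoveDuplicates(IEnumerable<ContinuationRecord> records)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        // HashSet.Add returns false for an id that was already seen.
        return records.Where(r => seen.Add(r.JobId)).ToList();
    }

    public static void Main()
    {
        var raw = new List<ContinuationRecord>
        {
            new ContinuationRecord { JobId = "J3", Options = "OnlyOnSucceededState" },
            new ContinuationRecord { JobId = "J3", Options = "OnlyOnSucceededState" },
            new ContinuationRecord { JobId = "J4", Options = "OnlyOnSucceededState" },
        };

        var cleaned = RemoveDuplicates(raw);
        Console.WriteLine(raw.Count + " records in, " + cleaned.Count + " out");
    }
}
```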

Some questions I would appreciate some thoughts on:

  1. A crash in Hangfire worker code leads to Int32.MaxValue retries. Is there a way we can control this? Shouldn’t this have the same value as the number of job retries?
  2. Is a global SQL Server app lock being held during such state transitions?
  3. Any idea what could have triggered 2 continuations? The rest of my 43 batches were fine, only 1 batch got affected, and all jobs within the batch are affected.

@tejasxp, thanks a lot for such a detailed bug report. I’ll investigate the root of the problem on Monday and release a fix ASAP. To handle existing workloads, I’ll fix the ContinuationsSupportAttribute class so that it doesn’t add a continuation that already exists, and I’ll investigate why it’s added twice. It looks like a race condition, since the problem happens rarely, but it’s strange that all the jobs in the batch are affected. It also looks like a Hangfire.Core problem, but it may affect batch continuations too, since the code is almost the same, so I’ll fix those as well.

The timeout exceptions are caused by the sp_getapplock stored procedure, which is used to provide the distributed locks feature, so it shouldn’t affect other database queries.

Thanks @odinserj

One more question - what is the critical resource being locked with sp_getapplock when state changes happen? Do you know of a command that can list currently held locks?

I’m trying to understand the bottlenecks as my system scales, and whether we can move towards a lock-less design – basically, we’d rather have job executions fail or take a small performance hit than have any insert operations fail.

Hi @tejasxp, I’ve just released Hangfire 1.6.10 with a fix for continuations. Job continuations are added outside of the change-state transaction due to distributed lock lifetime nuances, so it is possible that a continuation is added but the outer transaction then fails. Combine this with the retry mechanics, and we get duplicate records for job continuations.
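The sequence described here can be modelled in a few lines. This is only a sketch under stated assumptions: the `stored` list stands in for the parent job's continuation records, the exception stands in for the failed outer transaction, and `CreateWithRetry` is a hypothetical name, not Hangfire code.

```csharp
using System;
using System.Collections.Generic;

public static class RetryDuplicationDemo
{
    // Records the continuation, then "commits" the outer transaction.
    // The first failuresBeforeSuccess attempts fail at the commit step.
    // Returns the number of attempts that were needed.
    public static int CreateWithRetry(List<string> stored, string continuationId, int failuresBeforeSuccess)
    {
        var attempt = 0;
        while (true)
        {
            attempt++;
            try
            {
                // Written outside the transaction, so a later failure
                // does not roll this record back.
                stored.Add(continuationId);

                if (attempt <= failuresBeforeSuccess)
                    throw new InvalidOperationException("outer transaction failed");

                return attempt;
            }
            catch (InvalidOperationException)
            {
                // Retry the whole operation, including the write above.
            }
        }
    }

    public static void Main()
    {
        var stored = new List<string>();
        CreateWithRetry(stored, "J3", 1);
        Console.WriteLine(string.Join(", ", stored)); // prints "J3, J3"
    }
}
```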

I’ve made two changes: the first prevents multiple continuations with the same job id from being added, so that new continuations work correctly, and the second skips duplicate records, so that existing duplicates no longer lead to an exception. I’ll also consider how to isolate such exceptions to limit their consequences, so that background processing can proceed seamlessly.
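As an illustration of the first change, here is a sketch of an idempotent write: the continuation is only recorded if its job id is not already present, so a retried operation cannot add it twice. `AddOnce` and `CreateWithRetry` are hypothetical names, and the list again stands in for the stored continuation records; this is not the actual 1.6.10 code.

```csharp
using System;
using System.Collections.Generic;

public static class IdempotentContinuations
{
    // Adds the id only if it is not recorded yet; returns true when it was added.
    public static bool AddOnce(List<string> stored, string continuationId)
    {
        if (stored.Contains(continuationId)) return false;
        stored.Add(continuationId);
        return true;
    }

    // Same retry shape as a failing outer transaction would produce,
    // but the write is now safe to repeat.
    public static void CreateWithRetry(List<string> stored, string continuationId, int failuresBeforeSuccess)
    {
        for (var attempt = 1; ; attempt++)
        {
            AddOnce(stored, continuationId);
            if (attempt > failuresBeforeSuccess) return;
            // Otherwise the simulated transaction failed; loop and retry.
        }
    }

    public static void Main()
    {
        var stored = new List<string>();
        CreateWithRetry(stored, "J3", 3);
        Console.WriteLine(string.Join(", ", stored)); // prints "J3"
    }
}
```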

Regarding locks, every background job is protected by a distributed lock during the state change to prevent race conditions. There are no global locks; they are granular, one per background job.
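To illustrate what "one lock per background job" means, here is an in-process sketch that keys a lock object by job id. It is an analogy only: the real locks are distributed and taken through SQL Server, and the resource name format used below is an assumption of mine, as are all the names in the class.

```csharp
using System;
using System.Collections.Concurrent;

public static class JobLocks
{
    private static readonly ConcurrentDictionary<string, object> Locks =
        new ConcurrentDictionary<string, object>();

    // Assumed naming scheme: one lock resource per job id.
    public static string ResourceFor(string jobId)
    {
        return "job:" + jobId + ":state-lock";
    }

    // Returns the lock object for this job, creating it on first use.
    public static object GetLock(string jobId)
    {
        return Locks.GetOrAdd(ResourceFor(jobId), _ => new object());
    }

    // State changes of the same job are serialized; other jobs are not blocked.
    public static void ChangeState(string jobId, Action transition)
    {
        lock (GetLock(jobId))
        {
            transition();
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReferenceEquals(GetLock("1"), GetLock("1"))); // True
        Console.WriteLine(ReferenceEquals(GetLock("1"), GetLock("2"))); // False
    }
}
```

With this shape, a stuck state change on one job can only make callers waiting on that same job time out, which matches the "no global locks" statement above.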

@tejasxp, there was a problem with the state changing pipeline: a buggy filter could cause infinite retries when it threw a non-transient exception. I’ve just released Hangfire 1.6.12 with a fix for this problem. Now, if an exception occurs while applying a state, the apply logic is retried 10 times. If it still fails, the background job is moved to the Failed state without calling any state filters, so buggy filters will not cause cascading failures anymore.
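The bounded-retry behaviour can be sketched roughly as follows. This is not the 1.6.12 implementation: `ApplyState`, the string state names, and the reading of "10 times" as 10 attempts in total are assumptions made for the example.

```csharp
using System;

public static class BoundedApply
{
    public const int MaxAttempts = 10; // assumed: 10 attempts in total

    // Tries to apply the target state through the filter pipeline.
    // Returns the name of the state the job ends up in.
    public static string ApplyState(string targetState, Action runFiltersAndApply)
    {
        for (var attempt = 1; attempt <= MaxAttempts; attempt++)
        {
            try
            {
                runFiltersAndApply();
                return targetState;
            }
            catch (Exception)
            {
                // Sketch only: swallow the error and try again.
            }
        }

        // Give up: fall back to Failed without invoking any filters,
        // so a buggy filter cannot throw again here.
        return "Failed";
    }

    public static void Main()
    {
        var calls = 0;
        var result = ApplyState("Succeeded", () =>
        {
            calls++;
            throw new InvalidOperationException("buggy filter");
        });

        Console.WriteLine(result + " after " + calls + " attempts"); // prints "Failed after 10 attempts"
    }
}
```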