Reliability of background job processing

This version should be considered a DRAFT, because it is heavy on terminology and contains a number of grammar mistakes. But if you have enough patience, you can read it. You have been warned ;).

In this post I will talk only about the reliability of HangFire itself – your storage may become corrupted, and HangFire cannot do anything in that case.

To keep things under control, HangFire always tries to keep its data consistent. To achieve this, it uses atomic calls whenever possible. When an atomic call is not possible due to some constraint (for example, Redis can't prevent us from dirty reads), there are compensation mechanisms, and I call this kind of write semi-atomic.

Background job creation

The BackgroundJobClient.Create method (and its extensions, such as Enqueue and Schedule) should be considered atomic. Once it has completed, your job is written to the storage and will be processed sooner or later. If it throws an exception, your job will not be processed.
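
For illustration, here is a minimal sketch of creating jobs through the public client API (the namespace is spelled "HangFire" in early versions and "Hangfire" in later ones):

    // A minimal sketch of background job creation. Enqueue and Schedule are
    // extension methods that call BackgroundJobClient.Create under the hood.
    using System;
    using Hangfire; // "HangFire" in early versions

    public static class JobCreationExample
    {
        public static void CreateJobs()
        {
            var client = new BackgroundJobClient();

            // Fire-and-forget job: created and moved to the Enqueued state.
            client.Enqueue(() => Console.WriteLine("Hello from a background job!"));

            // Delayed job: created and moved to the Scheduled state.
            client.Schedule(() => Console.WriteLine("Hello a bit later!"), TimeSpan.FromHours(1));
        }
    }

If Create throws, neither call leaves a processable job behind – exactly the guarantee described above.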

However, strictly speaking, this method is only considered atomic, because it actually consists of two calls:

  1. Atomic: Create a job in the "Created" state and set its expiry to 1 hour.
  2. Atomic: Change its state to the given one and remove the expiration.

All these atomic calls are implemented as a single statement or use an atomic transaction. The Redis implementation uses the MULTI command to provide atomicity; the SQL Server implementation uses the TransactionScope class to wrap commands in regular transactions.
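
As a rough illustration of the SQL Server approach, the first atomic call could be wrapped like this; the table and column names below are illustrative, not the actual schema:

    // A hedged sketch of a single atomic write wrapped in TransactionScope.
    using System;
    using System.Data.SqlClient;
    using System.Transactions;

    public static class AtomicWriteExample
    {
        public static void CreateJobRecord(string connectionString)
        {
            using (var scope = new TransactionScope())
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                var command = new SqlCommand(
                    @"insert into Job (StateName, CreatedAt, ExpireAt)
                      values ('Created', getutcdate(), dateadd(hour, 1, getutcdate()))",
                    connection);
                command.ExecuteNonQuery();

                // Either everything inside the scope is committed, or nothing is.
                scope.Complete();
            }
        }
    }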

If the host process is terminated between these calls, the job remains in the "Created" state (which has no handlers at all), and it will expire after the given interval. If the second atomic statement is called after the job has expired, an exception should be thrown.

Background job processing

The way HangFire Server processes a job is defined by its state.

Enqueued State

Jobs in the Enqueued state are processed by a number of workers. Each worker performs the following steps (a sketch of this loop is shown after the list):

  1. Fetch the next job and make it invisible to other workers.
  2. Change the state from "Enqueued" or "Processing" to "Processing"; if the current state is neither, go to step 5.
  3. Perform the job.
  4. Change the state from "Processing" to "Succeeded" or "Failed".
  5. Remove the job from the queue.
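
Here is a simplified, self-contained sketch of that worker loop; all type and member names are illustrative and do not correspond to HangFire's actual internals:

    // A sketch of the five-step worker loop described above.
    using System;

    public interface IFetchedJob
    {
        string JobId { get; }
        void RemoveFromQueue(); // step 5
    }

    public interface IJobStorage
    {
        IFetchedJob FetchNextJob();                                             // step 1
        bool TryChangeState(string jobId, string[] expected, string newState);  // steps 2 and 4
        bool PerformJob(string jobId);                                          // step 3, returns success
    }

    public class Worker
    {
        private readonly IJobStorage _storage;
        private volatile bool _shutdownRequested;

        public Worker(IJobStorage storage) { _storage = storage; }

        public void Loop()
        {
            while (!_shutdownRequested)
            {
                // 1. Fetch the next job; it becomes invisible to other workers.
                var fetched = _storage.FetchNextJob();

                // 2. Move the job to Processing only from Enqueued or Processing.
                if (_storage.TryChangeState(fetched.JobId,
                        new[] { "Enqueued", "Processing" }, "Processing"))
                {
                    // 3. Perform the job, then 4. set the result state.
                    var succeeded = _storage.PerformJob(fetched.JobId);
                    _storage.TryChangeState(fetched.JobId,
                        new[] { "Processing" }, succeeded ? "Succeeded" : "Failed");
                }

                // 5. Remove the job from the queue in any case.
                fetched.RemoveFromQueue();
            }
        }
    }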

There is no efficient way to make the whole process atomic. In Redis, this kind of coarse-grained atomicity would lead to a large number of retries caused by the WATCH command, which requires re-running the transaction each time a watched value changes. In SQL Server, it would require wrapping the whole process in a single SQL transaction and taking locks to prevent dirty reads; however, that would be a very long-running transaction with many excessive locks and deadlock possibilities.

So we keep things simple and provide compensation mechanisms instead, to reduce locking and unnecessary retries. Each step uses atomic or semi-atomic storage writes. Let's discuss them.

1. Job fetching

The SQL Server implementation makes this step fully atomic by using an UPDATE statement with an OUTPUT clause. During this phase, the worker gets the next job and sets its fetched timestamp (which participates in the invisibility timeout mechanism). To be reentrant, the step relies on this invisibility timeout – if the process terminates after the operation was invoked, the job becomes available again once the timeout expires.
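
A hedged sketch of such a "fetch and make invisible" statement is shown below; the table and column names, as well as the timeout handling, are illustrative only:

    // A sketch of atomically fetching the next job with UPDATE ... OUTPUT.
    using System;
    using System.Data.SqlClient;

    public static class JobQueueFetcher
    {
        public static long? FetchNextJobId(
            SqlConnection connection, string queue, int invisibilityTimeoutSeconds)
        {
            var command = new SqlCommand(
                @"update top (1) JobQueue
                  set FetchedAt = getutcdate()
                  output inserted.JobId
                  where Queue = @queue
                    and (FetchedAt is null
                         or FetchedAt < dateadd(second, -@timeout, getutcdate()))",
                connection);

            command.Parameters.AddWithValue("@queue", queue);
            command.Parameters.AddWithValue("@timeout", invisibilityTimeoutSeconds);

            // The UPDATE both marks the row as fetched and returns it in a single
            // atomic statement, so no other worker can grab the same job. Rows whose
            // FetchedAt is older than the invisibility timeout become visible again.
            var result = command.ExecuteScalar();
            return result == null ? (long?)null : Convert.ToInt64(result);
        }
    }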

The Redis implementation is not atomic itself and consists of the following steps:

  1. Fetch the next job with the BRPOPLPUSH command, moving it to the fetched list (where aborted jobs can be found).
  2. Set the fetched timestamp.

In the Redis implementation, reentrancy is more complex, because Redis has no efficient atomic "find-and-modify" operation for lists (using the WATCH command on a job queue is not efficient). For this implementation there is another component, the Fetched Jobs Watcher, which watches all "fetched lists" and re-queues jobs whose fetched timestamp has timed out.

But if the process was terminated between steps 1 and 2, the job has no fetched timestamp set. In this case the queued job is in an indeterminate state, because we can't tell anything about step 2: it may be performed some time later, or never performed at all. So we set a checked timestamp and wait for the second pass. If on the second pass the job has a fetched timestamp set, we ignore the checked timestamp from then on. Otherwise, we move the job back to its queue once the checked timestamp has expired.
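
The following sketch illustrates the watcher's decision logic for a single job found in a fetched list; the type names and timeout values are assumptions made for the example:

    // A sketch of the Fetched Jobs Watcher's per-job decision.
    using System;

    public class FetchedJobInfo
    {
        public DateTime? FetchedAt { get; set; } // set in step 2 of fetching
        public DateTime? CheckedAt { get; set; } // set by the watcher itself
    }

    public static class FetchedJobsWatcherExample
    {
        static readonly TimeSpan InvisibilityTimeout = TimeSpan.FromMinutes(30); // arbitrary
        static readonly TimeSpan CheckedTimeout = TimeSpan.FromMinutes(1);       // arbitrary

        // Returns true when the job should be moved back to its queue.
        public static bool ShouldRequeue(FetchedJobInfo job, DateTime utcNow)
        {
            if (job.FetchedAt.HasValue)
            {
                // Normal case: re-queue only after the invisibility timeout expires.
                return utcNow - job.FetchedAt.Value > InvisibilityTimeout;
            }

            if (!job.CheckedAt.HasValue)
            {
                // Indeterminate state: mark the job and wait for the second pass.
                job.CheckedAt = utcNow;
                return false;
            }

            // Second pass and still no fetched timestamp: re-queue the job
            // once the checked timestamp has expired.
            return utcNow - job.CheckedAt.Value > CheckedTimeout;
        }
    }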

2. Change state to Processing

The state change itself is not atomic. This process can be extended with user code, and I don't want to prolong the transaction lifetime or retry it many times (as described above). It consists of the following steps (a sketch follows the list):

  1. Acquire a distributed lock for the job, and while holding it:
    a. Get the job's state.
    b. Compare the actual state with the expected ones; if it does not match, exit.
    c. Change the state.
  2. Release the lock.
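
A sketch of this state-changing procedure might look like the following; the interface and its members are illustrative, not the actual HangFire API:

    // A sketch of a state transition guarded by a distributed lock.
    using System;

    public interface IStateStorage
    {
        IDisposable AcquireDistributedLock(string resource, TimeSpan timeout);
        string GetJobState(string jobId);
        void SetJobState(string jobId, string newState); // atomic write-only transaction
    }

    public static class StateChanger
    {
        public static bool TryChangeState(
            IStateStorage storage, string jobId, string[] expectedStates, string newState)
        {
            // 1. Acquire a distributed lock for the job.
            using (storage.AcquireDistributedLock("job:" + jobId, TimeSpan.FromSeconds(15)))
            {
                // 1a. Get the job's current state.
                var currentState = storage.GetJobState(jobId);

                // 1b. Compare it with the expected states; exit if it does not match.
                if (Array.IndexOf(expectedStates, currentState) < 0)
                {
                    return false;
                }

                // 1c. Change the state – the only write, and it is atomic.
                storage.SetJobState(jobId, newState);
                return true;
            } // 2. The lock is released when the using block ends.
        }
    }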

The distributed lock is held there to prevent dirty reads. Since any state transition is processed within the lock, there is a guarantee that the fetched job state is always up to date. Yes, this is a weak guarantee, because you could change the state manually between steps 1a and 1b. But you simply should not violate encapsulation; use only the HangFire API to make changes.

The state comparison is a compensation mechanism that provides safe state transitions at the top level. If someone (a user or another process) changes the state of a job during processing, everything remains under control. For example, if a user changes the state of a job that was fetched, but not yet moved to the Processing state, to Failed, the job will not be processed – and this is the expected behavior.

The actual state change is implemented using a so-called Write-Only Transaction (to fully correspond with Redis semantics) and is atomic. Since this is the only step that actually changes the data, the whole state-changing process can be called atomic.

3. Job performance

Job performance is an extensible process that aims to perform a background job, i.e. call its method with the given arguments. This process is controlled by user-defined filters, and the filters themselves should ensure their own correctness. No other writes are made.

4. Set the result state

The state transition was described earlier and is an atomic operation. To ensure its correctness, the state is changed only for jobs whose current state is Processing.

5. Remove the job from the queue

Job removal is an atomic operation, and therefore it cannot lead to inconsistent data.

Points of failure

Of course, since the whole operation is non-atomic, it can lead to an inconsistent state. So let's walk through the failure points of the whole process and make sure there is a corresponding compensation for every case of inconsistency. We'll start from the end. Since each step itself guarantees that it doesn't corrupt data if terminated mid-way, we only need to consider failures that happen between the steps.

5-x. The job was processed and removed from the queue. This is the success point, and the whole process may terminate gracefully here.

4-5. The job was processed and its state was changed to a result state (Succeeded or Failed), but it was not removed from the queue. In this case the job is still in the queue. After the invisibility timeout expires, it will be fetched by a worker, processing will stop at step 2 because of the state mismatch, and the worker will skip to step 5 without performing the job. As a result, the job will be removed from the queue.

3-4. The job was performed, but it is still in the "Processing" state. After the invisibility timeout expires, it will be fetched by another worker and will pass the second step, because that step allows the transition from the "Processing" state, so the job will be performed again.

2-3. The state is "Processing", but the job was not performed. Handled the same way as the 3-4 failure point.

1-2. The state is "Enqueued", and the job was not performed. Handled the same way as the 3-4 failure point, because step 2 also allows the transition from the "Enqueued" state.

Scheduled state

This state involves only one write operation – the transition from the Scheduled to the Enqueued state, which is atomic itself.
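
For illustration, the component performing this write could look like the following sketch, reusing the atomic state-change mechanism described above; all names here are assumptions:

    // A sketch of moving due jobs from the Scheduled to the Enqueued state.
    using System;
    using System.Collections.Generic;

    public interface IScheduleStorage
    {
        // Ids of jobs whose scheduled time is earlier than the given moment.
        IEnumerable<string> GetDueJobIds(DateTime utcNow);
        bool TryChangeState(string jobId, string[] expectedStates, string newState);
    }

    public static class DelayedJobSchedulerExample
    {
        public static void EnqueueDueJobs(IScheduleStorage storage)
        {
            foreach (var jobId in storage.GetDueJobIds(DateTime.UtcNow))
            {
                // A single atomic write per job: Scheduled -> Enqueued.
                storage.TryChangeState(jobId, new[] { "Scheduled" }, "Enqueued");
            }
        }
    }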

Man, I fail to see the grammatical errors you warned us about – on the contrary, I am very pleased with the level of detail and your philosophy in analysing the whole subject.

This article is a great source of information about background processing, and I have learned a lot from it in terms of concepts and philosophy.

If I may humbly suggest, you really should take a look at RavenDB – it is fully ACID and atomic by nature, especially version 3.0, and what's more, NServiceBus now internally depends on it. So if an enterprise service bus like NServiceBus uses RavenDB as its infrastructure, I think Hangfire would benefit a lot from it.

And I think you should move to the OWIN pipeline – if you read the unbelievable new ASP.NET announcement today, I am sure you will agree.

RavenDB is cool, but its implementation requires some effort that has opportunity costs. As of now, HangFire supports SQL Server (simple, but uses polling), SQL Server + MSMQ (average simplicity, low latency) and Redis (fast, low latency, but requires additional infrastructure); this fully covers the basic needs.

I have no plans to implement it myself, but it would be awesome if it were implemented as a community contribution.

Heh, that is an incredible announcement! Some time ago I thought that OWIN support was a "far-future need", so I was wrong, and I'm happy about it :slight_smile: I need to update the feature status, since ASP.NET vNext requires it.

Thanks a lot for the detailed reply.

Personally, I am too naive to understand whether SQL Server is adequate enough :slight_smile: but I take it from your word that SQL Server + MSMQ will solve the latency and consistency problems, so it is good enough.

ASP.NET vNext shocked everyone, and everybody was thinking the same – that OWIN was very far off. And this brings me to an idea: if Hangfire is done in OWIN, then it should benefit from the cross-platform support that vNext will offer, and as you know, SQL Server does not exist outside Windows. So what is your advice? Should this be a topic of its own?

Redis – it works on both Linux and Mac OS! The SQL Server job storage implementation was added only because of two facts:

  • There was no production-ready Redis version for Windows (but now it is available).
  • Redis setup requires additional learning and project infrastructure changes.

Thanks,

So what you suggest is that once Hangfire moves to OWIN and vNext, if I want to run outside Windows I must choose Redis.

Also, if you checked the news, Microsoft released an official caching provider for Redis.

I just came across this article about vNext where they say that they will reimplement all System.Web functionality in vNext:

http://davidfowl.com/asp-net-vnext/

Omg, this is incredible! Some time ago I almost moved to Ruby on Rails, because ASP.NET MVC progress was very slow. But now… I can't believe this is real!

I am so glad you liked it :smile:

Shouldn't the information provided here be part of the documentation? It's nice to find the documentation in one place. Right now, for example, I am searching the forum for the post that describes how to write a new storage provider. I remember seeing it here somewhere; it would be very handy if it were listed under the documentation.