This version should be considered a DRAFT: it is heavy on terminology and may still contain mistakes. But if you have enough patience, you can read it. You have been warned ;).
In this post I will talk only about the reliability of HangFire itself – your storage may become corrupted, and HangFire cannot do anything about that.
To keep things under control, HangFire always tries to keep its data consistent. To achieve this, it uses atomic calls whenever possible. When an atomic call is not possible due to some constraint (for example, Redis cannot prevent us from dirty reads), there are compensation mechanisms, and I call this kind of write semi-atomic.
Background job creation
The BackgroundJobClient.Create method (and extensions such as Schedule) should be considered atomic. Once it has completed, your job is written to the storage and will be processed sooner or later. If it throws an exception, your job will not be processed.
However, strictly speaking, this method is only considered atomic, because it actually consists of two calls:
- Atomic: create a job in the “Created” state and set its expiry to 1 hour.
- Atomic: change its state to the given one and remove the expiration.
All these atomic calls are implemented as a single statement or an atomic transaction. The Redis implementation uses the MULTI command to provide atomicity, and the SQL Server implementation uses the TransactionScope class to wrap commands in regular transactions.
If the host process is terminated between these calls, the job remains in the “Created” state (which has no handlers at all) and will expire after the given interval. If the second atomic statement is called after the job has expired, an exception should be raised.
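The two calls above can be sketched like this. This is a minimal in-memory illustration, not HangFire's actual storage code; the `storage` dictionary and helper names are my own assumptions:

```python
import time
import uuid

EXPIRY_SECONDS = 3600  # jobs left in "Created" expire after 1 hour

storage = {}  # job_id -> {"state": ..., "expire_at": ...}

def create_job():
    """Atomic call 1: write the job in the "Created" state with an expiry."""
    job_id = str(uuid.uuid4())
    storage[job_id] = {"state": "Created", "expire_at": time.time() + EXPIRY_SECONDS}
    return job_id

def set_initial_state(job_id, state):
    """Atomic call 2: move to the target state and remove the expiration."""
    job = storage.get(job_id)
    if job is None or (job["expire_at"] is not None and job["expire_at"] < time.time()):
        # the job expired before the second call - same effect as an exception
        raise KeyError("job expired or missing")
    job["state"] = state
    job["expire_at"] = None

job_id = create_job()
set_initial_state(job_id, "Enqueued")
```

If the process dies between the two calls, the record in the “Created” state simply expires; nothing is ever processed from that state.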
Background job processing
The way HangFire Server processes a job is defined by the job's state.
Jobs in the Enqueued state are processed by a number of workers. Each worker performs the following steps:
- Fetch the next job and make it invisible to other workers.
- Change the state from “Enqueued” (or “Processing”) to “Processing”; if the transition fails, go to step 5.
- Perform the job.
- Change the state from “Processing” to “Succeeded” or “Failed”.
- Remove the job from the queue.
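The worker loop above can be sketched as follows. This is an illustrative model, not HangFire's API; the queue, the fetched list, and the state-transition helper are assumptions:

```python
from collections import deque

queue = deque(["job-1"])          # enqueued job ids
fetched = []                      # fetched jobs, invisible but recoverable
states = {"job-1": "Enqueued"}

def try_change_state(job_id, from_states, to_state):
    """Atomic compare-and-change; returns False on a state mismatch."""
    if states[job_id] not in from_states:
        return False
    states[job_id] = to_state
    return True

def worker_iteration(perform):
    # 1. Fetch the next job and hide it from other workers.
    job_id = queue.popleft()
    fetched.append(job_id)
    # 2. Enqueued/Processing -> Processing; on mismatch skip to step 5.
    if try_change_state(job_id, {"Enqueued", "Processing"}, "Processing"):
        # 3. Perform the job.
        try:
            perform(job_id)
            # 4. Processing -> Succeeded (or Failed on exception).
            try_change_state(job_id, {"Processing"}, "Succeeded")
        except Exception:
            try_change_state(job_id, {"Processing"}, "Failed")
    # 5. Remove the job from the queue for good.
    fetched.remove(job_id)

worker_iteration(lambda job_id: None)
```

Note that on a state mismatch in step 2 the worker still executes step 5, so a job whose state was changed from the outside is cleanly removed without being performed.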
There is no efficient way to make the whole process atomic. In Redis, this kind of large-scale atomicity leads to a large number of retries caused by the WATCH command, which requires re-running the transaction every time a watched value changes. The SQL Server implementation would require us to put the whole process inside a single SQL transaction and take locks to prevent dirty reads. However, that would be a very long-running transaction, with many excessive locks and deadlock possibilities.
So we keep things simple and provide compensations to reduce locking and unnecessary retries. Each step uses atomic or semi-atomic storage writes. Let's discuss them.
1. Job fetching
The SQL Server implementation makes this step fully atomic by using an UPDATE statement with an OUTPUT clause. During this phase, a worker gets the next job and sets its fetched timestamp (which participates in the invisibility timeout mechanism). Reentrancy is provided by the invisibility timeout: if the process terminates after the operation has been invoked, the job becomes available again once this timeout expires.
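The effect of that single statement can be approximated in memory like this. It is a sketch of fetch-with-invisibility-timeout semantics, not the actual SQL; the lock below stands in for the atomicity of the UPDATE statement, and the field names are assumptions:

```python
import threading
import time

INVISIBILITY_TIMEOUT = 30 * 60  # seconds, an assumed value

jobs = [{"id": 1, "fetched_at": None}]
_lock = threading.Lock()  # stands in for the atomicity of UPDATE ... OUTPUT

def fetch_next_job(now=None):
    """Atomically take the next visible job and stamp its fetched timestamp."""
    now = time.time() if now is None else now
    with _lock:
        for job in jobs:
            fetched_at = job["fetched_at"]
            # visible if never fetched, or its invisibility timeout has expired
            if fetched_at is None or now - fetched_at > INVISIBILITY_TIMEOUT:
                job["fetched_at"] = now
                return job
    return None
```

A job fetched by a crashed worker keeps its old fetched timestamp, so the same function hands it to another worker once the timeout passes.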
The Redis implementation is not atomic itself, and consists of the following steps:
- Fetch the next job with the BRPOPLPUSH command, moving it to the fetched list (where aborted jobs can be found).
- Set the fetched timestamp.
In the Redis implementation, reentrancy is more complex, because Redis has no efficient atomic “find-and-modify” operation for lists (a WATCH command on a job queue is not efficient). For this implementation there is another component, the Fetched Jobs Watcher, which watches all the “fetched lists” and re-queues jobs whose fetched timestamp has timed out.
But if the process was terminated between steps 1 and 2, the job has no fetched timestamp set. In this case the queued job is in an indeterminate state, because we cannot tell anything about step 2: it may still be performed after some time, or never performed at all. So we set a checked timestamp and wait for a second pass. If the job has a fetched timestamp on the second pass, we ignore the checked timestamp from then on. Otherwise, once the checked timestamp has expired, we move the job back to its queue.
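This two-pass logic can be sketched as follows. It is an illustrative model of the watcher, with assumed field names and timeout values:

```python
FETCHED_TIMEOUT = 30 * 60  # seconds, an assumed value
CHECKED_TIMEOUT = 60       # seconds, an assumed value

def watch_fetched_job(job, queue, now):
    """One watcher pass over a single job found in a fetched list."""
    if job.get("fetched_at") is not None:
        # Normal case: re-queue only when the fetched timestamp has timed out.
        if now - job["fetched_at"] > FETCHED_TIMEOUT:
            queue.append(job)
            return "requeued"
        return "kept"
    # No fetched timestamp: indeterminate (crash between steps 1 and 2?).
    if job.get("checked_at") is None:
        job["checked_at"] = now   # first pass: mark it and wait
        return "checked"
    if now - job["checked_at"] > CHECKED_TIMEOUT:
        queue.append(job)         # second pass: still no timestamp, re-queue
        return "requeued"
    return "kept"
```

The checked timestamp gives the in-flight fetch a grace period: a job is re-queued only after the watcher has seen it without a fetched timestamp twice, far enough apart.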
2. Change state to Processing
A state change is not atomic by itself. This process can be extended with user code, and I don't want to prolong transaction lifetimes or retry them many times (as described above). It consists of the following steps:
1. Acquire a distributed lock for the job.
   a. Get the job's state.
   b. Compare the actual state to the expected ones; if they do not match, exit.
   c. Change the state.
2. Release the lock.
The distributed lock is held to prevent dirty reads. Since every state transition is processed within the lock, there is a guarantee that the fetched job state is always current. Yes, this is a weak guarantee, because you can still change the state between steps 1a and 1b manually. But you simply should not violate encapsulation; use only the HangFire API to make changes.
The state comparison is a compensation that provides safe state transitions at the top level. If someone (a user or another process) changes the state of a job during processing, everything remains under control. For example, if a user changes a job that was fetched, but not yet moved to the Processing state, to Failed, the job will not be performed, and this is the expected behavior.
The actual state change is implemented using a so-called Write-Only Transaction (to fully correspond with Redis semantics) and is atomic. Since this is the only step that actually changes the data, the whole state-changing process can be considered atomic.
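Putting steps 1a–1c together, a state change looks roughly like this sketch. The per-job lock here stands in for HangFire's distributed lock, and the in-memory dictionary for its Write-Only Transaction; both are assumptions:

```python
import threading

job_locks = {}                    # job_id -> lock; stands in for a distributed lock
job_states = {"job-1": "Enqueued"}

def change_state(job_id, expected_states, new_state):
    """Compare the current state under a lock, then change it atomically."""
    lock = job_locks.setdefault(job_id, threading.Lock())
    with lock:                                # 1. acquire the distributed lock
        current = job_states[job_id]          # 1a. get the job's state
        if current not in expected_states:    # 1b. mismatch -> exit
            return False
        job_states[job_id] = new_state        # 1c. the single atomic write
        return True                           # 2. lock released on exit
```

The only write happens in step 1c, which is why the process as a whole behaves atomically even though the surrounding steps are plain reads.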
3. Job performance
Job performance is an extensible process aimed at performing a background job, i.e. calling its method with the given arguments. The process is controlled by user-defined filters, and the filters themselves should ensure their own correctness. No other writes are made.
4. Set the result state
The state transition was described earlier and is an atomic operation. To ensure its correctness, the state is changed only for jobs whose current state is “Processing”.
5. Remove a job from queue
Job removal is an atomic operation, and therefore it cannot lead to inconsistent data.
Points of failure
Of course, since the whole operation is non-atomic, it can lead to an inconsistent state. So let's walk through the failure points of the whole process and make sure there is a corresponding compensation for every case. We will start from the end. Since each step itself guarantees that it does not corrupt data when terminated mid-way, we only need to consider failures that happen between steps.
5-x. The job is processed and removed from the queue. This is the success point, and the whole process may be terminated gracefully here.
4-5. The job was processed and its state was changed to a result state (Succeeded or Failed), but it was not removed from the queue. In this case the job is still in the queue. After the invisibility timeout expires, it will be fetched by a worker, processing will stop at step 2 because of the state mismatch, and the worker will go to step 5 without performing the job. As a result, the job will be removed.
3-4. The job was performed, but it is still in the “Processing” state. After the invisibility timeout expires, it will be fetched by another worker and will pass the second step, because that step includes the transition from the “Processing” state, so the job will be performed again.
2-3. The state is “Processing”, and the job was not performed. Same as the 3-4 failure point.
1-2. The state is “Enqueued”, and the job was not performed. Same as the 3-4 failure point, because step 2 includes the transition from the “Enqueued” state.
This state involves only one write operation, the transition from the “Scheduled” to the “Enqueued” state, and that transition is atomic itself.