ok, I made some experiments and came to the conclusion that probably Hangfire does not support automatic retry of a suddenly fallen server’ jobs.
If a server managed to place its running jobs into retry queue, then they will be retried.
If server just quits (OOM, electricity, or container shut down) then they will remain in orphaned state with only manual resume possible.
To mitigate this problem, I added the following code into my services startup. It detects orphaned jobs (that is, jobs without active server) and requeues them. So in case one instance of a service fails, and another one starts - it takes care of these jobs.
private static void RequeueOrphanedJobs()
var api = JobStorage.Current.GetMonitoringApi();
var processingJobs = api.ProcessingJobs(0, 100);
var servers = api.Servers();
var orphanJobs = processingJobs.Where(j => !servers.Any(s => s.Name == j.Value.ServerId));
foreach (var orphanJob in orphanJobs)