Handling long-running tasks (+ long invisibility timeout) + server restarts

Hi -

Firstly, as everyone keeps saying, Hangfire is awesome! I’m just running into an issue configuring my set-up.

We have potentially long-running tasks, say 30 minutes to 1.5 hours. We push these onto Hangfire, and all works well for the most part. However, imagine this scenario:


  • Invisibility timeout set to 90 minutes, as tasks can take up to 90 minutes
  • Hangfire server running in a console application

  1. A long-running task is created and launched

  2. The console application is restarted, so we call:


  3. Then we re-launch the server:


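For reference, the elided start/stop calls would presumably be something along these lines (a sketch using the current Hangfire API; the connection string is a placeholder):

```csharp
using System;
using Hangfire;

class Program
{
    static void Main()
    {
        // Point Hangfire at its job storage (placeholder connection string).
        GlobalConfiguration.Configuration
            .UseSqlServerStorage("<connection-string>");

        // Start processing background jobs.
        var server = new BackgroundJobServer();

        Console.WriteLine("Hangfire server started. Press any key to stop.");
        Console.ReadKey();

        // On restart/shutdown: Dispose waits for workers to finish (up to
        // the shutdown timeout) and reports the server as stopped, but it
        // does NOT re-queue jobs that were still mid-processing.
        server.Dispose();
    }
}
```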
However, the original job is still running on the old server. Hangfire shows:

Looks like the job was aborted – it is processed by server ip-0ac4cbb1:11540, which reported its heartbeat more than 1 minute ago. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.

It then shows, after further delay:

The job was aborted – it is processed by server ip-0ac4cbb1:11540 which is not in the active servers list for now. It will be retried automatically after invisibility timeout, but you can also re-queue or delete it manually.

The invisibility timeout is set to 90 minutes, as some jobs might take this long. However, shouldn’t the server dying cause the job to be re-queued? Is there a way to get this behaviour?

If I instead reduce the invisibility timeout, the job could be launched multiple times, which isn’t ideal. If I add a global lock and just return success on subsequent attempts, this could cause an issue if the console app is relaunched after the invisibility timeout (the failed job will never be captured).
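For context, the timeout in question is configured on the SQL Server storage. A minimal sketch, assuming the `SqlServerStorageOptions` API and a placeholder connection string:

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;

// Raise the invisibility timeout so a 90-minute job is not
// considered abandoned (and fetched again) while it is still running.
var options = new SqlServerStorageOptions
{
    InvisibilityTimeout = TimeSpan.FromMinutes(90)
};

GlobalConfiguration.Configuration
    .UseSqlServerStorage("<connection-string>", options);
```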

A little unsure how I’m meant to handle this - help appreciated!

Thanks again for all your great work!

Suggested fix for this behaviour: add a shut-down parameter so I can call:


This will tell Hangfire the server is dead and to re-queue all the jobs that weren’t completed as of its last heartbeat. You then get the normal behaviour if the server were to die or go offline, with the added benefit that for deploys/restarts the lag caused by the invisibility timeout would no longer be an issue; the jobs would simply “migrate” onto the new worker.
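Such a shutdown call might look like this (a hypothetical API shape for the suggestion above, not something Hangfire currently exposes):

```csharp
// HYPOTHETICAL -- this overload does not exist in Hangfire today.
// The idea: report this server as dead immediately and re-queue every
// job it was processing, instead of waiting for the invisibility
// timeout to expire.
server.Stop(requeueUnfinishedJobs: true);
```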

Any thoughts on this? The alternative is to use the heartbeat, but if the worker is at 100% CPU and unresponsive that may not be possible. I’m assuming that is the reason for the current design anyway.

@odinserj - do you have a view on this?

I can’t find a section in the documentation on how to handle new deploys to the server, so I’m not sure what people are doing to mitigate this currently.

Any help appreciated!

@stevensk, sorry for the long delay. Are you using cancellation tokens?

Nope - I was trying to avoid modifying my code drastically, and given that the process dies, I might not be able to react to the cancellation token in the middle of a long-running calculation.

AppHarbor (where this is running) only allows for 5 seconds to clean up when a background worker is being killed (due to deployment).
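For reference, Hangfire’s cooperative cancellation looks roughly like this (a sketch; `DoStep` is a placeholder for the real work):

```csharp
using Hangfire;

public class Calculations
{
    // Hangfire fills in the IJobCancellationToken parameter at runtime.
    // ThrowIfCancellationRequested() throws when the server is shutting
    // down, so the job is aborted and later re-queued.
    public void LongRunningCalculation(IJobCancellationToken cancellationToken)
    {
        for (var step = 0; step < 1000; step++)
        {
            // This only helps if each step is short enough to reach the
            // check before the process is killed, which is the problem
            // with a 5-second shutdown window at 100% CPU.
            cancellationToken.ThrowIfCancellationRequested();
            DoStep(step); // placeholder
        }
    }

    private void DoStep(int step) { /* ... */ }
}
```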

HangFire seems to know the worker is dead now (thanks for the awesome UI updates!), but is there a way to tell HangFire to requeue all jobs assigned to the now dead worker?


@odinserj - to sum up:

  • AppHarbor kills the worker so fast that capturing and processing the cancellation token may not be possible (we have long sections of code that take longer than 5 seconds at 100% CPU).
  • Task-based cancellation tokens are not feasible, given the above.
  • We would need a single process (with its own CPU resource) dedicated to listening for new deploys and telling Hangfire it is shutting down.
  • Deployments happen 10+ times per day, so we need to handle them gracefully.

Should I re-architect so that each task is split into tiny tasks, using the new Continuations feature?

Or do you have a different solution, using the server state in mind / soon to be released?
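If splitting the work up is the way to go, it might look something like this (a sketch using `BackgroundJob.ContinueWith`; `ProcessChunk` and the chunk count are placeholders):

```csharp
using Hangfire;

// Split one 90-minute job into a chain of small continuation steps,
// so at most one small step is lost and retried on a server restart.
var id = BackgroundJob.Enqueue(() => ProcessChunk(0));

for (var chunk = 1; chunk < 10; chunk++)
{
    var c = chunk; // avoid capturing the shared loop variable
    id = BackgroundJob.ContinueWith(id, () => ProcessChunk(c));
}
```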

I’m imagining Hangfire wasn’t initially designed to work with long-running tasks such as ours, but would obviously hugely appreciate if a feature such as:


Could be implemented into the next version.

@odinserj - do you have a view on this? Or whether it’s something that will be in your next release?

Alternatively, is there a way I can do this by accessing the database directly to “re-queue” all the jobs currently running?

So it looks like the beta release has: “Instant Re-Queue for SQL Server”

Do you have any more information around this? Happy to get it up and running, just thought this might give you a platform to explain the process / move to your docs.

Thanks again, and the new update looks awesome!

Background job processing is now performed within a regular SQL Server transaction using repeatable read isolation level. The following query is run:

set transaction isolation level repeatable read
begin transaction

delete top (1) from HangFire.JobQueue with (readpast, updlock)
output DELETED.JobId
where (FetchedAt is null or FetchedAt < DATEADD(second, @timeout, GETUTCDATE()))
and Queue in @queues

-- Perform a job

commit transaction

The delete statement under the repeatable read isolation level blocks the fetched row from being fetched by other workers; the output clause returns the job identifier atomically, without an unnecessary preceding select statement; and the readpast table hint skips rows locked by other transactions, allowing other workers to fetch other rows.

Awesome - that sounds great. Just deployed the upgrade in production…

Had to manually change the schema number in the database, something you might want to be aware of. Error as follows:

Upgrade now complete, just ran it through a test and it failed over so gracefully I shed a small tear.

Thanks for this upgrade, will be a life saver!

@stevensk, thank you for reporting this! I’ve changed the migration to include the index re-creation.

@odinserj No problem, thank you for the excellent product! I’m also seeing lots of errors where Hangfire is requesting assets that aren’t available:

Error[ArgumentException][hangfire/fonts/glyphicons-halflings-regular/woff2] : Resource with name Hangfire.Dashboard.Content.fonts.glyphicons-halflings-regular.woff2 not found in assembly Hangfire.Core, Version=, Culture=neutral, PublicKeyToken=null. :    at Hangfire.Dashboard.EmbeddedResourceDispatcher.WriteResource(IOwinResponse response, Assembly assembly, String resourceName)
   at Hangfire.Dashboard.EmbeddedResourceDispatcher.WriteResponse(IOwinResponse response)
   at Hangfire.Dashboard.EmbeddedResourceDispatcher.Dispatch(RequestDispatcherContext context)
   at Hangfire.Dashboard.MiddlewareExtensions.<>c__DisplayClass6.<>c__DisplayClass8.<UseHangfireDashboard>b__4(IDictionary`2 env)
   at Microsoft.Owin.Mapping.MapMiddleware.<Invoke>d__0.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContextStage.<RunApp>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.<DoFinalWork>d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.StageAsyncResult.End(IAsyncResult ar)
   at Microsoft.Owin.Host.SystemWeb.IntegratedPipeline.IntegratedPipelineContext.EndFinalWork(IAsyncResult ar)
   at System.Web.HttpApplication.AsyncEventExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)

I know this is a pretty old topic at this point, but I’ve been running into a similar issue, so I wanted to share my workaround for jobs that have been aborted due to server shutdown. If automatic retries are disabled (Attempts = 0), or a job fails due to server shutdown and is beyond the maximum number of attempts, you can run into this issue. Unfortunately for us, this was causing new jobs to not start processing until the aborted jobs were either manually deleted or re-queued.

Basically, I took the following approach to automatically handle aborted jobs: during startup and after initializing the BackgroundJobServer, I use the MonitoringApi to get all of the currently processing jobs. If there are any, I loop through each and call BackgroundJob.Requeue(jobId). Here’s the code, for reference:

var monitor = Hangfire.JobStorage.Current.GetMonitoringApi();
if (monitor.ProcessingCount() > 0)
{
    // ProcessingJobs returns (job id, details) pairs; re-queue every job
    // left in the Processing state by the previous (now dead) server.
    foreach (var job in monitor.ProcessingJobs(0, (int)monitor.ProcessingCount()))
    {
        BackgroundJob.Requeue(job.Key);
    }
}