Graceful shutdown does not release invisibility on jobs that have been fetched but not processed

This is on Hangfire 1.7.28 using the SQL Server storage on Azure SQL.
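
For context, here is a minimal sketch of the kind of storage setup we use (the connection string and option values are placeholders, not our exact configuration); with SlidingInvisibilityTimeout at 5 minutes, which I understand is the 1.7 default, the delay lines up with what we observe.

```csharp
// Illustrative storage setup only; connection string and values are placeholders.
using System;
using Hangfire;
using Hangfire.SqlServer;

public static class HangfireStorageSketch
{
    public static void Configure()
    {
        GlobalConfiguration.Configuration
            .UseSqlServerStorage(
                "<azure-sql-connection-string>",
                new SqlServerStorageOptions
                {
                    // A fetched job stays invisible to other servers for this long
                    // unless the fetching server removes it or requeues it.
                    SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
                    // Recommended setting from the docs; not related to the issue.
                    QueuePollInterval = TimeSpan.Zero
                });
    }
}
```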

Since moving to Kubernetes with autoscaling, we have noticed in our Hangfire service project that we receive alerts about stalled processing while pods are scaling down. These alerts occur near the end of our job runs, which is when the cluster starts to scale down.
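
Each pod hosts a Hangfire server in a .NET generic host along these lines (a simplified sketch with illustrative names and values, not our exact code); on scale-down Kubernetes sends SIGTERM, the host begins shutdown, and the server is expected to stop its dispatchers and release any job it has fetched but not completed.

```csharp
// Simplified host sketch; class name, worker count, and timeout are illustrative.
using System;
using Hangfire;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

public static class Program
{
    public static void Main(string[] args)
    {
        Host.CreateDefaultBuilder(args)
            .ConfigureServices(services =>
            {
                // Storage registration (see the storage sketch above).
                services.AddHangfire(config => config
                    .UseSqlServerStorage("<azure-sql-connection-string>"));

                // The hosted server stops when the host shuts down (SIGTERM on scale-down).
                services.AddHangfireServer(options =>
                {
                    options.WorkerCount = 20;                           // illustrative
                    options.ShutdownTimeout = TimeSpan.FromSeconds(15); // time allowed for graceful stop
                });
            })
            .Build()
            .Run();
    }
}
```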

With our current logic we process jobs roughly as fast as we can insert them, and when we look into these alerts we find a single job, created right at the point of scale-down, that takes just over 5 minutes from creation to being processed. It's also very rare: we might see 1 job out of 100k do this.

It looks like the server that is shutting down fetches the job, shuts down, and never releases it so another server can pick it up. The job is eventually picked up after 5 minutes, once the invisibility timeout expires.
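
To clarify what I mean by "releasing" the job, here is a simplified sketch of the storage-level fetch/release contract (not the actual Worker code; the queue name and connection string are placeholders): a fetched job only becomes visible to other servers again when it is removed, requeued, or the invisibility timeout lapses.

```csharp
// Simplified illustration of the fetch/release contract (not the real Worker code).
using System;
using System.Threading;
using Hangfire.SqlServer;
using Hangfire.Storage;

public static class FetchSketch
{
    public static void FetchAndRelease(CancellationToken shutdownToken)
    {
        var storage = new SqlServerStorage("<azure-sql-connection-string>");

        using (IStorageConnection connection = storage.GetConnection())
        {
            // Stamps FetchedAt on the queue row; other servers now skip this job.
            IFetchedJob fetched = connection.FetchNextJob(new[] { "default" }, shutdownToken);

            try
            {
                shutdownToken.ThrowIfCancellationRequested();
                // ... perform the job ...
                fetched.RemoveFromQueue();   // success: delete the queue row
            }
            catch (OperationCanceledException)
            {
                fetched.Requeue();           // shutdown: make the job visible again immediately
                throw;
            }
            finally
            {
                fetched.Dispose();
            }
        }
    }
}
```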

What’s strange is that, looking at the code, this case appears to be handled. I see logs for two of our Kubernetes pods receiving the stop signal and logs confirming they stopped, but there are no warnings about jobs being requeued or requeuing having failed.

Timeline of events:

08:45:14.077 - Server 2 receives stopping signal
08:45:14.083 - Server 2 receives stopped signal
08:45:14.094 - Server 3 receives stopping signal
08:45:14.105 - Server 3 receives stopped signal
08:45:14.191 - Job created on server 1
08:45:14.224 - Server 2 reports all dispatchers stopped
08:45:14.233 - Server 2 successfully reported itself as stopped
08:45:14.427 - Server 3 reports all dispatchers stopped
08:45:14.437 - Server 3 successfully reported itself as stopped
08:50:14.256 - Job begins being processed on server 4