Hangfire Discussion

Job retry while job is running with small delay


#1

Today one of our jobs was retried even though it was still running.
The second attempt started only 3 minutes after the first, so two instances of the same job ended up running at the same time.

On the dashboard and in the database, this looked like any other “job was interrupted, so start it again” case: one “Enqueued” entry, several “Processing” entries and one “Succeeded” entry.
However, this was not caused by an application pool or server crash. In fact, the server ID recorded for the job was identical across attempts and only the worker ID differed, which indicates that the Hangfire server itself was not interrupted.

All Hangfire settings are at their defaults, except that each IIS site instance has its own queue and AutomaticRetryAttribute is set to 0 attempts.
I should also add that we only use the Enqueue call to execute our jobs immediately. At this point we aren’t doing anything fancy like scheduling jobs to run later or chaining them to run after another job.

Has anyone else experienced this and/or fixed it?

As far as we’re aware, this hasn’t happened before. In fact, the only reason we noticed it this time is an admittedly poor legacy design choice, which produced incorrect data output when the two processes overlapped.


#2

OK, I’m going to reply to my own post in case someone else comes across this.

The key piece of information I had overlooked was that the server ID was the same but the worker IDs were different. Many thanks to odinserj for getting in touch about this.

As it turns out, we have been having some minor network blips between the front-end server (where Hangfire executes) and the back-end server (where the SQL database resides). During a blip, Hangfire concludes that the job has failed and hands it to a free worker. Meanwhile, the “hung” process reconnects and keeps going, so the two executions overlap.
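To make the failure mode above concrete, here is a toy model in Python. It is a sketch under stated assumptions, not Hangfire’s actual implementation: it assumes the storage keeps a fetched job “in flight” on the fetching worker’s database connection and makes it visible in the queue again if that connection is lost before the job is committed. `ToyQueue`, `Connection` and `simulate` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Hypothetical storage connection owned by one worker."""
    worker_id: str


@dataclass
class ToyQueue:
    """Toy queue (an assumption, not Hangfire's code): a fetched job stays
    attached to the fetching connection until it is committed."""
    visible: list = field(default_factory=list)   # jobs any worker may fetch
    in_flight: dict = field(default_factory=dict)  # job_id -> Connection holding it

    def enqueue(self, job_id: str) -> None:
        self.visible.append(job_id)

    def fetch(self, conn: Connection):
        # Take the next visible job and attach it to this connection.
        if not self.visible:
            return None
        job_id = self.visible.pop(0)
        self.in_flight[job_id] = conn
        return job_id

    def commit(self, job_id: str, conn: Connection) -> None:
        # Normal completion: the job is removed for good.
        if self.in_flight.get(job_id) is conn:
            del self.in_flight[job_id]

    def connection_lost(self, conn: Connection) -> None:
        # Storage side of a network blip: anything held by the dropped
        # connection becomes visible again, so another worker can fetch it.
        for job_id, holder in list(self.in_flight.items()):
            if holder is conn:
                del self.in_flight[job_id]
                self.visible.append(job_id)


def simulate():
    """Returns the (worker_id, job_id) pairs that end up executing the job."""
    queue = ToyQueue()
    queue.enqueue("job-1")
    running = []

    conn_a = Connection("worker-A")
    running.append((conn_a.worker_id, queue.fetch(conn_a)))

    # Blip: the database drops worker A's connection, but worker A's thread
    # is still executing the job body and carries on regardless.
    queue.connection_lost(conn_a)

    # Same server, different worker: the job is picked up a second time.
    conn_b = Connection("worker-B")
    job_again = queue.fetch(conn_b)
    if job_again is not None:
        running.append((conn_b.worker_id, job_again))
    return running


print(simulate())
```

In this model the same job ends up running on two workers of the same server, matching what the dashboard showed: one “Enqueued” entry and several “Processing” entries with the same server ID but different worker IDs. Because AutomaticRetryAttribute only applies once a job has failed, it would not prevent this kind of re-fetch, assuming the model reflects the real behaviour.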

So, as far as I’m concerned, this is not a Hangfire problem; it’s caused by our own servers.