Recurring jobs stop working after a while

Tags: recurring

#1

Hi guys,

I know this is a known issue, but I tried a lot of things and nothing worked for me.

I am using ASP.NET Core 2.2 on Azure Kubernetes (Linux) behind nginx. The problem is that after a while the recurring jobs stop working.

Do you have any idea how to prevent recurring jobs from stopping in this scenario?

Thanks a lot


#2

What is your current setup?

  • Is the dashboard separated from the workers?
  • Do you have different workers that run on different queues?
  • What storage provider are you using?
  • Are all your workers on the same version of Hangfire?

Have you tried adding logging to your Hangfire instance (both dashboard and workers; see the documentation) to see whether any error occurred during worker execution?

I had this problem recently too. In my case I had 2 workers, each with different jobs, and neither had the other's job interfaces. The catch is that when a recurring job is due to be put in a queue, any worker can pick it up. When the wrong worker picked up the recurring job, it threw an exception (because the interfaces didn't exist in that worker) and the job failed silently: if a job fails when being picked up, Hangfire marks it as an invalid job in the database and stops running it at the configured interval. After a while the dashboard showed “Next execution” as “a day ago”, which clearly doesn't make sense.

The solution to my problem was to create an interface project that contains ALL the interfaces of ALL the jobs. That way any worker can pick up jobs from the database and put them in the right queue.


#3
  • Is the dashboard separated from the workers?
    R: I am using Docker with several nodes and different applications sharing the same SQL Server database.

  • Do you have different workers that run on different queues?
    R: Yes, each application (microservice) uses one queue (sharing the same database).

  • What storage provider are you using?
    R: Azure SQL Server

  • Are all your workers on the same version of Hangfire?
    R: Yes

Have you tried adding logging to your Hangfire instance (both dashboard and workers; see the documentation) to see whether any error occurred during worker execution?
R: No, but thanks, I will check and post here later.

I had this problem recently too. In my case I had 2 workers, each with different jobs, and neither had the other's job interfaces. The catch is that when a recurring job is due to be put in a queue, any worker can pick it up. When the wrong worker picked up the recurring job, it threw an exception (because the interfaces didn't exist in that worker) and the job failed silently: if a job fails when being picked up, Hangfire marks it as an invalid job in the database and stops running it at the configured interval. After a while the dashboard showed “Next execution” as “a day ago”, which clearly doesn't make sense.
R: Hmm, I think that is my issue. I am using Docker, and for each queue (app) I have 2 nodes :\

The solution to my problem was to create an interface project that contains ALL the interfaces of ALL the jobs. That way any worker can pick up jobs from the database and put them in the right queue.
R: Sorry, I didn't get that. Do you think it would work even in my case, using Docker?

Thanks a lot for your help


#4

I am also using Docker, and the issue isn't coming from there.

If you go into your Hangfire database and check the “Set” table, you should see your recurring job. If its “Score” is -1, it is not going to be picked up by your workers, and that's your problem. The score is set to -1 when there's an error while queuing the job (I think).

You'll have to create a csproj containing all the interfaces of your jobs. Then reference that csproj from all your worker projects and use the interfaces to enqueue the jobs. That way, every worker “knows” the jobs of the other workers and can properly put them in a queue.

Example of a recurring job flow: the job starts in the database. It can then be picked up by ANY worker and put into a queue. After that, only the workers of that queue process the job.

The problem arises because your worker doesn't have the assembly required to properly enqueue the other worker's job (Hangfire looks the job up by reflection). That's also why the job stops working seemingly at random. If the job is picked up by the right worker, that worker has the assembly and can queue it properly. If the job ends up in the wrong worker, it silently fails and is never requeued.
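
To make the shared-interface idea concrete, here is a minimal sketch. It assumes Hangfire 1.7 (the `AddOrUpdate` overload that takes the queue name as the last argument) and that storage is already configured. The project name `Jobs.Contracts`, the interface `IReportJob`, the job id `daily-report` and the queue `reports` are all made up for illustration.

```csharp
using System;
using Hangfire;

// Hypothetical shared project (e.g. Jobs.Contracts.csproj) that EVERY worker references,
// so any worker can resolve the job type when the recurring job becomes due.
namespace Jobs.Contracts
{
    // Hypothetical job interface; the implementation lives only in the worker that owns it.
    public interface IReportJob
    {
        void Run();
    }

    public static class RecurringJobRegistration
    {
        // Call once at startup, after Hangfire storage has been configured.
        public static void Register()
        {
            RecurringJob.AddOrUpdate<IReportJob>(
                "daily-report",      // recurring job id (hypothetical)
                job => job.Run(),    // expressed against the shared interface
                Cron.Daily(),        // schedule
                TimeZoneInfo.Utc,    // time zone used to evaluate the cron expression
                "reports");          // queue handled by the owning worker (hypothetical)
        }
    }
}
```

Only the worker listening on the `reports` queue needs an actual implementation of `IReportJob` registered in its container. The other workers just need the interface assembly so the job type can be found.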


#5

Hi Steven,

Thanks, that helps a lot. I have just one more question: when using Docker with two or more nodes, will Hangfire still work properly? The jobs will be in the same queue, and my concern is that Hangfire could execute the same job twice. Is that right?


#6

Hangfire locks a job when it starts processing it, so it won't be executed by every worker.

Just be careful when a job takes more than 30 minutes, as the lock sometimes expires and another worker starts processing the same job. You can check your storage options and increase that timeout if you need to.
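
As a rough sketch of where such a timeout could be set with the SQL Server storage: which option applies depends on your Hangfire.SqlServer version, so treat the property choice as an assumption. This uses `SlidingInvisibilityTimeout`; older setups may rely on `InvisibilityTimeout` instead. The class name, method name and the 60-minute value are placeholders.

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;
using Microsoft.Extensions.DependencyInjection;

public static class HangfireSetup
{
    // Hypothetical helper, called from Startup.ConfigureServices.
    public static void AddHangfireWithLongerTimeout(
        IServiceCollection services, string connectionString)
    {
        var storageOptions = new SqlServerStorageOptions
        {
            // Assumption: this is the timeout that governs how long a fetched job
            // stays invisible to other workers in your version. 60 minutes is a placeholder.
            SlidingInvisibilityTimeout = TimeSpan.FromMinutes(60),
        };

        services.AddHangfire(config =>
            config.UseSqlServerStorage(connectionString, storageOptions));
    }
}
```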


#7

Hi Steven,

I set up logging using Azure Application Insights and did not get any errors. I am wondering whether I need to set up something on my server, nginx, Docker, etc.

Now I am not sure whether it is an error or just a configuration issue, such as keeping the application “Always running”.

Did you face something like that?