Hangfire stops fetching jobs

Hi,

We are using Hangfire in production on a cluster of 5 servers. In one Windows service we start 3 Hangfire server instances, and each instance works with a single queue.
When the servers are very busy (CPU and bandwidth) processing jobs, sometimes one of the queues stops processing jobs.
None of the servers will fetch a job from that queue, while the other queues within the same Windows service continue to process jobs.

If we restart one service, all the servers start to process jobs from that specific queue again. We use Redis. We get the feeling that there is some kind of semaphore in Redis blocking the servers from fetching jobs from that queue.

Can you help me figure out what the problem could be? We have already upgraded to the latest Hangfire and Redis versions.

Herman

We had a similar problem, but no solution yet. You can see the details at
http://hangfire.discourse.group/t/distrubuted-lock-on-sql-server-is-never-released/1527/2

Hi,

I thought it would be different on Redis, but we have exactly the same issue. We also upgraded from 1.1 to 1.5.x; before the upgrade there were no issues, but now there are problems.

Hope to hear from you very soon. This is a big problem!

Do you use logging, and are there any exceptions? The Worker class, which processes background jobs in the Enqueued state, applies distributed locks only to individual background jobs; there are no queue-level locks at all. Are other background jobs processed immediately after a service restart? If the problem were caused by distributed locks, there would be a delay of several minutes.

@hermandejager, can you send me the output of the following Redis command, to learn more about locked resources, when you are experiencing the issue?

KEYS *lock*
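
If that command returns any keys, their remaining lifetime is also worth checking; TTL is a standard Redis command, and a result of -1 means the key has no expiration set (here <key> is a placeholder for one of the key names returned above):

TTL <key>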

UPD. Do you have any custom filters?
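
For context on why filters matter here: a server filter that wraps job execution in a distributed lock is a common source of this kind of stall. The sketch below is purely illustrative (the class name, lock resource, and five-minute timeout are hypothetical, not taken from Herman's code); it holds a lock that is only released in OnPerformed, so if the process dies or the release line never runs, every later job competing for the same resource waits until the lock times out:

using System;
using Hangfire.Common;
using Hangfire.Server;

public class SerializedExecutionFilter : JobFilterAttribute, IServerFilter
{
    public void OnPerforming(PerformingContext context)
    {
        // Hypothetical resource name: one lock per job type.
        var resource = context.BackgroundJob.Job.Type.FullName;

        // Stash the lock handle so OnPerformed can release it.
        context.Items["DistributedLock"] =
            context.Connection.AcquireDistributedLock(resource, TimeSpan.FromMinutes(5));
    }

    public void OnPerformed(PerformedContext context)
    {
        // If this line is never reached (crash, killed process), the lock
        // stays held, and other jobs block until the storage releases it.
        ((IDisposable)context.Items["DistributedLock"]).Dispose();
    }
}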

Can you also create a dump file (Task Manager -> Right click on a process -> Create dump file), archive it and send it via email to support@hangfire.io or share it through Dropbox, Google Drive, etc.?

Hi,

Yes, we have custom filters.
Besides that, we are now changing our implementation from one instance per queue to multiple queues in one instance, with this code:

var processes = new List<IBackgroundProcess>
{
    new Worker("default"),
    new DelayedJobScheduler(),
    new RecurringJobScheduler()
};

You are describing:
Want 3 workers listening the default queue and 7 listening the critical queue? No problem. Don’t want to use recurring job scheduler on some instances? You can do this! Just pass the processes you need:

Can you give us an example of how the code should look with 3 queues?

queue     workers
default   10
critical  2
normal    5

When this works we can cut out 2 Hangfire instances in our Windows service.

Yes, when I reboot one service, the specific queue processes jobs again.

Yes, I will do that when it happens again. The files are large (8 GB); I will RAR them and send them.
We have a specific busy time around 10:00; normally it happens then.

Can you send me the source code of your filters? Sometimes they cause problems. Here is the sample code for your configuration. I will wait for the dump.

using System;
using System.Collections.Generic;
using System.Linq;
using Hangfire.Server;

// Shared background processes; per-queue workers are added below.
var processes = new List<IBackgroundProcess>();
processes.Add(new DelayedJobScheduler());
processes.Add(new RecurringJobScheduler());

var queues = new Dictionary<string, int>
{
    { "default", 10 },
    { "critical", 2 },
    { "normal", 5 }
};

foreach (var queue in queues)
{
    // One Worker instance per desired worker for this queue.
    for (var i = 0; i < queue.Value; i++)
    {
        processes.Add(new Worker(queue.Key));
    }
}

// Server-level properties: the queue list and the total worker count
// this server announces (shown on the dashboard's Servers page).
var properties = new Dictionary<string, object>
{
    { "Queues", queues.Keys.ToArray() },
    { "WorkerCount", queues.Values.Sum() }
};

using (new BackgroundProcessingServer(processes, properties))
{
    Console.ReadLine();
}
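
For contrast, and to show why the explicit process list is needed: with the stock BackgroundJobServer, every worker listens on all of the configured queues in the order given, so a fixed per-queue worker split like the one above is not possible there. A minimal sketch, assuming the same three queues and Hangfire 1.5.x:

var options = new BackgroundJobServerOptions
{
    // Every one of the 17 workers fetches from all three queues.
    Queues = new[] { "default", "critical", "normal" },
    WorkerCount = 17
};

using (new BackgroundJobServer(options))
{
    Console.ReadLine();
}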

Hi,

We have not had this problem since we started using your code with multiple queues and different worker counts.
Thanks so much for your help so far, Sergey!