Hangfire.Pro.Redis does not work with ElastiCache (Cluster Mode Enabled)

redis

#1

Hi Team,
Is anybody using Hangfire with ElastiCache in Cluster Mode Enabled? It gets stuck after some time, while Non-Cluster Mode works fine.

Thanks,
Gheri.


#2

What version of Hangfire.Pro.Redis are you using?


#3

Hangfire.Pro.Redis == 2.5.1
Hangfire.Pro == 2.2.0
Hangfire.Core == 1.7.6 (latest as of now)


#4

Looks like the hash tag is not specified in the RedisStorageOptions.Prefix property, as described in the docs – https://docs.hangfire.io/en/latest/configuration/using-redis.html#redis-cluster-support. Am I right?


#5

Hi Team,
Thanks for the information. To make sure we are on the same page: I have been using {hangfire-app}: as the prefix, and my connection string is the cluster configuration endpoint of ElastiCache (Cluster Mode Enabled).
I have been analyzing this and found some interesting facts:

  1. Some of the Redis clients (in this case the Hangfire servers) are unsubscribed from the server, although they are subscribed in the working scenario.
  2. It occurs intermittently (not always).
  3. One way to reproduce it is to recycle the app pool, although that does not reproduce the issue consistently either.

Thanks,


#6

No, it’s not a prefix for the connection string itself, it’s a prefix for all the Hangfire-related keys, and you can configure it in the following way as written in the article referenced above:

GlobalConfiguration.Configuration.UseRedisStorage(
    "localhost:6379,localhost:6380,localhost:6381",
    new RedisStorageOptions { Prefix = "{hangfire-1}:" });

Regarding unsubscriptions, could you add more details? Unfortunately, I don’t understand what you mean by subscriptions. But first, please ensure you are on the latest version, which is 2.5.2 at the moment.

The latest version fixes a bug where workers were unable to fetch background jobs if the queue was initially empty when a failover occurred and not all the cluster nodes were referenced in the connection string.


#7

Thanks for the information. I will definitely look into the latest version.
Regarding the observations:
I have observed that whenever a job is enqueued, Hangfire publishes to the Redis channel {hangfire}:queue:default:events, where “default” is the queue name.
In the working scenario, I can see that Redis clients are subscribed to these channels and everything works fine.
In the failing scenario, when a job is enqueued, Hangfire publishes to the Redis channel, but there are no subscribers on this channel, and hence no processing happens in Hangfire.

In ElastiCache we have a cluster configuration endpoint as well as an endpoint for every node.
Are you suggesting that I pass all node endpoints to Hangfire, or is it fine to use only the cluster endpoint as the connection string?

If you need more information please let me know.

Thanks,


#8

Hi Team,
I am running my app with the new Hangfire.Pro.Redis [2.5.2] against ElastiCache in cluster mode.
Previously it got stuck whenever the AppPool was recycled, but now it has been working fine for three days.
I will update you if it gets stuck again in the future.
Many thanks!!! @odinserj

Thanks,
Gheri.


#9

Hi @odinserj,
Thanks for the solution, it is working with the new version.
I want benchmark results for Hangfire SQL vs Redis to prove to the stakeholders that Redis is faster.
Can you please help me find some test results for Hangfire.Pro.Redis that show it is much faster than SQL?

Thanks,
Gheri.


#10

The simplest way is to copy-paste the sample program, run it against both Redis and SQL Server storages and compare the results. Type the following command when running the program to create a corresponding number of jobs. Please note that it will create and process background jobs in parallel.

fast 100000

#11

Hi @odinserj,
We have been using ElastiCache Redis (Cluster Mode Enabled) with Hangfire.Pro.Redis. It was running fine, but now it is throwing "OOM command not allowed when used memory > ‘maxmemory’" on the cluster nodes. We then did online horizontal scaling by adding two more nodes, but it still gives the same memory errors, and Hangfire fails to create jobs.

What is the best practice to prevent these errors?

We are using the Hangfire versions below.
Hangfire.Core = 1.7.6
Hangfire.Pro = 2.2.1
Hangfire.Pro.Redis = 2.5.2

Thanks,
Pooja


#13

Simple horizontal scaling doesn’t work with Hangfire, because Redis doesn’t support transactions that span multiple hash tags, since the keys might be on different machines. In other words, it doesn’t support distributed transactions, and those don’t work well in other storages either, even in 2019.

And even with other storage abstractions, which aren’t available currently, it would be difficult to use this feature, because Redis may lose some confirmed writes during an unexpected failover.

So for now, the best practice is to monitor the current working set and either stop accepting new work when the memory limit is about to be reached, or increase the amount of RAM if you are in a cloud environment.
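As a rough illustration of the monitoring idea, here is a minimal sketch that parses the text of the Redis INFO memory section and decides whether new work should still be accepted. How the INFO text is obtained (redis-cli, a Redis client library, and so on) is left out, and the RedisMemoryGuard name and the 0.8 threshold are assumptions made for the example, not Hangfire features.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RedisMemoryGuard
{
    // Parses "name:value" lines of a Redis INFO reply, skipping blank lines and "# Section" headers.
    public static Dictionary<string, string> ParseInfo(string info)
    {
        var fields = new Dictionary<string, string>();
        foreach (var rawLine in info.Split('\n'))
        {
            var line = rawLine.Trim(); // also strips the trailing '\r'
            if (line.Length == 0 || line.StartsWith("#")) continue;

            int colon = line.IndexOf(':');
            if (colon > 0)
            {
                fields[line.Substring(0, colon)] = line.Substring(colon + 1);
            }
        }
        return fields;
    }

    // Returns false once used_memory / maxmemory reaches the given threshold (hypothetical policy).
    public static bool ShouldAcceptNewJobs(string info, double threshold)
    {
        var fields = ParseInfo(info);
        long used = long.Parse(fields["used_memory"], CultureInfo.InvariantCulture);
        long max = long.Parse(fields["maxmemory"], CultureInfo.InvariantCulture);

        if (max == 0) return true; // maxmemory of 0 means no limit is configured
        return (double)used / max < threshold;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample values, for illustration only.
        string info = "# Memory\r\nused_memory:900\r\nmaxmemory:1000\r\n";
        Console.WriteLine(RedisMemoryGuard.ShouldAcceptNewJobs(info, 0.8));
    }
}
```

The application would call something like ShouldAcceptNewJobs before enqueueing and reject or delay the request when it returns false.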

Almost always, some memory will be freed once expiring background jobs expire, and you can speed up the process by using a max memory policy other than noeviction. With Hangfire.Pro.Redis there will be no consistency violation in this case.


#14

@odinserj Thanks for the reply.
One of the solutions we were thinking of: instead of passing the cluster endpoint to Redis storage, we pass all the nodes. We can then programmatically monitor the memory of all cluster nodes and remove any node that is almost full.
For example:
Initially we create the storage as RedisStorage("node1,node2,node3,node4");
After some time, we observe dynamically that node1 has reached its memory limit,
so we remove node1 from the connection string, and the new connection string for Redis storage becomes "node2,node3,node4".

Also, the Hangfire documentation recommends setting the max-memory policy to noeviction, so is it safe to set other options for max-memory-policy?

Also, since we are using curly braces, as in {hangfire-app}:, for the Hangfire prefix, will horizontal scaling still be a problem, given that the curly braces prevent the transaction problems?

Thanks,


#15

As I’ve said earlier, the node count doesn’t matter, because you are required to use a hash-tagged prefix (like {hangfire}:), which forces all the keys to reside in a single hash slot on a single server.
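To illustrate why a hash tag pins everything to one slot, here is a minimal sketch of the key-to-slot mapping from the Redis Cluster specification: CRC16 (XMODEM variant) of the key modulo 16384, where only the text between the first { and the next } is hashed when that text is non-empty. The RedisHashSlot class and the sample key names are made up for the example and are not Hangfire's actual key layout.

```csharp
using System;
using System.Text;

public static class RedisHashSlot
{
    // CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0.
    private static int Crc16(byte[] data)
    {
        int crc = 0;
        foreach (byte b in data)
        {
            crc ^= b << 8;
            for (int i = 0; i < 8; i++)
            {
                crc = (crc & 0x8000) != 0 ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF; // keep the value within 16 bits
            }
        }
        return crc;
    }

    // If the key contains "{...}" with a non-empty body, only that body is hashed.
    public static int Compute(string key)
    {
        int open = key.IndexOf('{');
        if (open >= 0)
        {
            int close = key.IndexOf('}', open + 1);
            if (close > open + 1)
            {
                key = key.Substring(open + 1, close - open - 1);
            }
        }
        return Crc16(Encoding.UTF8.GetBytes(key)) % 16384;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical key names sharing the same hash tag.
        int a = RedisHashSlot.Compute("{hangfire}:queue:default");
        int b = RedisHashSlot.Compute("{hangfire}:job:42");

        // Both keys hash only the text "hangfire", so they land in the same slot.
        Console.WriteLine(a == b);
    }
}
```

Since every key under the prefix maps to the same slot, adding nodes to the cluster gives those keys no extra memory to spread into.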

Instead you can use multiple RedisStorage instances and construct BackgroundJobClient, RecurringJobManager and other client-related (along with Dashboard UI) classes with a reference to this or that storage. But you’ll need to cover all of them with dedicated background job servers to process all the background jobs.
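A rough sketch of that manual setup could look like the following. It assumes two independent Redis instances at the made-up endpoints redis-a:6379 and redis-b:6379, assumes that RedisStorage accepts a connection string plus RedisStorageOptions in the same way UseRedisStorage does, and uses a naive round-robin as a stand-in for real load balancing. Treat it as an outline to adapt, not a tested configuration.

```csharp
using System;
using Hangfire;
using Hangfire.Pro.Redis;

public static class Program
{
    public static void Main()
    {
        // Hypothetical endpoints: one storage per independent Redis instance,
        // each with its own hash-tagged prefix.
        var storages = new JobStorage[]
        {
            new RedisStorage("redis-a:6379", new RedisStorageOptions { Prefix = "{hangfire-a}:" }),
            new RedisStorage("redis-b:6379", new RedisStorageOptions { Prefix = "{hangfire-b}:" })
        };

        var clients = new IBackgroundJobClient[storages.Length];
        var servers = new BackgroundJobServer[storages.Length];

        for (int i = 0; i < storages.Length; i++)
        {
            // Each storage gets its own client and its own dedicated server,
            // otherwise jobs created in that storage would never be processed.
            clients[i] = new BackgroundJobClient(storages[i]);
            servers[i] = new BackgroundJobServer(new BackgroundJobServerOptions(), storages[i]);
        }

        // Naive round-robin distribution (illustrative only).
        for (int n = 0; n < 10; n++)
        {
            clients[n % clients.Length].Enqueue(() => Console.WriteLine("Hello from Hangfire"));
        }

        Console.WriteLine("Press Enter to stop the servers.");
        Console.ReadLine();

        foreach (var server in servers)
        {
            server.Dispose();
        }
    }
}
```

Recurring jobs and the Dashboard UI would need the same treatment, each being constructed against a specific storage.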

I’m planning to release a new package that handles automatic load balancing, with better dashboard support for multiple storages, but I haven’t scheduled this feature yet, so for now it has to be implemented manually.

Regarding the max-memory-policy, with Hangfire.Pro.Redis 2.5.X and later it’s safe to use any option, but eviction may cause unexpected job retries when active distributed locks or servers are affected.


#17

Hi @odinserj,
We are getting a lot of Redis timeout exceptions in Hangfire code.
We are using the Hangfire versions below.
Hangfire.Core = 1.7.6
Hangfire.Pro = 2.2.1
Hangfire.Pro.Redis = 2.5.2

Here is the stack trace:
StackExchange.Redis.RedisTimeoutException: Timeout performing TIME, inst: 0, mgr: Inactive, err: never, queue: 40, qu: 17, qs: 23, qc: 0, wr: 1, wq: 1, in: 1167, ar: 0, clientName: Hangfire@WIN-T62NIQS1PNF, serverEndpoint: Unspecified/hangfire-cluster-0001-001.hangfire-cluster.vchgsh.euw1.cache.amazonaws.com:6379, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=153,Free=32614,Min=4,Max=32767) (Please take a look at this article for some common client-side issues that can cause timeouts: http://stackexchange.github.io/StackExchange.Redis/Timeouts)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
at StackExchange.Redis.RedisServer.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
at Hangfire.Pro.Redis.RedisConnection.TryGetServerTime(DateTime& now, String& reason)
at Hangfire.Pro.Redis.RedisConnection.Heartbeat(String serverId)
at Hangfire.Server.ServerHeartbeatProcess.Execute(BackgroundProcessContext context)