Hangfire.Pro.Redis 2.8.9 dashboard failed jobs page very slow to load

Hi,

We have an issue where listing failed jobs on the /hangfire/jobs/failed page is extremely slow, so much so that asking for 20 jobs instead of 10 causes an HTTP request timeout.

We are running Hangfire 1.7.23, Hangfire.Pro 2.3.0 and Hangfire.Pro.Redis 2.8.9 on .NET 5.0 and .NET 6.0 preview 5 (both stacks have been tested independently and show the same issue), with a Redis cluster on AWS ElastiCache.
It runs a master and 2 read replicas, and their monitoring does not show anything abnormal other than the following.
When I access the failed jobs page, I can see that it triggers a bunch of cache misses on the Redis server.

Could that be the reason it is slow?

The Redis server reports that it responds to all requests within 10 microseconds, but if the Hangfire dashboard has a timeout or some retry logic looking for records that no longer exist, could that be causing the slowdown?
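
For reference, my understanding is that the failed jobs page reads its data through Hangfire's monitoring API, so a rough sketch like the one below (the page size of 20 is just an example) should let me time the same query outside the dashboard:

    // Sketch only: times the same monitoring-API call the /jobs/failed page makes.
    // Assumes Hangfire storage is already configured in this process.
    using System;
    using System.Diagnostics;
    using Hangfire;

    var sw = Stopwatch.StartNew();
    var monitoring = JobStorage.Current.GetMonitoringApi();
    var failedJobs = monitoring.FailedJobs(from: 0, count: 20);
    sw.Stop();

    Console.WriteLine($"Fetched {failedJobs.Count} failed jobs in {sw.ElapsedMilliseconds} ms");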

Hi Matt. Are you running on Windows or Linux/macOS? Also, where do your master and read replicas reside – in the same region as your app and other nodes, or in different regions for disaster recovery reasons?

Hi odinserj,

The machines running the dashboard are Windows environments on AWS Elastic Beanstalk, and the Redis machines are Linux-based via the AWS ElastiCache solution.

All environments are running in eu-west-1 (Ireland).

I should also mention that using the dashboard to look at enqueued, processing, succeeded and deleted jobs is still very fast.
The only pages that slow waaaay down are the failed jobs overview and the individual failed job pages.

But I should say that this slowdown is recent. It started within the last couple of months, simultaneously and independently on our development and production environments, which leads me to believe that a package update might be the cause.

Thanks for the information. Can you post an example of a stack trace from some job in the failed state here? I’d like to see its size to understand whether it could cause the issue – however, recent Hangfire.Core versions are able to truncate long stack traces to avoid this problem.

Another possible reason is IOCP thread contention if you are using the System.Net.Http.HttpClient package. Some months ago there was an update that caused IOCP threads to start executing user code, so it is possible to block them when responses are large. I’ve just released Hangfire.Pro.Redis 2.8.10 to always use dedicated threads, even on Windows, to avoid such an issue.

Please try to upgrade to the newest version and let me know if the issue persists – in that case it will be useful to run the stdump utility while trying to load 50 failed jobs, to understand what the threads in your application are doing and identify the slowdown.

This is an example of a page that loads extremely slowly.

The stack trace that appears 3 times is this:

RechargeSharp.Entities.Exceptions.NotFoundException

{"errors":"Not Found"}

RechargeSharp.Entities.Exceptions.NotFoundException: {"errors":"Not Found"}

   at RechargeSharp.Utilities.ExceptionUtility.ThrowIfSuitableExceptionFound(HttpResponseMessage responseMessage) in RechargeSharp.dll:token 0x6000005+0x1a5
   at RechargeSharp.Services.RechargeSharpService.HandleHttpResponseMessage(HttpResponseMessage httpResponseMessage) in RechargeSharp.dll:token 0x6000012+0x17
   at System.Linq.Enumerable.Any[TSource](IEnumerable`1 source, Func`2 predicate) in System.Linq.dll:token 0x6000044+0x47
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext) in Polly.dll:token 0x6000372+0x0
   at Polly.AsyncPolicy`1.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext) in Polly.dll:token 0x600003a+0x12b
   at RechargeSharp.Services.Customers.CustomerService.CustomerExistsAsync(Int64 id) in RechargeSharp.dll:token 0x6000074+0x82
   at Vuffeli.Net.Worker.Library.Handlers.CampaignMonitorListHandler.UpdateNewCustomersAsync(Int64 customerId) in Vuffeli.Net.Worker.Library.dll:token 0x60001c9+0xed
   at Vuffeli.Net.Worker.Library.Handlers.CampaignMonitorListHandler.UpdateNewCustomersAsync(Int64 customerId) in Vuffeli.Net.Worker.Library.dll:token 0x60001c9+0xc94
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) in System.Private.CoreLib.dll:token 0x6004adb+0x28

Loading the page took 15.04 seconds. If you want, I can send you a HAR file where you can see everything about the HTTP request.

Also, I can’t see the new Hangfire.Pro.Redis package yet, but when I can I will try it and let you know if it solves the issue :slight_smile:

So the stack trace is not that huge, so it is worth trying the new package (it’s already on the private NuGet feed), or specifying the preferIOCP=false option in your connection string so it looks like redismachine:6379,preferIOCP=false.
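
For example, a minimal sketch of where that connection string option goes in an ASP.NET Core application (the host name is just a placeholder, and this assumes the usual Hangfire.AspNetCore registration):

    using Hangfire;
    using Microsoft.Extensions.DependencyInjection;

    public class Startup
    {
        public void ConfigureServices(IServiceCollection services)
        {
            // "redismachine:6379" is a placeholder for the real ElastiCache endpoint;
            // preferIOCP=false tells the storage not to prefer IOCP threads (see above).
            services.AddHangfire(configuration => configuration
                .UseRedisStorage("redismachine:6379,preferIOCP=false"));
        }
    }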

Got it, thanks, I had just forgotten to add the feed to Visual Studio 2022 preview 1 haha.
Will try and let you know how it goes :slight_smile:

Tried it now, same issue :sweat_smile:

So we need to understand what’s going on inside the process – by running stdump as shown below, or by getting a profiler trace, to understand what methods are blocking everything.

I will have to do it tonight as I am out of time for now, but I’ll run it and let you know what the result is :slight_smile:


Ok, sorry about the wait, I finally got around to running it. I have one dump taken while the page was loading and one taken just after the page finished loading.

This is while loading the page:

password for it is:
j9tsdrUQMR

This is just after loading the page:

password for it is:
g1cwrrT5xT

No worries @mattgenious, thanks for sharing the stack traces. Unfortunately everything looks healthy in both dumps – response reads are performed in dedicated threads (so blocked IOCP threads can’t affect your case), those threads are waiting for a response from Redis (i.e. the problem is not in the response parsing logic), and there are idle worker threads in the CLR’s thread pool (so dashboard requests can be handled).

But the main thing is that there’s no active thread that’s rendering a dashboard page and waiting for some response from Hangfire’s Storage API. There are the following possible reasons for such an absence:

  1. The stack trace dump was taken when that dashboard request was already finished (or had not started yet).
  2. The delay happens outside of Hangfire, in some other middleware that’s registered globally in your application and participates in every request. In this case it will not be shown in the stack trace, because awaiting asynchronous tasks doesn’t consume a thread and is invisible to STDump (illustrated below).
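
To illustrate the second point, here is a purely hypothetical middleware (not taken from your application): while it awaits the slow call, no managed thread is blocked on that frame, so a stack trace dump taken at that moment simply doesn’t contain it.

    using System.Net.Http;
    using System.Threading.Tasks;
    using Microsoft.AspNetCore.Http;

    // Hypothetical example of middleware that delays every request, including
    // dashboard ones, without ever appearing in a thread dump.
    public class HypotheticalSlowMiddleware
    {
        private static readonly HttpClient Client = new HttpClient();
        private readonly RequestDelegate _next;

        public HypotheticalSlowMiddleware(RequestDelegate next) => _next = next;

        public async Task InvokeAsync(HttpContext context)
        {
            // While this call is pending, the thread returns to the pool and only
            // a continuation is scheduled, so STDump sees no frame for this method.
            await Client.GetAsync("https://example.org/slow-dependency");
            await _next(context);
        }
    }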

So I’m afraid this issue can only be investigated dynamically, e.g. by using some sort of profiler, like dotTrace or Windows Performance Recorder.