Hi. We’ve been running Sitecore 9.02 and Hangfire for about 3 years in production.
The infrastructure is hosted in Azure (Web Apps and Azure SQL).
The setup has been more or less stable the entire time. We’re processing around 200k-500k jobs every day.
We’ve recently encountered a huge issue where the server completely hangs on startup (a shutdown/startup happens at least once during every deploy). This behaviour seems to occur when we have over 100k jobs enqueued. No amount of restarts will fix it. I can only get the server up again if the Hangfire database is truncated or if I create a new SQL instance.
The .NET profiler reveals some Hangfire threads in the WaitOne state. I assume those are expected and come from the workers waiting for new jobs to process.
We also have instances where the environment runs fine for a couple of days. When I come back to it, we might have 2-3 million jobs enqueued and a number of jobs stuck in the processing state, where they may have been for the last hour, even though this particular job only takes 300ms to run.
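To put numbers on the "stuck in processing" observation, something like the following could be run inside the application. This is only a sketch using Hangfire's monitoring API (`JobStorage.Current.GetMonitoringApi()`); the class and method names (`HangfireDiagnostics`, `PrintStuckJobs`), the page size of 100 and the threshold are placeholders, and I am assuming `StartedAt` is reported in UTC.

```csharp
using System;
using Hangfire;
using Hangfire.Storage;

// Hypothetical helper, not part of Hangfire itself.
public static class HangfireDiagnostics
{
    // Prints queue totals and any job that has been "processing" longer than the threshold.
    public static void PrintStuckJobs(TimeSpan threshold)
    {
        IMonitoringApi monitor = JobStorage.Current.GetMonitoringApi();

        var stats = monitor.GetStatistics();
        Console.WriteLine($"Enqueued: {stats.Enqueued}, Processing: {stats.Processing}");

        // Only the first 100 processing jobs are inspected (arbitrary page size).
        foreach (var entry in monitor.ProcessingJobs(0, 100))
        {
            if (entry.Value == null || !entry.Value.StartedAt.HasValue)
            {
                continue; // job data missing or no start time recorded
            }

            // Assumption: StartedAt is a UTC timestamp.
            TimeSpan running = DateTime.UtcNow - entry.Value.StartedAt.Value;
            if (running > threshold)
            {
                Console.WriteLine(
                    $"Job {entry.Key} on server {entry.Value.ServerId} has been processing for {running}");
            }
        }
    }
}
```

Calling `HangfireDiagnostics.PrintStuckJobs(TimeSpan.FromMinutes(5))` would list jobs that have been processing far longer than the expected 300ms, together with the server that claims to own them.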
We have looked at server and SQL utilization and run the various diagnostics tools in Azure. None of them seem to reveal the issue. Scaling the infrastructure makes no difference.
To me it seems that the infrastructure has problems shutting down or releasing CPU threads. Something is not being shut down gracefully, so to speak.
Hangfire is configured in the OWIN initialize pipeline. See attached screenshot. It’s not “pure” OWIN, as Sitecore has a layer on top of it.
I’m wondering if it’s this particular way of initializing the system, and the shutdown sequence, that is giving us issues.
In other words: It could be that the combination of Sitecore and Hangfire is the issue (not Hangfire in itself).
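For anyone who cannot see the screenshot: a plain OWIN registration generally has the shape below. This is a generic sketch, not the configuration from the screenshot. The `HangfireSetup` class, its `Configure` method and every timeout and worker-count value are placeholders, and the storage options shown assume Hangfire 1.7 or later. In our case the call would sit inside the Sitecore-managed OWIN startup rather than a regular `Startup` class.

```csharp
using System;
using Hangfire;
using Hangfire.SqlServer;
using Owin;

// Hypothetical helper; invoke it with the IAppBuilder that the host exposes.
public static class HangfireSetup
{
    public static void Configure(IAppBuilder app, string connectionString)
    {
        // Storage options; all values are illustrative, not recommendations.
        var storageOptions = new SqlServerStorageOptions
        {
            SlidingInvisibilityTimeout = TimeSpan.FromMinutes(5),
            QueuePollInterval = TimeSpan.Zero,
            CommandBatchMaxTimeout = TimeSpan.FromMinutes(5),
            UseRecommendedIsolationLevel = true
        };

        GlobalConfiguration.Configuration
            .UseSqlServerStorage(connectionString, storageOptions);

        var serverOptions = new BackgroundJobServerOptions
        {
            // Explicit worker count so the load on SQL is predictable.
            WorkerCount = 20,
            // How long the server waits for running jobs when shutting down.
            ShutdownTimeout = TimeSpan.FromSeconds(15)
        };

        // Starts the background job server and ties its lifetime to the OWIN host.
        app.UseHangfireServer(serverOptions);
    }
}
```

The part I am unsure about is whether the OWIN shutdown signal still reaches the Hangfire server when Sitecore wraps the pipeline, since that is what lets the workers stop and release their jobs cleanly.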
Do any of you have experience with this combination of Sitecore, Hangfire and Azure? Do you have any tips on how Hangfire should be configured in a setup like this, or on pitfalls to avoid?