I’ve written some code to provide a simplified, machine-readable view of current Hangfire jobs. We use this to monitor the jobs with Nagios.
Here’s the relevant part of that code:
    using System;
    using Hangfire.Storage;

    namespace Thingster.WebServer.Views.Home.SiteMonitorSupport
    {
        internal static class HangfireJobCheck
        {
            public const int NumberOfDaysToTolerateFailure = 2;
            public static TimeSpan MaximumTimeAnyJobMayFail = new TimeSpan(NumberOfDaysToTolerateFailure, 0, 0, 0);

            public static HealthCheckResult Run()
            {
                return RunThisTest();
            }

            public static HealthCheckResult RunThisTest()
            {
                var runTime = DateTime.Now;
                var result = new HealthCheckResult("HangfireJob");
                var hangfireStorage = Startup.GetSqlServerStorage();

                using (var connection = hangfireStorage.GetConnection())
                {
                    var recurringJobs = connection.GetRecurringJobs();
                    foreach (var job in recurringJobs)
                    {
                        if (!job.LastExecution.HasValue)
                        {
                            // No recorded execution at all counts as a failure.
                            result.Success = false;
                            result.AddResultMessage(string.Format("Problem: Job ID {0} has never been run.", job.Id));
                        }
                        else
                        {
                            TimeSpan timeSinceJobWasRun = runTime - job.LastExecution.Value;
                            if (timeSinceJobWasRun > MaximumTimeAnyJobMayFail)
                            {
                                // Last recorded execution is older than the tolerated window.
                                result.Success = false;
                                result.AddResultMessage(string.Format(
                                    "Problem: Job ID {0} last run {1:0.00} hours ago, which is longer than maximum time of {2:0.00} hours.",
                                    job.Id, timeSinceJobWasRun.TotalHours, MaximumTimeAnyJobMayFail.TotalHours));
                            }
                            else
                            {
                                // Ran recently enough; report the state of that last run.
                                var success = job.LastJobState == "Succeeded";
                                string problemString = !success ? " had problems" : string.Empty;
                                result.AddResultMessage(string.Format(
                                    "Job ID {0} last run {1:0.00} hours ago{3}. Last run state: {2}.",
                                    job.Id, timeSinceJobWasRun.TotalHours, job.LastJobState, problemString));
                                if (!success)
                                {
                                    result.Success = false;
                                }
                            }
                        }
                    }
                }

                return result;
            }
        }
    }
And while I haven’t included everything, here’s what that ends up looking like:
SiteMonitor Web Page
Check HangfireJob: Fail
Problem: Job ID Accrual Obligations has never been run.
Problem: Job ID Auto Set Task Status has never been run.
Problem: Job ID Fire Reminders has never been run.
Problem: Job ID Identify XY has never been run.
Problem: Job ID Mail Data Quality Admin has never been run.
Job ID Send Queued Messages last run 8.31 hours ago. Last run state: Succeeded.
Problem: Job ID Session Cleanup has never been run.
Problem: Job ID Status Report Statistics has never been run.
Job ID Workflow Notifications last run 8.31 hours ago. Last run state: Succeeded.
Nagios examines this page and raises an alert if it finds errors.
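To illustrate the Nagios side, here is a minimal sketch of how the page text could be turned into a plugin exit code. It assumes the page format shown above (a summary line such as `Check HangfireJob: Fail`) and the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The names `SiteMonitorPageCheck` and `ToNagiosExitCode` are made up for this example, and this is not our actual check, only an approximation of the idea.

```csharp
using System;
using System.Linq;

// Hypothetical helper: maps SiteMonitor page text to a Nagios plugin exit code.
public static class SiteMonitorPageCheck
{
    // Standard Nagios plugin exit codes.
    public const int Ok = 0;
    public const int Critical = 2;
    public const int Unknown = 3;

    public static int ToNagiosExitCode(string pageText)
    {
        if (string.IsNullOrWhiteSpace(pageText))
        {
            return Unknown; // nothing to inspect, so the state is unknown
        }

        // Collect the per-check summary lines, e.g. "Check HangfireJob: Fail".
        var checkLines = pageText
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.StartsWith("Check ", StringComparison.Ordinal))
            .ToList();

        if (checkLines.Count == 0)
        {
            return Unknown; // the page contained no "Check ..." summary lines
        }

        // Any summary line ending in ": Fail" is treated as critical.
        return checkLines.Any(line => line.EndsWith(": Fail", StringComparison.Ordinal))
            ? Critical
            : Ok;
    }
}
```

With the sample page above, the `Check HangfireJob: Fail` line makes this return the critical code, which is the alert we see whenever any job is flagged.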
Here’s the problem: job.LastExecution does NOT appear to include manual executions. On this website, when a developer or devops person is alerted via Nagios that a job has failed, they first attempt to fix the problem and then re-run the job manually. Even after that manual re-run succeeds, our Hangfire site monitor still reports the job as not having run successfully. We would expect a successful manual re-run to clear the error condition in Nagios.
Is there a deliberate reason that LastExecution does not show manual runs? Before I started digging around in other structures, trying to get the “real” LastExecution value we are interested in, I wanted some idea of the thinking behind the value of LastExecution.
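To make concrete what I mean by the “real” value: the time I would like to compare against the threshold is the most recent run of any kind, scheduled or manual. Below is a minimal sketch of that rule. `lastManualRun` is a hypothetical input, since where to read a manual run's timestamp from in Hangfire is exactly what I have not worked out, and `LastRunResolver` and `EffectiveLastRun` are illustrative names only.

```csharp
using System;

// Hypothetical helper: picks the most recent run time from two optional sources.
public static class LastRunResolver
{
    // lastScheduledRun: e.g. job.LastExecution from the recurring job entry.
    // lastManualRun: assumed to come from some other source (not shown here).
    public static DateTime? EffectiveLastRun(DateTime? lastScheduledRun, DateTime? lastManualRun)
    {
        if (!lastScheduledRun.HasValue) return lastManualRun;
        if (!lastManualRun.HasValue) return lastScheduledRun;

        // Both are known: the later of the two is the effective last run.
        return lastScheduledRun.Value >= lastManualRun.Value
            ? lastScheduledRun
            : lastManualRun;
    }
}
```

If both inputs are null the result is null, which the check above would still report as "has never been run".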