Strange behavior

Today we ran a production scenario of a feature using Hangfire.
The long-running job basically loops through a list of contacts and sends each of them an email (using the MailGun web service).
What we noticed while the job was running is that the total number of sent messages significantly exceeded the total number of contacts. We immediately shut down the server and rebooted to stop whatever was going on.
Upon further investigation, it looks like duplicate emails were sent at different times. After the reboot, there were two Hangfire servers showing on the control panel. After a few minutes, the duplicate server disappeared.
This is what the job looked like when it completed (before rebooting the server):

The Startup class looks like this:

Imports Hangfire
Imports Microsoft.Owin
Imports Owin
Imports Signals
Imports Microsoft.AspNet.SignalR

<Assembly: OwinStartup(GetType(MyWebApplication.Startup))>
Namespace MyWebApplication
    Public Class Startup
        Public Sub Configuration(app As IAppBuilder)
            Dim options = New BackgroundJobServerOptions() With { _
                .Queues = New String() {"mailblastqueue", "verifymailqueue"}, _
                .ServerName = "MailBlastServer" _
            }
            Dim jobStorage__1 = New Hangfire.SqlServer.SqlServerStorage("HANGFIREConnectionString")
            JobStorage.Current = jobStorage__1
            app.UseHangfireServer(options, jobStorage__1)
            app.UseHangfireDashboard()
            KickBoxHelper.VerifyEmail.ScheduleVerification("verify-email-job1", 500, Cron.Daily(8, 40))
            KickBoxHelper.VerifyEmail.ScheduleVerification("verify-email-job2", 500, Cron.Daily(9, 40))
            KickBoxHelper.VerifyEmail.ScheduleVerification("verify-email-job3", 500, Cron.Daily(10, 40))
            KickBoxHelper.VerifyEmail.ScheduleVerification("verify-email-job4", 500, Cron.Daily(11, 40))
            KickBoxHelper.VerifyEmail.ScheduleVerification("verify-email-job5", 500, Cron.Daily(00, 05))
            QuickEmailVerificationHelper.VerifyEmail.ScheduleVerification(300, Cron.Daily(15, 00))
            ' SignalR stuff
            Dim idProvider = New CustomUserIdProvider()
            GlobalHost.DependencyResolver.Register(GetType(IUserIdProvider), Function() idProvider)
            app.MapSignalR()
        End Sub
    End Class
End Namespace

Hi @Majdi_Dhissi, looks like your job takes more than 30 seconds to complete. Please see the warning on the Using SQL Server page and the Configuring the Invisibility Timeout section on that page for the details.

Please note that Hangfire has some other retry logic to ensure that every job executes at least once, so it is better to make your background methods idempotent. In your case, you could enqueue each email as its own background job instead of having one big background job.
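To illustrate what "idempotent" could mean for a mail job, here is a minimal sketch that remembers which (campaign, address) pairs have already been handled and skips repeats. The names `IdempotentSendSketch` and `SendOnce` are hypothetical, and the in-memory `HashSet` is only a stand-in: in a real application the record would have to live in a database so that it survives restarts and is shared between servers.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module IdempotentSendSketch
    ' Stand-in for a persistent "already sent" table keyed by campaign + address.
    ' Assumption: a real system would keep this in a database, not in memory.
    Private ReadOnly SentKeys As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
    Private ReadOnly SyncRoot As New Object()

    ' Returns True if the mail was sent now, False if it was skipped as a duplicate.
    Public Function SendOnce(campaignId As String, address As String, send As Action(Of String)) As Boolean
        Dim key = campaignId & "|" & address
        SyncLock SyncRoot
            ' HashSet.Add returns False when the key is already present.
            If Not SentKeys.Add(key) Then Return False
        End SyncLock
        send(address)
        Return True
    End Function

    Public Sub Main()
        Dim sendCount = 0
        Dim fakeSend As Action(Of String) = Sub(address) sendCount += 1
        SendOnce("campaign-1", "a@example.com", fakeSend)
        SendOnce("campaign-1", "a@example.com", fakeSend) ' Simulated re-run: skipped
        Console.WriteLine(sendCount)
    End Sub
End Module
```

Note the trade-off in this sketch: the key is recorded before the send, so a send that throws is not retried. Recording it after the send instead would allow retries at the cost of an occasional duplicate.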

Most of our jobs will complete within an hour.
So, would you recommend adding the following code to make sure jobs don’t get duplicated?

    var options = new SqlServerStorageOptions
    {
        InvisibilityTimeout = TimeSpan.FromHours(2)
    };

    GlobalConfiguration.Configuration.UseSqlServerStorage("<name or connection string>", options);
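In the VB.NET Startup class above, the storage is created through the `SqlServerStorage` constructor rather than `UseSqlServerStorage`, so the equivalent would presumably be to pass the options there. This is only a sketch: the module and function names are hypothetical, and it assumes the installed Hangfire.SqlServer version still honors `InvisibilityTimeout` (worth checking against the documentation for your version).

```vbnet
Imports System
Imports Hangfire
Imports Hangfire.SqlServer

Public Module HangfireStorageSetup
    ' Builds SQL Server storage with a longer invisibility timeout.
    ' Assumption: two hours comfortably exceeds the longest job.
    Public Function CreateStorage() As SqlServerStorage
        Dim storageOptions = New SqlServerStorageOptions() With { _
            .InvisibilityTimeout = TimeSpan.FromHours(2) _
        }
        Return New SqlServerStorage("HANGFIREConnectionString", storageOptions)
    End Function
End Module
```

The returned object would then take the place of `jobStorage__1` in `Configuration`, i.e. be assigned to `JobStorage.Current` and passed to `app.UseHangfireServer`.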

We ran into the same situation again: towards the end of the job (sending bulk mail), it got duplicated and another worker started sending to the same recipients again.

I thought the whole point of using Hangfire was to schedule or run long, time-consuming jobs in the background in a reliable way. If we were to queue each email as a separate background job and loop through the whole contact list ourselves, instead of having the loop+send inside a single BackgroundJob.Enqueue,
that would basically lock the user interface until the loop completed (obviously it would not work because of the UI timeout).

What we currently have is like this:

<Queue("mailblastqueue")> _
<AutomaticRetry(Attempts:=0)> _
Public Shared Function SendBulkMail(Recipients As List(Of MailRecipient), Subject As String, MessageContent As String, CampaignId As String) As String
    For Each e As MailRecipient In Recipients
        ' Send code here
    Next
End Function

Public Shared Function ScheduleBulkMail(Recipients As List(Of MailRecipient), Subject As String, MessageContent As String, CampaignId As String, SendAt As ScheduleTime) As String
    If SendAt = ScheduleTime.SendImmediately Then
        Return BackgroundJob.Enqueue(Function() SendBulkMail(Recipients, Subject, MessageContent, CampaignId))
    Else
        Return BackgroundJob.Schedule(Function() SendBulkMail(Recipients, Subject, MessageContent, CampaignId), TimeSpan.FromHours(SendAt))
    End If
End Function

How do you recommend changing the above?

Just bringing this to your attention. I would highly appreciate it if someone could comment or provide suggestions.
Thanks,

Sorry for the delay. Why not change the SendBulkMail function to the following:

For Each e As MailRecipient In Recipients
    BackgroundJob.Enqueue(Function() SendSingleMail(e, Subject, MessageContent, CampaignId))
Next

I.e. your SendBulkMail background job will spawn other background jobs (one per recipient) instead of sending the emails one after another itself.

Well, that wouldn’t change anything, would it?
The main job (SendBulkMail) will still take a long time to complete, leading to the same issue initially described.
Can you elaborate on how breaking one long, time-consuming task into micro-tasks running in the background would change anything with regard to the main background job or function?

I would imagine that a portion of the micro-jobs, i.e. SendSingleMail, would complete, but the main job would take just as much time to complete, leading to another process kicking in and re-executing the main job again.
Please keep in mind that sending a single email only takes a second or so (we’re sending via REST API, not SMTP).

It all depends on how long SendSingleMail takes compared to queuing the single mail task.

If queuing takes 1/10 of the time of a send, the SendBulkMail job will take 1/10 of the time. (Say 6 minutes rather than 1 hour.)

You could re-queue it as multiple smaller bulk jobs, say sending 1,000 messages at a time. This way, if you have 1,000,000 recipients, your SendBulkMail job would take 1/10,000 of the time it normally would, because it queues 1,000 jobs instead of sending 1,000,000 emails and each queue operation takes 1/10 of a send. Each smaller bulk job would take 1/1,000 of the time, and if a smaller bulk job were to fail and need to be repeated, it would only re-send those 1,000 emails rather than all of them. Overall, the send should take around the same amount of time, as the additional overhead would be limited.
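A rough sketch of that batching idea, assuming recipients are plain address strings (the original code uses a `MailRecipient` type that is not shown) and using the hypothetical names `MailBatching`, `SplitIntoBatches` and `SendBatch`. The parent job only slices the list and enqueues one child job per slice:

```vbnet
Imports System
Imports System.Collections.Generic
Imports Hangfire

Public Class MailBatching
    ' Splits the items into consecutive batches of at most batchSize entries.
    Public Shared Function SplitIntoBatches(Of T)(items As List(Of T), batchSize As Integer) As List(Of List(Of T))
        If batchSize <= 0 Then Throw New ArgumentOutOfRangeException("batchSize")
        Dim batches As New List(Of List(Of T))()
        For offset = 0 To items.Count - 1 Step batchSize
            batches.Add(items.GetRange(offset, Math.Min(batchSize, items.Count - offset)))
        Next
        Return batches
    End Function

    ' Parent job: enqueues one child job per batch and sends nothing itself.
    <Queue("mailblastqueue")> _
    Public Shared Sub SendBulkMail(recipients As List(Of String), subject As String, messageContent As String, campaignId As String)
        For Each batch In SplitIntoBatches(recipients, 1000)
            Dim currentBatch = batch
            BackgroundJob.Enqueue(Sub() SendBatch(currentBatch, subject, messageContent, campaignId))
        Next
    End Sub

    ' Child job: sends one batch; a repeat re-sends at most this batch.
    <Queue("mailblastqueue")> _
    Public Shared Sub SendBatch(batch As List(Of String), subject As String, messageContent As String, campaignId As String)
        For Each address In batch
            ' Send code here
        Next
    End Sub
End Class
```

With 1,000,000 recipients this produces 1,000 child jobs. Each batch is serialized into its job's arguments, so the batch size is also a trade-off against the size of the stored job data.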

I think you mean 30 minutes rather than 30 seconds. (I hope you do, anyway!)