Hangfire batches fail on creating very large batches

Here’s a typical scenario - select about 100,000 rows from my database (say user IDs), and create a Batch Job, that includes running a job for each row.

On SQL Server (SQL Azure), if I go past ~20K jobs, batch job creation fails. What I think happens is for each BatchJob.Enqueue, 1 statement is run, which is fine. But when all the jobs are created, Hangfire does some kind of “activation” for the batch, which takes WAY too long - my guess is its trying to run 1 more statement for each job, and there’s a transaction timeout that happens here (see stack trace below).

Of course, I can increase the transaction timeout, but SQL Azure caps it at 10 mins, so that’s not a feasible option. I’m having to annoying break my batch jobs in multiple jobs, just to get this simple workflow in place.

Does anyone else using Hangfire.Pro have this issue? Smells like bad architecture to me - once all the jobs are created, the final part of the batch job should ideally just be 1 statement. Seems like something very convoluted is going on.

Hangfire.CreateBatchFailedException: An exception occurred while creating the batch. See the inner exception for details. ---> System.InvalidOperationException: The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements. at System.Data.SqlClient.TdsParser.TdsExecuteRPC(SqlCommand cmd, _SqlRPC[] rpcArray, Int32 timeout, Boolean inSchema, SqlNotificationRequest notificationRequest, TdsParserStateObject stateObj, Boolean isCommandProc, Boolean sync, TaskCompletionSource1 completion, Int32 startRpc, Int32 startParam)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource1 completion, Int32 timeout, Task& task, Boolean asyncWrite) at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(TaskCompletionSource1 completion, String methodName, Boolean sendToPipe, Int32 timeout, Boolean asyncWrite)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command)
at Hangfire.SqlServer.SqlServerWriteOnlyTransaction.<>c__DisplayClass23_0.b__0(DbConnection connection, DbTransaction transaction)
at Hangfire.SqlServer.SqlServerWriteOnlyTransaction.b__5_0(DbConnection connection, DbTransaction transaction)
at Hangfire.SqlServer.SqlServerStorage.<>c__DisplayClass19_0.b__0(DbConnection connection, DbTransaction transaction)
at Hangfire.SqlServer.SqlServerStorage.<>c__DisplayClass20_01.<UseTransaction>b__0(DbConnection connection) at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](Func2 func)
at Hangfire.SqlServer.SqlServerStorage.UseTransaction[T](Func3 func, Nullable1 isolationLevel)
at Hangfire.SqlServer.SqlServerStorage.UseTransaction(Action2 action) at Hangfire.SqlServer.SqlServerWriteOnlyTransaction.Commit() at Hangfire.Batches.Client.BatchFactory.Create(BatchCreateContext context) at Hangfire.BatchJobClient.Create(Action1 createAction, IBatchState state, String description)
— End of inner exception stack trace —
at Hangfire.BatchJobClient.Create(Action1 createAction, IBatchState state, String description) at Hangfire.BatchJobClientExtensions.StartNew(IBatchJobClient client, Action1 createAction, String description)
at Hangfire.BatchJob.StartNew(Action1 createAction, String description)

What pricing tier are you using for your SQL Azure database? Have you tried to increase the timeout in code?

P4 Pricing Tier.

Yes, as I mentioned, I can’t raise transaction timeout to greater than 10 minutes on SQL Azure. More importantly, the DB load shoots up when batches are being committed. What happens during the final step of batch creation? Is there a way we can avoid it?

Happy to contribute code, if you can point me the right direction …

There are two phases of a batch creation: create all the background jobs, and move them to their desired state. The last step is held within a transaction, and this is the step that causes the timeout, because K * N queries are running, where K is the number of queries required to make a transition for 1 background job, and N is the total number of background jobs in a batch. Moreover, there are 2 * K * N network delays, one for request, and one for response here.

This the price for Hangfire extensibility, and this can’t be optimised in a simple way, when SQL Server is used as a background job storage. Have you considered to use Redis for such a big workload? Its transactions work much better with fewer delays, as it’s flushing all the commands as a single request when using transactions.

Thanks, that’s what I feared.

No - I don’t want to use Redis, as I want my Hangfire jobs and batches to participate in my application’s transactions, and some of my jobs are scheduled months into the future, and Redis isn’t a reliable option for persistent storage.

I’ll go through the source and see if I can find a way around this.

Sorry, there’s no silver bullet here for such a high workload. Creation time always depends on the number of background jobs, and SQL Azure Database is much slower than an on-premise one even in premium tiers. Of course, there may be less queries in Hangfire, but their reduction will not make the creation faster in order of magnitudes. So other storage options are needed here.

Regarding Redis, a lot of people using it for background processing. I know that popularity of a solution != reliability, but what do you mean when calling it isn’t reliable option?

There are no problems with existing jobs during the transition – you can have two servers, one that processes background jobs from the old storage, and one that processes the new ones.

I understand the price of extensibility.

I have a rather unique use-case, here are my reasons to not use Redis:

  1. I don’t have a very high real-time load to need Redis
  2. I want my job creation to participate in my application’s SQL Server transactions – this is really the biggest reason I use Hangfire. If my app’s transaction is disposed, no jobs are created. I understand I can do this with Redis by restructuring my app code, but that further increases complexity
  3. I sometimes schedule jobs months into the future. Now Redis isn’t by default considered a highly-reliable store, it’s somewhat more reliable than memcached, and I can take it quite far using append-only logs and whatnot, but it’s not a place I’m comfortable storing critical bits of my application (I’ve done this in the past and have gotten burnt – especially if your dataset is LARGE, and you need to restore this after a Redis restart, the start-up takes way too long, since everything has to be loaded in memory).

Anyways my current plan is to remove the extensibility limitation, so I can have a more SQL-oriented storage, which fits just my use case. I also want to add some more control to my batch jobs, like stop/pause/delete/override queue names etc.

Huge fan of Hangfire, keep up the good work!

If you already use transactions, then creation of thousands of background jobs are already atomic, and don’t require batches unless you are using batch continuations. What is your use case?

@tejasxp , a lot of new features were added to batches in 2.0.0-beta2 (ETA for release is April), and now it’s possible to nest batches. With this feature in mind, you can break down a batch to 1 parent that consist of 1M jobs to 1000 batches, each consists of 1000 background jobs. This will not decrease the creation time, because the number of queries was even increased a bit, but state transitions will be much quicker.

BatchJob.StartNew(batch =>
{
    Parallel.For(0, 1000, i =>
    {
        batch.StartNew(inner =>
        {
            Parallel.For(0, 1000, j =>
            {
                inner.Enqueue(() => Empty());
            });
        });
    });
});

Alternatively you can now attach jobs or batches to an existing batch by using the BatchJob.Attach method and create batch incrementally:

BatchJob.Attach(batchId, batch => 
{
    foreach (var record in records)
    {
        batch.Enqueue(() => ProcessRecord(ListId, record.Id)));
    }
});

Great, looking forward to this release. Thank you.

@odinserj in your second example how do you also set an AwaitBatch. It seems that if you init an empty batch, then assign an await batch, then populate the batch, the await batch fires right away.

The trick is to create a batch with a single job first. And that job will perform the Batch.Attach method. In this case, antecedent batch will be waiting for that job to be completed:

var id = BatchJob.StartNew(batch => batch.Enqueue(() => AttachJobs(batch.BatchId, campaignId))));
BatchJob.AwaitBatch(id, batch => batch.Enqueue(() => SendNotification(campaignId)));

public void AttachJobs(string batchId, long campaignId)
{    
    var newsletterIds = Database.GetNewslettedIds(campaignId);

    BatchJob.Attach(batchId, batch =>
    {
        for (var i = 0; i < newsletterIds.Count; i++)
        {
            batch.Enqueue(() => ProcessNewsletter(campaignId, newsletterId));
        }
    });
}

But please note that the AttachJobs will be performed at least once, meaning there may be retries. So all of your inner jobs (ProcessNewsletter) should be idempotent for the same campaignId and newsletterId combinations.