As mentioned in a Twitter exchange with @odinserj, I’ve been having issues with MSMQ and MS DTC in Hangfire (HF) 1.5.0-beta1.
I currently have a process that spawns thousands of jobs (much like a batch job) which I would ideally like to process on multiple servers. In my testing environment I have 4 Windows Server 2008 R2 servers running our ASP.NET application with HF 1.5.0-beta1 installed.
One of the servers hosts a public queue called application-processname, where processname is the name of the queue used in Hangfire.
The queue path is configurable in our web.config and looks like this:
FormatName:DIRECT=OS:server-1\application-{0}
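To make the pattern concrete, here is a minimal sketch of how I understand the {0} placeholder to be expanded: one path per queue name, via string.Format. The class name QueuePathDemo is made up for illustration, and the assumption that the expansion is a plain string.Format call is mine; the queue name modelmigrations is the one that appears in the job state data further down.

```csharp
using System;

class QueuePathDemo
{
    static void Main()
    {
        // Path pattern exactly as configured in web.config.
        const string pathPattern = @"FormatName:DIRECT=OS:server-1\application-{0}";

        // Queue names; "modelmigrations" is taken from the state data in this post.
        var queues = new[] { "modelmigrations" };

        foreach (var queue in queues)
        {
            // Assumption: each queue name replaces the {0} placeholder.
            Console.WriteLine(string.Format(pathPattern, queue));
        }
    }
}
```

Under that assumption, every server resolves the same direct format name on server-1, so the three other servers always access the queue remotely.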
We initialize MSMQ in our code as below (where _allQueues is just a list of queue names):
//create the sql server storage and use MSMQ queuing
var sqlServerStorage = new SqlServerStorage(DatabaseManager.CreateConnectionString());
JobStorage.Current = sqlServerStorage.UseMsmqQueues(MsmqTransactionType.Dtc, messageQueuePath, _allQueues.ToArray());
Once deployed to the web servers, I had a lot of trouble getting the three servers that do not host the queue to connect to it. I eventually granted Full Control permissions on the queue to Everyone and to Anonymous User, as well as to the machines themselves. Once I got this working, triggering the process started to raise a number of exceptions.
Two of the non-queue servers do not participate in the work at all and write this exception to the log:
2015-07-29 00:41:46.6966 UTC | 2015-07-29 10:41:46.41 +10:00 Server | Error | Hangfire.Server.Worker | IIS APPPOOL\AppPoolName | Error occurred during execution of 'Worker #2' component. Execution will be retried (attempt 7 of 2147483647) in 00:00:49 seconds.
System.Messaging.MessageQueueException (0x80004005): Cannot import the transaction.
at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 action, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction internalTransaction, MessageQueueTransactionType transactionType)
at System.Messaging.MessageQueue.Receive(TimeSpan timeout, MessageQueueTransactionType transactionType)
at Hangfire.SqlServer.Msmq.MsmqDtcTransaction.Receive(MessageQueue queue, TimeSpan timeout)
at Hangfire.SqlServer.Msmq.MsmqJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
From my investigation, this indicates a problem connecting to MS DTC (Distributed Transaction Coordinator).
I have verified that all four servers have exactly the same configuration for MSDTC network access and for firewall access. (See the bottom of the post for screenshots.)
On the two servers (queue host and the last non-queue host) that do participate in the work, I get this exception:
2015-07-29 00:39:37.9718 UTC | 2015-07-29 10:39:37.39 +10:00 Server | Info | Hangfire.Server.Worker | IIS APPPOOL\AppPoolName | Error occurred during execution of 'Worker #1' component. Execution will be retried (attempt 2 of 2147483647) in 00:00:04 seconds.
System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.TimeoutException: Transaction Timeout
--- End of inner exception stack trace ---
at System.Transactions.TransactionStatePromotedAborted.BeginCommit(InternalTransaction tx, Boolean asyncCommit, AsyncCallback asyncCallback, Object asyncState)
at System.Transactions.CommittableTransaction.Commit()
at System.Transactions.TransactionScope.InternalDispose()
at System.Transactions.TransactionScope.Dispose()
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)
This is very strange, as I’m not seeing any exceptions occurring in my process. Another strange thing is that I’m seeing a job start processing on the same server twice, about 80 seconds apart. It then ‘succeeds’ about 9 seconds later.
This is the result of SELECT * FROM Hangfire.State WHERE JobId = 3, where job 3 is marked as Succeeded:
| Id | JobId | Name | Reason | CreatedAt | Data |
|---|---|---|---|---|---|
| 8 | 3 | Enqueued | NULL | 2015-07-29 00:35:31.260 | {"EnqueuedAt":"2015-07-29T00:35:31.1494219Z","Queue":"modelmigrations"} |
| 11 | 3 | Processing | NULL | 2015-07-29 00:35:31.967 | {"StartedAt":"2015-07-29T00:35:31.9632801Z","ServerId":"server-1:888:a3af8fc6-522c-4f84-85d4-7ededaf87a61","WorkerNumber":"1"} |
| 5819 | 3 | Processing | NULL | 2015-07-29 00:36:51.613 | {"StartedAt":"2015-07-29T00:36:51.6082438Z","ServerId":"server-1:888:a3af8fc6-522c-4f84-85d4-7ededaf87a61","WorkerNumber":"2"} |
| 6572 | 3 | Succeeded | NULL | 2015-07-29 00:37:00.680 | {"SucceededAt":"2015-07-29T00:37:00.6751504Z","PerformanceDuration":"88698","Latency":"745"} |
I anticipate each job taking approximately 5 minutes, so there is clearly some issue going on.
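For anyone who wants to check the timing, the gaps between the state changes can be worked out from the CreatedAt values above. This is only a small sketch for the arithmetic: the class name StateTimeline is made up, and the timestamps are copied by hand from the rows for job 3.

```csharp
using System;
using System.Globalization;

class StateTimeline
{
    static void Main()
    {
        // State name and CreatedAt for each Hangfire.State row of job 3.
        var rows = new[]
        {
            Tuple.Create("Enqueued",   "2015-07-29 00:35:31.260"),
            Tuple.Create("Processing", "2015-07-29 00:35:31.967"),
            Tuple.Create("Processing", "2015-07-29 00:36:51.613"),
            Tuple.Create("Succeeded",  "2015-07-29 00:37:00.680"),
        };

        DateTime? previous = null;
        foreach (var row in rows)
        {
            var createdAt = DateTime.ParseExact(
                row.Item2, "yyyy-MM-dd HH:mm:ss.fff", CultureInfo.InvariantCulture);

            // Seconds elapsed since the previous state change (none for the first row).
            var gap = previous.HasValue
                ? (createdAt - previous.Value).TotalSeconds.ToString("F3", CultureInfo.InvariantCulture) + " s"
                : "n/a";

            Console.WriteLine("{0,-10} {1}  gap: {2}", row.Item1, row.Item2, gap);
            previous = createdAt;
        }
    }
}
```

The gaps come out at roughly 0.7 s, 79.6 s and 9.1 s. The last two add up to about 88.7 s, which lines up with the reported PerformanceDuration of 88698 if that value is in milliseconds, so the duration appears to be counted from the first Processing row rather than the second.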
I hope this information is helpful. This is obviously a problem for us, so I’m happy to test anything that might help get it working.
MS DTC Configuration (exactly the same for each server)
Services
Windows Firewall Outbound
Windows Firewall Inbound
Component Services → Local DTC Properties