Multiple server issues with MSMQ & MSDTC in 1.5.0-beta1

Tags: #<Tag:0x00007f0662eb4550> #<Tag:0x00007f0662eb4488>

As mentioned in a twitter exchange with @odinserj, I’ve been having issues with MSMQ, MS DTC in HF 1.5.0-beta1.

I currently have a process that spawns thousands of jobs (much like a batch job) which I would ideally like to process on multiple servers. In my testing environment I have 4 Windows Server 2008 R2 servers running our ASP.NET application with HF 1.5.0-beta1 installed.

1 of the servers has a public queue called application-processname. processname is the name of the queue used in Hangfire.

Our application has the queue path configurable by the web.config and it looks like this
FormatName:DIRECT=OS:server-1\application-{0}

We initialize MSMQ in our code as below (where _allQueues is just a list of queue names):

            //create the sql server storage and use MSMQ queuing
            var sqlServerStorage = new SqlServerStorage(DatabaseManager.CreateConnectionString());
            JobStorage.Current = sqlServerStorage.UseMsmqQueues(MsmqTransactionType.Dtc, messageQueuePath, _allQueues.ToArray());

Once deployed to the web servers, I had a lot of trouble getting the three servers without the queue connecting to the queue. I eventually set Full Control permissions on the queue to Everyone and to Anonymous User as well as the machines themselves. Once I managed to get this working, triggering the process started to raise a number of exceptions.

2 of the non-queue servers would not participate in the work at all, writing this exception to the log:

2015-07-29 00:41:46.6966 UTC | 2015-07-29 10:41:46.41 +10:00 Server | Error | Hangfire.Server.Worker | IIS APPPOOL\AppPoolName | Error occurred during execution of 'Worker #2' component. Execution will be retried (attempt 7 of 2147483647) in 00:00:49 seconds.
   System.Messaging.MessageQueueException (0x80004005): Cannot import the transaction.
   at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 action, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction internalTransaction, MessageQueueTransactionType transactionType)
   at System.Messaging.MessageQueue.Receive(TimeSpan timeout, MessageQueueTransactionType transactionType)
   at Hangfire.SqlServer.Msmq.MsmqDtcTransaction.Receive(MessageQueue queue, TimeSpan timeout)
   at Hangfire.SqlServer.Msmq.MsmqJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

Doing investigation, this indicates that there are issues connecting to MS DTC (Distributed Transaction Controller).
I have verified that all have the exact same configuration for MSDTC network access as well as firewall access. (See bottom of post for pics)

On the two servers (queue host and the last non-queue host) that do participate in the work, I get this exception:

2015-07-29 00:39:37.9718 UTC | 2015-07-29 10:39:37.39 +10:00 Server | Info | Hangfire.Server.Worker | IIS APPPOOL\AppPoolName | Error occurred during execution of 'Worker #1' component. Execution will be retried (attempt 2 of 2147483647) in 00:00:04 seconds.
   System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.TimeoutException: Transaction Timeout
   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStatePromotedAborted.BeginCommit(InternalTransaction tx, Boolean asyncCommit, AsyncCallback asyncCallback, Object asyncState)
   at System.Transactions.CommittableTransaction.Commit()
   at System.Transactions.TransactionScope.InternalDispose()
   at System.Transactions.TransactionScope.Dispose()
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

Which is very strange as I’m not seeing any exceptions occurring in my process. Another strange thing is that I’m seeing a job start processing on the same server twice within a minute. It then ‘succeeds’ 10 seconds later.
This is the result of a SELECT * FROM Hangfire.State WHERE JobId = 3, where job 3 is marked as Succeeded.

8         3    Enqueued    NULL    2015-07-29 00:35:31.260    {"EnqueuedAt":"2015-07-29T00:35:31.1494219Z","Queue":"modelmigrations"}
11       3    Processing    NULL    2015-07-29 00:35:31.967    {"StartedAt":"2015-07-29T00:35:31.9632801Z","ServerId":"server-1:888:a3af8fc6-522c-4f84-85d4-7ededaf87a61","WorkerNumber":"1"}
5819    3    Processing    NULL    2015-07-29 00:36:51.613    {"StartedAt":"2015-07-29T00:36:51.6082438Z","ServerId":"server-1:888:a3af8fc6-522c-4f84-85d4-7ededaf87a61","WorkerNumber":"2"}
6572    3    Succeeded    NULL    2015-07-29 00:37:00.680    {"SucceededAt":"2015-07-29T00:37:00.6751504Z","PerformanceDuration":"88698","Latency":"745"}

I anticipate each job taking approximately 5 minutes, so there is clearly some issue going on.

I hope this information is helpful. This is obviously an issue for us so happy to test out things to try and get it working.

MS DTC Configuration (exactly the same for each server)

Services

Windows Firewall Outbound

Windows Firewall Inbound

Component Services -> Local DTC Properties

1 Like

Wonderful description! Looks like you configured all the services well, I had the same configuration when I’ve tested the remote queues support. I’ve looked StackOverflow, here are some tips:

  • Please ensure Distributed Transaction Coordinator service is running on all the machines (stupid tip, yes :smile:).
  • Make sure, both machine’s clocks are in sync. I seen this before where authentication will fail because the server and the client are off time by a minute. This will happen even if the queues are public and have permissions on everyone (SO answer).
  • Let ICMP traffic through the firewall as well (SO answer).

Have you tried to run DTCPing diagnostic tool (you need to run in on both servers)?

Which is very strange as I’m not seeing any exceptions occurring in my process.

Do you mean that you are using MSMQ from another place in your application?

Thanks for the tips, I didn’t know about DTCPing.

It turns out that the VM’s i’ve been testing have been incorrectly cloned, meaning that they have duplicate CID’s for MSDTC..

I’ve reinstalled MSDTC on the two VM’s that don’t work and I’m now getting another error :frowning:.

2015-07-31 01:47:26.8494 UTC | 2015-07-31 11:47:26.47 +10:00 Server | Error | Hangfire.Server.Worker | IIS APPPOOL\AppPoolName | Error occurred during execution of ‘Worker #1’ component. Execution will be retried (attempt 12 of 2147483647) in 00:02:05 seconds.
System.Messaging.MessageQueueException (0x80004005): Cannot enlist the transaction.
at System.Messaging.MessageQueue.ReceiveCurrent(TimeSpan timeout, Int32 action, CursorHandle cursor, MessagePropertyFilter filter, MessageQueueTransaction internalTransaction, MessageQueueTransactionType transactionType)
at System.Messaging.MessageQueue.Receive(TimeSpan timeout, MessageQueueTransactionType transactionType)
at Hangfire.SqlServer.Msmq.MsmqDtcTransaction.Receive(MessageQueue queue, TimeSpan timeout)
at Hangfire.SqlServer.Msmq.MsmqJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.AutomaticRetryProcess.Execute(BackgroundProcessContext context)

I’ll keep trying things out and we’ll see how you go.

Nah, I meant that I wasn’t seeing any exceptions, which might have been forcing the transaction abort.

FYI: I’m away for a couple of weeks so I won’t be able to do any testing for a while.

Wow, nice investigation. What shows the DTCPing utility? Does it enlist the transaction correctly?

This is an anonymized output of DTCPing. The output from all servers appear to be the same. No obvious errors.

IP Configure Information
Host Name . . . . . . . . . : SERVER_1
DNS Servers . . . . . . . . : X.X.X.X
X.X.X.X
Node Type . . . . . . . . . : Hybrid
NetBIOS Scope ID. . . . . . :
IP Routing Enabled. . . . . : no
WINS Proxy Enabled. . . . . : no
NetBIOS Resolution Uses DNS : no

Ethernet adapter {GUID}:

Description . . . . . . . . : Intel(R) PRO/1000 MT Network Connection
Physical Address. . . . . . : 00-00-00-00-00-00
DHCP Enabled. . . . . . . . : yes
IP Address. . . . . . . . . : X.X.X.X
Subnet Mask . . . . . . . . : X.X.X.X
Default Gateway . . . . . . : X.X.X.X
DHCP Server . . . . . . . . : X.X.X.X
Primary WINS Server . . . . : 
Secondary WINS Server . . . : 
Lease Obtained. . . . . . . : Sun Aug 16 23:20:57 2015
Lease Expires . . . . . . . : Mon Aug 17 07:20:57 2015

++++++++++++lmhosts.sam++++++++++++

++++++++++++hosts ++++++++++++

08-17, 11:57:59.453–>Error(0x424) at clutil.cpp @256
08-17, 11:57:59.453–>-->OpenCluster
08-17, 11:57:59.454–>-->1060(The specified service does not exist as an installed service.)
++++++++++++++++++++++++++++++++++++++++++++++
DTCping 1.9 Report for SERVER_1
++++++++++++++++++++++++++++++++++++++++++++++
RPC server is ready
++++++++++++Validating Remote Computer Name++++++++++++
08-17, 11:58:10.494–>Start DTC connection test
Name Resolution:
SERVER_2–>Y.Y.Y.Y–>SERVER_2.company.corp
08-17, 11:58:10.562–>Start RPC test (SERVER_1–>SERVER_2)
RPC test is successful
Partner’s CID: GUID
++++++++++++RPC test completed+++++++++++++++
08-17, 11:58:23.439–>RPC server:SERVER_1 received following information:
Network Name: SERVER_1
Source Port: 50184
Partner LOG: SERVER_24796.log
Partner CID: GUID
++++++++++++Start Reverse Bind Test+++++++++++++
Received Bind call from SERVER_2
Network Name: SERVER_1
Source Port: 50184
Hosting Machine:SERVER_1
08-17, 11:58:23.542–>Trying to Reverse Bind to SERVER_2…
Test Guid:GUID
Name Resolution:
SERVER_2–>Y.Y.Y.Y–>SERVER_2.company.corp
Reverse Binding success: SERVER_1–>SERVER_2
++++++++++++Reverse Bind Test ENDED++++++++++
08-17, 11:58:23.651–>Called POKE from Partner:SERVER_2
Network Name: RSERVER_1
Source Port: 50184
Hosting Machine:SERVER_1
++++++++++++Start DTC Binding Test +++++++++++++
Trying Bind to SERVER_2
08-17, 11:58:23.653–>SERVER_1 Initiating DTC Binding Test…
Test Guid:GUID
Received reverse bind call from SERVER_2
Network Name: SERVER_1
Source Port: 50184
Hosting Machine:SERVER_1
Binding success: SERVER_1–>2
++++++++++++DTC Binding Test END+++++++++++++