SCOM Gateway communication errors

I want to share a problem I have been working on and finally managed to solve. The cause turned out to be a misconfiguration rather than a technical fault, but I learned a lot while working on it.

The environment consists of several domains in several forests. The Management Servers (5 in total) are hosted in DomainA in ForestA, and the Gateway servers are hosted in DomainB in ForestA. An external forest trust with selective authentication exists between the domains.

DomainB hosts 2 Gateway servers that should work together, sharing load and acting as a standby for one another. One of the GW servers stopped responding and was constantly logging the following errors:

EventID: 20057

Failed to initialize security context for target MSOMHSvc/MS1.DomainA.local The error returned is 0x80090303(No authority could be contacted for authentication.).  This error can apply to either the Kerberos or the SChannel package.

EventID: 21001

The OpsMgr Connector could not connect to MSOMHSvc/MS1.DomainA.local because mutual authentication failed. Verify the SPN is properly registered on the server and that, if the server is in a separate domain, there is a full-trust relationship between the two domains.

EventID: 20071

The OpsMgr Connector connected to MS1.DomainA.local, but the connection was closed immediately without authentication taking place.  The most likely cause of this error is a failure to authenticate either this agent or the server.  Check the event log on the server and on the agent for events which indicate a failure to authenticate.

EventID: 21016

OpsMgr was unable to set up a communications channel to MS1.DomainA.local and there are no failover hosts.  Communication will resume when opsmgr.company.com is available and communication from this computer is allowed.

At first it looked like an issue with the SPNs, and in fact the SPNs were misconfigured. I used Kevin's post to reconfigure the SPNs properly, but that did not solve the problem. I then stopped to think about it and realized that the SPNs could not be the issue: since the GW servers are hosted in a separate domain, authentication has to work over certificates and not Kerberos.

This led me to check the certificates' validity periods on both ends (which were fine) and whether the certificates were installed. The best way to check that a certificate is installed for SCOM is to look in the registry:

HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings

The first thing I noticed there was that MS1.DomainA.Local was missing the ChannelCertificateHash value, meaning no certificate had been imported. The certificate existed in the local certificate store but had not been imported into SCOM. Using the MOMCertImport tool, I imported the proper certificate, which immediately solved the problem, and the GW started communicating with its MS.
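
The registry check described above can also be scripted. What follows is only a minimal Python sketch, not a supported tool. It assumes the ChannelCertificateHash value holds the certificate thumbprint as a hex string, which I have not verified for every SCOM version. The function names and the sample thumbprint are made up for illustration.

```python
import string
import sys

# Registry location quoted in the post; the value name is the one found missing.
KEY_PATH = r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Machine Settings"
VALUE_NAME = "ChannelCertificateHash"


def normalize_thumbprint(text):
    """Keep only hex digits and uppercase them, so 'ab cd:ef' becomes 'ABCDEF'."""
    return "".join(ch for ch in text if ch in string.hexdigits).upper()


def read_channel_cert_hash():
    """Return the stored value, or None if the key or value is missing (Windows only)."""
    import winreg  # imported lazily so the helpers above work on any platform

    # KEY_WOW64_64KEY avoids registry redirection when run from 32-bit Python.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, access) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    except FileNotFoundError:
        return None
    # Assumption: the value is a hex string; raw bytes are hex-encoded as a fallback.
    return value.hex() if isinstance(value, bytes) else str(value)


def cert_import_status(stored, expected_thumbprint):
    """Classify the registry state as 'missing', 'match', or 'mismatch'."""
    if not stored:
        return "missing"
    if normalize_thumbprint(stored) == normalize_thumbprint(expected_thumbprint):
        return "match"
    return "mismatch"


if __name__ == "__main__" and sys.platform == "win32":
    # Hypothetical thumbprint; replace it with the one shown in the certificate store.
    expected = "00 11 22 33 44 55 66 77 88 99 aa bb cc dd ee ff 00 11 22 33"
    print(cert_import_status(read_channel_cert_hash(), expected))
```

A result of "missing" corresponds to the situation in this post: the certificate may sit in the local store, but it was never imported into SCOM. A "mismatch" would suggest that a different certificate was imported than the one you expected.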
