A DAG member was removed from the cluster
The other day I had to deal with an odd error on one of my client's DAG servers, and I want to share the solution with you in case you ever run into this error.
The scenario is this: I have two servers acting as Mailbox servers that are part of a DAG, and two more servers acting as CAS/HUB servers. All servers are located on the same site.
Everything was working well for quite some time, then suddenly one of the DAG members started throwing odd messages like:
EventID 1135: Cluster node ‘XXX’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
EventID 1049: File share witness resource ‘File Share Witness (\\CAS\quorum)’ failed to arbitrate for the file share ‘\\NAS\quorum’. Please ensure that file share ‘\\NAS\quorum’ exists and is accessible by the cluster.
EventID 1069: Cluster resource ‘File Share Witness (\\CAS\quorum)’ in clustered service or application ‘Cluster Group’ failed.
EventID 1564: File share witness resource ‘File Share Witness (\\CAS\quorum)’ failed to arbitrate for the file share ‘\\CAS\quorum’. Please ensure that file share ‘\\CAS\quorum’ exists and is accessible by the cluster.
As a result, the server stopped replicating its databases.
I verified that the server could in fact access the FSW share, that it is a member of the ‘Exchange Trusted Subsystem’ group, and that everything else was working just fine.
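If you are hitting the same symptoms, the witness share can be sanity-checked from the DAG member before tearing anything down. A minimal sketch, using the share path from the event log above (the test file name is just an example):

```powershell
# Check that the witness share is reachable from this node
Test-Path \\NAS\quorum

# Check that it is writable, too (the file name is arbitrary)
New-Item \\NAS\quorum\writetest.txt -ItemType File
Remove-Item \\NAS\quorum\writetest.txt
```

In my case all of this succeeded, which is what made the errors so odd.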
Nothing I tried worked: restarting the server, reconfiguring the DAG, even removing the faulting server from the DAG.
After racking my brain for a while, I decided to remove the DAG entirely and reconfigure it from scratch. To do so, I had to do the following:
1. Start by saving the entire DAG configuration to a text file – Get-DatabaseAvailabilityGroup | Format-List >> C:\DAG_Settings.txt
2. Remove the database copies on the faulting server – Get-MailboxDatabaseCopyStatus -Server BADSERVER | Remove-MailboxDatabaseCopy
3. Remove the faulting server from the DAG – Remove-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer BADSERVER -ConfigurationOnly. Notice the -ConfigurationOnly switch; without it, the unreachable server will not be removed.
4. Evict the server from the failover cluster – Get-ClusterNode BADSERVER | Remove-ClusterNode -Force
5. Clear the leftover cluster configuration from the faulting server (run this locally on BADSERVER, since after the previous step it is no longer a cluster member) – Clear-ClusterNode -Force
6. To be able to remove the entire DAG configuration, I also had to remove the working server from the DAG – Remove-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer GOODSERVER -ConfigurationOnly. Again with the -ConfigurationOnly switch.
7. Now I had to completely destroy the DAG – Get-DatabaseAvailabilityGroup DAGNAME | Remove-DatabaseAvailabilityGroup. Remember that you also have to delete the DAG's computer account in Active Directory.
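Put together, the teardown can be sketched as one pass from the Exchange Management Shell. The server and DAG names are placeholders from this post, and -Confirm:$false merely suppresses the prompts:

```powershell
# 1. Save the current DAG configuration for later reference
Get-DatabaseAvailabilityGroup | Format-List >> C:\DAG_Settings.txt

# 2. Remove the database copies hosted on the faulting server
Get-MailboxDatabaseCopyStatus -Server BADSERVER | Remove-MailboxDatabaseCopy -Confirm:$false

# 3-4. Evict the faulting server from the DAG and from the failover cluster
Remove-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer BADSERVER -ConfigurationOnly
Get-ClusterNode BADSERVER | Remove-ClusterNode -Force

# 5. Run locally on BADSERVER to wipe its leftover cluster state
Clear-ClusterNode -Force

# 6-7. Pull the healthy server out, then destroy the now-empty DAG
Remove-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer GOODSERVER -ConfigurationOnly
Remove-DatabaseAvailabilityGroup -Identity DAGNAME
# ...then delete the DAG computer account in Active Directory
```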
Once everything was done, I reconstructed the DAG:
1. New-DatabaseAvailabilityGroup -Name DAGNAME -WitnessServer CAS -WitnessDirectory C:\DAG1. I used the same FSW as before (taken from the DAG_Settings.txt file saved earlier).
2. Set-DatabaseAvailabilityGroup -Identity DAGNAME -DatabaseAvailabilityGroupIpAddresses 126.96.36.199. Again, the same address as before.
3. I then added the working server (the one with the mounted databases) to the DAG – Add-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer GOODSERVER
4. Then finally, add the server that started this whole mess, using the same command with BADSERVER.
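The whole rebuild, under the same placeholder names, comes down to four lines. This is a sketch that uses Set-DatabaseAvailabilityGroup to apply the saved IP to the newly created DAG; the witness and IP values are the ones recorded in DAG_Settings.txt:

```powershell
# Recreate the DAG with the original witness, then restore its original IP
New-DatabaseAvailabilityGroup -Name DAGNAME -WitnessServer CAS -WitnessDirectory C:\DAG1
Set-DatabaseAvailabilityGroup -Identity DAGNAME -DatabaseAvailabilityGroupIpAddresses 126.96.36.199

# Add the healthy server first (it still hosts the mounted databases), then the repaired one
Add-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer GOODSERVER
Add-DatabaseAvailabilityGroupServer -Identity DAGNAME -MailboxServer BADSERVER
```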
Now both servers are part of the same working DAG; the next step is to re-replicate the databases.
Since the databases (and their logs) were still on the server and I hadn't deleted them, I used the following command to re-add each replication copy – Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer BADSERVER -SeedingPostponed. I had to run this command against every database I had. With -SeedingPostponed, the server knows it should replicate only the delta log files instead of reseeding the entire database, which in my case finished pretty fast, and I was once again able to move active databases from one server to the other.
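Since the copy has to be re-added per database, a short loop saves the repetition. A sketch, assuming every database mounted on GOODSERVER should get a copy on BADSERVER:

```powershell
# Re-add a copy of every database to the repaired server without reseeding;
# the existing EDB and log files are reused and only the delta logs replicate.
Get-MailboxDatabase -Server GOODSERVER | ForEach-Object {
    Add-MailboxDatabaseCopy -Identity $_.Name -MailboxServer BADSERVER -SeedingPostponed
}
```

Depending on the resulting copy state, a Resume-MailboxDatabaseCopy per copy may be needed afterwards to kick off log replay.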