Tuesday, March 28, 2017

Skype for Business Failover/Failback Issue

I just wrapped up a week long exercise with a client who had a complete failure of their VMware stack at their primary data center resulting in the need to perform an emergency failover to their DR site. This client had a recent deployment and luckily was following all of the best practices and had current backups of everything resulting in a fairly painless failover. The issue that we ran into was with the failback. When we attempted to fail the CMS back replication stopped and the file transfer service would not start. We saw the following in the event Log:

Log Name:      Lync Server
Source:        LS Master Replicator Agent Service
Date:          2/23/2017 8:16:07 PM
Event ID:      2035
Task Category: (2122)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Skype for Business Server 2015, Master Replicator Agent is trying to connect to a backend that whose state does not match with the service sate.

Service State: 
Backup Backend State: 
Active Backend Connection String 
densql01.contoso.com
Cause: Possible issues with back-end database.
Resolution:
Fix the topology so that it matches with the backend and publish.



Log Name:      Lync Server
Source:        LS File Transfer Agent Service
Date:          2/24/2017 1:49:30 AM
Event ID:      1040
Task Category: (1121)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Skype for Business Server 2015, File Transfer Agent service is stopping.

Reason: The service is trying to start as Active service but the backend it is trying to connect is in Backup state. Backend connection string: Data Source=densql01.contoso.com;
                Initial Catalog=xds;
                Integrated Security=True;
                Application Name=File Transfer Agent;Failover Partner=densql02.contoso.com;



Log Name:      Lync Server
Source:        LS Backup Service
Date:          2/24/2017 2:32:56 AM
Event ID:      4080
Task Category: (4000)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DalFE01.contoso.com
Description:
Skype for Business Server 2015, Backup Service central management backup module failed to complete export operation.

Configurations:
Backup Module Identity:CentralMgmt.CMSMaster
Working Directory path:\\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore\Temp
Local File Store Unc path:\\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore
Remote File Store Unc path:\\dencfile01.contoso.com\lyncshare\1-BackupService-6\BackupStore

Additional Message:
 Exception: Microsoft.Rtc.BackupService.ExportOperationException: Export operation (to zip archive \\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore\Temp\z-CentralMgmt-f908fa8f-db02-4ab3-8338-17c30cf59a97.zip) is failed due to: Failed to execute stored procedure XdsQueryChangesForBackupReplica2. Native Error: 50000, Exception: ###50023:XdsQueryChangesForBackupReplica2:The central management store being accessed is not the active store. No data can be read or any changes can be made to this store.. Retriable: False. Cookie: <repl:Status xmlns:repl="urn:schema:Microsoft.Rtc.Management.Xds.ReplLayer.2008" FromMachine="CDCB9834-6AAC-43ab-8310-0D4D105EA23A" Supports="v1" ProductVersion="6.0.9319.0" />. ---> System.Data.SqlClient.SqlException: ###50023:XdsQueryChangesForBackupReplica2:The central management store being accessed is not the active store. No data can be read or any changes can be made to this store.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
  at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
   --- End of inner exception stack trace ---
   at Microsoft.Rtc.BackupService.BackupModules.XdsBackupModuleBase.QueryChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& isFullSync, ExportedDataStats& overallExportStats, Dictionary`2& queueExportStatsMap)
   at Microsoft.Rtc.BackupService.BackupModules.XdsBackupModuleBase.GetChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& isFullSync, ExportedDataStats& overallExportStats, Dictionary`2& queueExportStatsMap)
   at Microsoft.Rtc.BackupService.BackupModules.CentralMgmtBackupModule.GetChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& steadyState, Int32& numOfNewChanges, Nullable`1& numOfNewChangesFromTheOtherPool, Nullable`1& hasChangesSince, Boolean& forceSetErrorState, ChangesContext& context)
   at Microsoft.Rtc.BackupService.BackupModuleHandler.SendBackupDataTask.GetChanges(Boolean& steadyState, Int32& numOfNewChanges, Nullable`1& numOfNewChangesFromTheOtherPool, Nullable`1& hasChangesSince, Boolean& forceSetErrorState, ChangesContext& changesContext)
   at Microsoft.Rtc.BackupService.BackupModuleHandler.SendBackupDataTask.InternalExecute()
   at Microsoft.Rtc.Common.TaskManager`1.ExecuteTask(Object state)

Cause: Either network or permission issues. Please look through the exception details for more information.


So I verified that the SCP value was pointing to the primary pool, I logged into the SQL DB's in both sites however and the XDS DBConfigInt table both reported:

On the primary pool the dbo.configint value:
Name Value
CurrentState 3
DbVersionSchema 10
DbVersionSproc 15
DbVersionUpgrade 4
IsXdsReadOnly 0

On the seconday pool the dbo.configint value:
Name Value
CurrentState 3
DbVersionSchema 10
DbVersionSproc 15
DbVersionUpgrade 4
IsXdsReadOnly 0

CurrentState 3 means that they are in a "backup" state and not primary. So we then modified the SCP to point back to the secondary pool:

msRTCSIP-BackEndServer: changed to dalsql01.contoso.com
msRTCSIP-BackEndServermirror: changed to dalsql02.contoso.com

Then we modified the CMS database on the secondary pool's primary SQL server by using the following command:

Update [xds].[dbo].[DbConfigInt] Set Value=0 Where Name='CurrentState'

We then published topology, and replication was working with CMS on the secondary pool. We then re-failedback the CMS to the primary pool and this time it was successful. So at this point CMS was healthy, replication was working, users were able to sign in and make/receive calls, however users could not create new meetings. So I started analyzing the FE's event logs and ran across the following event:

Log Name:      Lync Server
Source:        LS User Store Sync Agent
Date:          2/24/2017 12:30:42 AM
Event ID:      57005
Task Category: (1061)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Error encountered pushing data to RtcXds Blob Store

#CTX#{ctx:{traceId:1336022626, activityId:"199e5a7e-6a3c-4cde-82cb-3cf3694b01c2"}}#CTX#
Push cycle identifier: [DenFE01.contoso.com.2fd688f5-0f3a-407f-bab5-3fa5c3757443]
ItemCount: [20]
Error Message: [PushController: XdsPublishItems failed: System.Data.SqlClient.SqlException (0x80131904): ###50015:XdsPublishItems:Local write is not supported in system publications.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
ClientConnectionId:4f6d9a2e-01d4-4ca8-b449-2a194446cf67
Error Number:50000,State:1,Class:11]
Cause: Possible issues with back-end database.
Resolution:
Ensure the back-end is functioning correctly.


Log Name:      Lync Server
Source:        LS User Store Sync Agent
Date:          2/24/2017 12:30:42 AM
Event ID:      57006
Task Category: (1061)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
RtcDb Sync Agent sproc failed

#CTX#{ctx:{traceId:1336022626, activityId:"199e5a7e-6a3c-4cde-82cb-3cf3694b01c2"}}#CTX#
Sproc: [XdsPublishItems]
Exception: [System.Data.SqlClient.SqlException (0x80131904): ###50015:XdsPublishItems:Local write is not supported in system publications.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
ClientConnectionId:4f6d9a2e-01d4-4ca8-b449-2a194446cf67
Error Number:50000,State:1,Class:11]

We then decided to drain services from one FE at time, and re-run Step 1 and Step 2 from the deployment wizard to reset the local SQL instance on each FE followed up by a reboot. After this process each FE came back up without issue and all functionally was restored.

Microsoft has confirmed that this is a bug and I will try to update this post once Microsoft releases a fix for this bug. 

No comments:

Post a Comment