Showing posts with label S4B. Show all posts

Tuesday, March 28, 2017

Skype for Business Failover/Failback Issue

I just wrapped up a week-long exercise with a client who had a complete failure of their VMware stack at their primary data center, resulting in the need to perform an emergency failover to their DR site. This client had a recent deployment and luckily was following all of the best practices and had current backups of everything, resulting in a fairly painless failover. The issue that we ran into was with the failback. When we attempted to fail the CMS back, replication stopped and the File Transfer Agent service would not start. We saw the following in the event log:

Log Name:      Lync Server
Source:        LS Master Replicator Agent Service
Date:          2/23/2017 8:16:07 PM
Event ID:      2035
Task Category: (2122)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Skype for Business Server 2015, Master Replicator Agent is trying to connect to a backend that whose state does not match with the service sate.

Service State: 
Backup Backend State: 
Active Backend Connection String 
densql01.contoso.com
Cause: Possible issues with back-end database.
Resolution:
Fix the topology so that it matches with the backend and publish.



Log Name:      Lync Server
Source:        LS File Transfer Agent Service
Date:          2/24/2017 1:49:30 AM
Event ID:      1040
Task Category: (1121)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Skype for Business Server 2015, File Transfer Agent service is stopping.

Reason: The service is trying to start as Active service but the backend it is trying to connect is in Backup state. Backend connection string: Data Source=densql01.contoso.com;
                Initial Catalog=xds;
                Integrated Security=True;
                Application Name=File Transfer Agent;Failover Partner=densql02.contoso.com;



Log Name:      Lync Server
Source:        LS Backup Service
Date:          2/24/2017 2:32:56 AM
Event ID:      4080
Task Category: (4000)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DalFE01.contoso.com
Description:
Skype for Business Server 2015, Backup Service central management backup module failed to complete export operation.

Configurations:
Backup Module Identity:CentralMgmt.CMSMaster
Working Directory path:\\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore\Temp
Local File Store Unc path:\\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore
Remote File Store Unc path:\\dencfile01.contoso.com\lyncshare\1-BackupService-6\BackupStore

Additional Message:
 Exception: Microsoft.Rtc.BackupService.ExportOperationException: Export operation (to zip archive \\dalcfile01.contoso.com\lyncshare\2-BackupService-6\BackupStore\Temp\z-CentralMgmt-f908fa8f-db02-4ab3-8338-17c30cf59a97.zip) is failed due to: Failed to execute stored procedure XdsQueryChangesForBackupReplica2. Native Error: 50000, Exception: ###50023:XdsQueryChangesForBackupReplica2:The central management store being accessed is not the active store. No data can be read or any changes can be made to this store.. Retriable: False. Cookie: <repl:Status xmlns:repl="urn:schema:Microsoft.Rtc.Management.Xds.ReplLayer.2008" FromMachine="CDCB9834-6AAC-43ab-8310-0D4D105EA23A" Supports="v1" ProductVersion="6.0.9319.0" />. ---> System.Data.SqlClient.SqlException: ###50023:XdsQueryChangesForBackupReplica2:The central management store being accessed is not the active store. No data can be read or any changes can be made to this store.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
  at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
   --- End of inner exception stack trace ---
   at Microsoft.Rtc.BackupService.BackupModules.XdsBackupModuleBase.QueryChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& isFullSync, ExportedDataStats& overallExportStats, Dictionary`2& queueExportStatsMap)
   at Microsoft.Rtc.BackupService.BackupModules.XdsBackupModuleBase.GetChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& isFullSync, ExportedDataStats& overallExportStats, Dictionary`2& queueExportStatsMap)
   at Microsoft.Rtc.BackupService.BackupModules.CentralMgmtBackupModule.GetChanges(Zipper zipper, String oldCookie, String& newCookie, Boolean& steadyState, Int32& numOfNewChanges, Nullable`1& numOfNewChangesFromTheOtherPool, Nullable`1& hasChangesSince, Boolean& forceSetErrorState, ChangesContext& context)
   at Microsoft.Rtc.BackupService.BackupModuleHandler.SendBackupDataTask.GetChanges(Boolean& steadyState, Int32& numOfNewChanges, Nullable`1& numOfNewChangesFromTheOtherPool, Nullable`1& hasChangesSince, Boolean& forceSetErrorState, ChangesContext& changesContext)
   at Microsoft.Rtc.BackupService.BackupModuleHandler.SendBackupDataTask.InternalExecute()
   at Microsoft.Rtc.Common.TaskManager`1.ExecuteTask(Object state)

Cause: Either network or permission issues. Please look through the exception details for more information.


I verified that the SCP value was pointing to the primary pool. However, when I logged into the SQL databases in both sites, the DbConfigInt table in the xds database reported the same values on each:

On the primary pool, the dbo.DbConfigInt values were:
Name Value
CurrentState 3
DbVersionSchema 10
DbVersionSproc 15
DbVersionUpgrade 4
IsXdsReadOnly 0

On the secondary pool, the dbo.DbConfigInt values were:
Name Value
CurrentState 3
DbVersionSchema 10
DbVersionSproc 15
DbVersionUpgrade 4
IsXdsReadOnly 0

A CurrentState of 3 means that the database is in a "backup" state and not primary, so both sites considered themselves the backup and neither was active. We then modified the SCP to point back to the secondary pool:

msRTCSIP-BackEndServer: changed to dalsql01.contoso.com
msRTCSIP-BackEndServermirror: changed to dalsql02.contoso.com

Then we modified the CMS database on the secondary pool's primary SQL server by using the following command:

Update [xds].[dbo].[DbConfigInt] Set Value=0 Where Name='CurrentState'
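
The state check that led to this fix can be sketched in a few lines of Python. This is a minimal sketch, not a Microsoft tool: the dictionaries stand in for rows read by hand from dbo.DbConfigInt, the function names are hypothetical, and the meaning of the state values (3 is backup per the events above, 0 is active as implied by the update we ran) is an assumption drawn from this incident only.

```python
# Hypothetical helper. The state meanings are assumptions based on this
# incident only: 3 was reported as "backup", and 0 is the value we wrote
# to make the secondary CMS active.
BACKUP_STATE = 3
ACTIVE_STATE = 0


def find_active_pools(config_by_pool):
    """Return the pools whose xds database reports the assumed active state."""
    return [
        pool
        for pool, rows in config_by_pool.items()
        if rows.get("CurrentState") == ACTIVE_STATE
    ]


def describe_cms_state(config_by_pool):
    """Summarize whether exactly one pool holds the active CMS."""
    active = find_active_pools(config_by_pool)
    if len(active) == 1:
        return "healthy: active CMS on " + active[0]
    if not active:
        return "broken: no pool reports an active CMS"
    return "broken: more than one pool reports an active CMS"


# Values observed during the failed failback: both databases were in backup.
observed = {
    "primary": {"CurrentState": 3, "IsXdsReadOnly": 0},
    "secondary": {"CurrentState": 3, "IsXdsReadOnly": 0},
}
print(describe_cms_state(observed))
```

Running the same check after the update on the secondary pool should report a single active CMS, which is the condition we wanted before publishing the topology.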

We then published the topology, and replication was working with the CMS on the secondary pool. We then failed the CMS back to the primary pool again, and this time it was successful. At this point the CMS was healthy, replication was working, and users were able to sign in and make/receive calls; however, users could not create new meetings. I started analyzing the FEs' event logs and ran across the following events:

Log Name:      Lync Server
Source:        LS User Store Sync Agent
Date:          2/24/2017 12:30:42 AM
Event ID:      57005
Task Category: (1061)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
Error encountered pushing data to RtcXds Blob Store

#CTX#{ctx:{traceId:1336022626, activityId:"199e5a7e-6a3c-4cde-82cb-3cf3694b01c2"}}#CTX#
Push cycle identifier: [DenFE01.contoso.com.2fd688f5-0f3a-407f-bab5-3fa5c3757443]
ItemCount: [20]
Error Message: [PushController: XdsPublishItems failed: System.Data.SqlClient.SqlException (0x80131904): ###50015:XdsPublishItems:Local write is not supported in system publications.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
ClientConnectionId:4f6d9a2e-01d4-4ca8-b449-2a194446cf67
Error Number:50000,State:1,Class:11]
Cause: Possible issues with back-end database.
Resolution:
Ensure the back-end is functioning correctly.


Log Name:      Lync Server
Source:        LS User Store Sync Agent
Date:          2/24/2017 12:30:42 AM
Event ID:      57006
Task Category: (1061)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      DenFE01.contoso.com
Description:
RtcDb Sync Agent sproc failed

#CTX#{ctx:{traceId:1336022626, activityId:"199e5a7e-6a3c-4cde-82cb-3cf3694b01c2"}}#CTX#
Sproc: [XdsPublishItems]
Exception: [System.Data.SqlClient.SqlException (0x80131904): ###50015:XdsPublishItems:Local write is not supported in system publications.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean asyncWrite)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteReader()
   at Microsoft.Rtc.Common.Data.DBCore.Execute(SprocContext sprocContext, SqlConnection sqlConnection, SqlTransaction sqlTransaction)
ClientConnectionId:4f6d9a2e-01d4-4ca8-b449-2a194446cf67
Error Number:50000,State:1,Class:11]

We then decided to drain services from one FE at a time and re-run Step 1 and Step 2 from the Deployment Wizard to reset the local SQL instance on each FE, followed by a reboot. After this process each FE came back up without issue and all functionality was restored.

Microsoft has confirmed that this is a bug, and I will try to update this post once a fix is released.

Thursday, February 18, 2016

How to Disable Interfaces on AudioCodes Mediant 1000

One of our clients recently rolled out AudioCodes Element Management System (EMS) and noticed that they were receiving a lot of alarms about interfaces being down. You might also see these alarms show up on the gateway management page:




I wasn't able to find much online about how to administratively down the interfaces or disable the alarms on each gateway, so I opened a support ticket, figured it out, and thought I should post this in case anyone else out there also needs to do this.

First, log in to your gateway and determine which interface you want to disable the alarm on. The interfaces are read along the top row beginning with GB_0_1 on the left, followed by GB_0_2, GB_0_3, GB_0_4, and so on. If you have another row of interfaces, then it would start at GB_X_1, with X being 1-9.
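
If you need to write down the names for several ports, the naming pattern described above can be sketched in Python. This is only an illustration of the pattern: the function names are hypothetical, and the number of ports per row is an assumption that depends on your hardware.

```python
def interface_name(row, position):
    """Build a name like GB_0_1 from a zero-based row and a one-based position."""
    if row < 0 or position < 1:
        raise ValueError("row must be >= 0 and position must be >= 1")
    return "GB_{}_{}".format(row, position)


def interfaces_in_row(row, count):
    """List the interface names in one row, read from left to right."""
    return [interface_name(row, position) for position in range(1, count + 1)]


# Assumes four ports in the top row; adjust the count for your gateway.
print(interfaces_in_row(0, 4))
```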




Once you have written down which interface you want to remove, expand VoIP -> Network -> and select Ethernet Groups Table:




Select Index 0 (or whichever index has the interface under the member column) and then click edit:




In the edit record window click the drop down of the member you want to remove, and change it to none:




Click submit, and your changes should show the Index as no longer having that interface listed:




You will then need to restart the gateway for the changes to take effect.

Friday, February 5, 2016

Skype for Business Hybrid Remote PowerShell

I recently started working on a couple of hybrid deployments, both internally and for clients. One of the first things that I noticed was that it was not as straightforward to get connected to remote PowerShell as it was for Azure AD or Exchange Online. The first thing to note is that if you are in a hybrid and you have your lyncdiscover.domain.com pointed to your on-premises environment, you will be greeted with the following error:



Get-CsPowerShellEndpoint : Unable to connect to the remote server
At C:\Program Files\Common Files\Skype for Business
Online\Modules\SkypeOnlineConnector\SkypeOnlineConnectorStartup.psm1:94 char:26
+             $targetUri = Get-CsPowerShellEndpoint -TargetDomain $adminDomain
+                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Get-CsPowerShellEndpoint], WebException
    + FullyQualifiedErrorId : System.Net.WebException,Microsoft.Rtc.Management.OnlineConnector.GetPowerShellEndpointCmdlet

Normally the workaround for this is to specify the -OverrideAdminDomain switch with your tenant domain. However, I have recently learned that this does not always work. When I tried it, I was greeted with the following error:


New-PSSession : [admin0b.online.lync.com] Processing data from remote server admin0b.online.lync.com failed with the
following error message: The specified tenant 'spscom.onmicrosoft.com' could not be found in current forest. Please
verify the tenant Identity and then try again. For more information, see the about_Remote_Troubleshooting Help topic.
At C:\Program Files\Common Files\Skype for Business
Online\Modules\SkypeOnlineConnector\SkypeOnlineConnectorStartup.psm1:118 char:16
+     $session = New-PSSession -ConnectionUri $ConnectionUri.Uri -Credential $webt ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OpenError: (System.Manageme....RemoteRunspace:RemoteRunspace) [New-PSSession], PSRemotingTransportException
    + FullyQualifiedErrorId : IncorrectProtocolVersion,PSSessionOpenFailed


I opened a ticket with Microsoft and we were able to get connectivity to work by specifying the -OverridePowerShellUri parameter and using the same URL that you use to access the control panel within O365:

New-CsOnlineSession -Credential $cred -OverridePowerShellUri "https://admin2a.online.lync.com/OcsPowershellLiveid"

We escalated this issue to the product group, which responded with the following:

There is a known issue currently where DomainUrlMap (what gets used for Autodiscovery) is only being populated with the domains of online enabled users. While our tenant does have some online enabled users, it would appear that those users are all on spscom.com – Autodiscover doesn’t know about the spscom.onmicrosoft.com domain so you get routed somewhat randomly when trying to resolve that domain.

There are two workarounds – 1) you could enable a user for spscom.onmicrosoft.com and subsequently disable it, once the domain is in the DomainUrlMap it should remain there, or 2) use “-OverrideAdminDomain spscom.com”, which is already in the DomainUrlMap.

Solution:

I created a new cloud-only user with an onmicrosoft.com UPN, licensed them for Skype for Business Online, and was then able to successfully access remote PowerShell:


You can then remove the cloud-only user; it is only needed to add the onmicrosoft.com domain to the DomainUrlMap.

Tuesday, December 29, 2015

Limitations on Transferring Conference Calls

A client recently reported that they were no longer able to transfer conference calls to their mobile phone. I invited them to a conference to test, and I was able to transfer the call; however, during a screen sharing session with them I confirmed that they were not able to. Upon logging into their environment with a test account, I confirmed that the test account was also unable to transfer conference calls:


We then tested and were able to transfer a call when it was a peer-to-peer session. This made me think that it had to be due to a conferencing or voice policy. I checked and found no difference between my configuration and the client’s. I spent some time reproducing the issue and looking at each of the client’s .UCCAPI logs, thinking it was something being denied in the in-band provisioning; however, nothing jumped out.

I finally noticed that it was only allowing me to transfer the conference call to a mobile phone and not to any other user like it would in the peer-to-peer session. I immediately checked the Phones list on the client that was not working and all the fields were blank:



Upon configuring the client with a Mobile phone:



I was then able to transfer the conference call to my mobile:

Solution:


In order to transfer conference calls to another phone, the phone number must be associated with the user’s account either by one of the telephone fields in AD, or by configuring the number within the client itself. 

Monday, December 21, 2015

Skype for Business UI Transfer Call Bug

After working on an issue for about half a year and finally finding a workaround for it, I wanted to share it with anyone else who might be having the same issue. Currently there is a known bug within the Skype for Business UI that does not allow users to transfer calls to other phone numbers or voicemail listed on someone’s contact card.

For example, if I were to receive a call and wanted to transfer it directly to one of my co-workers’ voice mail, I would normally initiate a transfer, search for the recipient, and right-click their name to get a list of alternate call options from their contact card:




When you select Voice Mail or any other number listed, the client does nothing.


SOLUTION:

A workaround for this issue is to force the Lync 2013 UI for the user with a client policy (for example, by setting EnableSkypeUI to $false on the client policy). With the Lync 2013 UI, this functionality still works:




I currently have a Microsoft Premier ticket open and I will update this post once the fix has been published. 

UPDATE: This has been addressed with the February 2016 client CU:
https://support.microsoft.com/en-us/kb/3114732

Friday, December 11, 2015

Federating Lync 2010 Hybrid with Skype for Business Online

I had an interesting case today where a client who was running a Lync 2010 hybrid with O365 Skype for Business Online reported that federated partners who were also using Skype for Business Online could not IM, call, screen share, or see presence. My initial reaction was to check whether they had configured their on-premises federated domains to use the “sipfed.online.lync.com” proxy FQDN, as both Phil Sharp and our Tom Pacyk had blogged about issues with Lync 2013 when you have the domain configured as both an Edge Server and Hosted Provider. Sure enough, they had a couple of domains configured this way. I used the following command to update all of the domains so that they no longer specify sipfed.online.lync.com as a proxy FQDN:

Get-CsAllowedDomain | Where {$_.ProxyFqdn -eq "sipfed.online.lync.com"} | Set-CsAllowedDomain -ProxyFqdn $null

I forced replication, waited 15 minutes, and re-tested, but still had no luck. I then pulled the client logs and noticed that my messages were resulting in a 504 Server Time-Out:



The entire message did not give me much to go on:



Server: IncomingFederation/6.0.0.0
ms-diagnostics: 1036;reason="Previous hop shared address space peer did not report diagnostic information";Domain="clientpartnerdomain.org";PeerServer="sipfed.online.lync.com";source="sip-na.clientdomain.com"
ms-edge-proxy-message-trust: ms-source-type=AuthorizedServer;ms-ep-fqdn=na1.clientdomain.com;ms-source-network=federation;ms-source-verified-user=verified

I then decided to collect SIP and S4 traces from the edge server while attempting to IM a user on S4B Online. The trace also did not provide much information, other than that the message was being routed correctly but would time out once it reached O365:



At this point I felt that this had to be something with O365’s Skype for Business settings and not an issue with our client’s on premises configuration. So I checked the portal’s settings and they had configured “On only for allowed domains” and enabled “Let people use Skype for Business to communicate with Skype users outside your organization”.



So federation was enabled, but only for specific domains, and the list of allowed domains was empty. I added clientpartnerdomain.org to the list, waited 30 minutes, and sure enough it worked!


SOLUTION:


If you have a hybrid configuration, make sure that your on-premises allowed domains are also listed in your O365 tenant!
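
One way to catch this ahead of time is to compare the two lists. The sketch below is a minimal Python illustration and does not talk to Lync or O365: it assumes you have already copied the on-premises allowed domains (for example, the Domain values from Get-CsAllowedDomain) and the tenant's allowed domains into plain lists, and the function name and the example.com entry are hypothetical. It only applies when the tenant is set to "On only for allowed domains".

```python
def missing_from_tenant(on_prem_domains, tenant_domains):
    """Return on-premises allowed domains that the tenant list lacks.

    The comparison is case-insensitive because DNS names are.
    """
    tenant = {domain.strip().lower() for domain in tenant_domains}
    on_prem = {domain.strip().lower() for domain in on_prem_domains}
    return sorted(on_prem - tenant)


# The tenant list was empty in this case, so every domain is reported.
on_prem = ["clientpartnerdomain.org", "Example.com"]
tenant = []
print(missing_from_tenant(on_prem, tenant))
```

Any domain the function returns is a candidate to add to the tenant's allowed list.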

Thursday, September 24, 2015

The Never Ending Call Forward

Recently I had a client report that a user who went on vacation had configured their call forwarding settings to forward all of their calls to the office receptionist while they were out. When they returned they disabled call forwarding; however, all of their calls were still being forwarded to the receptionist. This is not the first time I have heard of this issue, so I began by taking a look at their call forwarding settings using SEFAUtil, as sometimes the client and the pool won't sync and it takes forcibly changing the setting via SEFAUtil to fix it. I pulled the user's config; however, there was nothing set:


I then tried using Anthony Caragol's Call Forwarding Script just to double check and see if maybe SEFAUtil wasn't reporting something correctly but it reported the same thing:



I thought that this was odd, so I decided to reproduce the issue and pull a trace from the FE pool. Doing so, I saw that the call rang the user and was then responded to with a PRAC and the forwarding user's identity:


Looking at the closest traces yielded nothing, so I decided to also trace from the SBA that the user was registered to. This showed that an "Added P-Asserted-Identity from EPID" message for ashirey@clientdomain.com was being sent, meaning that an endpoint was causing the call forwarding. This also was not unheard of; however, I found it strange that her client was showing no call forwarding and yet it was an endpoint. So I asked the user to sign out of her client, wait 1 minute, then sign back in and re-test. This yielded the same behavior and the same message in the trace.

SOLUTION:
I was starting to run out of ideas, so I had one last thought: why not have the user that was receiving the forwarded calls also sign out? After having both users sign out and remain signed out for 5 minutes, then sign back in, the call forwarding was finally disabled.

So when in doubt while troubleshooting call forwarding, have all of the users involved sign out.

Tuesday, September 22, 2015

Lesson Learned: Issues with Signature Algorithms

Over the weekend we had a client whose Exchange and Lync internally signed certificates expired. On Sunday I went through their entire deployment and replaced all of the certificates with new certs issued by their PKI. 

Monday morning, they reported that their Lync clients were working; however, none of their phones were able to sign in. I immediately realized I had not replaced the certificate on their F5 and did not have the management IP and credentials. The IT person who had this information was on a plane and would be out of pocket for a couple of hours. Therefore, in order to avoid an extended outage, I attempted to change DNS for the VIP that I thought their phones (CX600) were using to point directly to one of the FEs and skip the F5. After making the change, they reported that all of their phones were still down. We could not get logs from the phones, so I thought that this still had to be somehow connected with the F5 but could not be certain due to the lack of logs.

Luckily, I was able to obtain the information for the F5 from a colleague and began the certificate replacement process; however, the F5 would NOT accept the certificate. I reached out to one of our MVPs, Jeff Guillet, and we were still not able to get it to take the certificate. I then escalated to F5 support, at which point we attempted to export and import the certificate in a multitude of different ways. We tried the certificate by itself, with no extended properties, importing via text file, and importing via CLI; nothing seemed to work. When we pulled a packet capture, we saw the client hello; however, we did not see a server hello in response:

Running via CLI: tcpdump -n -i 0.0:nnn -s0 -w /var/tmp/1-1475239048.pcap host 192.168.2.22 or 192.168.2.47 -vvv
Output:




We then ran an OpenSSL command on the F5 that dumps the certificate information when a connection to the VIP is made. This resulted in no certificate being sent:

[admin@sac-f5-02:Active:Changes Pending] ~ # openssl s_client -connect 192.168.2.148:443


At this point, we swapped the old expired certificate back in and verified that we could connect with a certificate warning. Running the same command showed the old cert and chain:

[admin@sac-f5-02:Active:Changes Pending] ~ # openssl s_client -connect 192.168.2.148:443


We then attempted a couple other variations of importing and exporting the certificate. We enabled debug logging on the SSL components, and then dumped the SSL log to the CLI:

tmsh modify /sys db log.ssl.level value Debug
tailf /var/log/ltm |grep -i 'ssl'

However, this resulted in nothing showing up. I verified that logging was working by hitting another one of the VIPs, and that connection showed up in the logs. We then rebooted the passive F5 and failed over to that unit once it came back online, in an attempt at answering the age-old question “Did you reboot?”; however, this also did not make a change. We once again tried a series of imports and exports on the unit just to make sure it wasn’t a combination of the reboot, failover, and importing. No luck.

We tried one other command that essentially makes a connection and then dumps the output of that connection:



At this point, our client had been without phones for a little more than half the day, and we had already escalated at F5 and had three of their support engineers on the call. They sent out an all-support announcement, as we had stumped most of their support staff and engineering was also out of ideas. Finally someone got back to them and asked, “What signature algorithm was being used?” We immediately pulled the certificate information from the F5:

openssl x509 -in /config/filestore/files_d/Common_d/certificate_d/\:Common\:Lync2013-Web-int-2015-V4.crt_51169_1  -noout -text


We responded to the individual who asked, who then brought it to our attention that F5 does not support the RSASSA-PSS algorithm. We were able to find a posting on F5’s support forums that described similar output from another user when using RSASSA-PSS:


We wondered why this had suddenly started occurring. We had recently migrated their PKI from a single root/issuing server to a two-tier PKI; however, it was supposed to be a 1:1 migration, and no settings or configuration were to be changed outside of making it two tiers. We decided to check a certificate issued by their old root/issuing CA:


Then looking at all the certificates issued by the new intermediate/issuing CA:


A quick search on the internet also showed that Adobe, Citrix, Cisco, Firefox, and VMware all do not support this algorithm and/or have various issues with its use. Various blog posts and forum entries suggested that you had to rebuild your PKI if this was the case. At this point we thought that we had two options: purchase a third-party certificate for the F5 with just the pool name, or bypass SSL on the F5. After informing the client, they elected to go with a GoDaddy certificate. After obtaining, installing, and verifying that it worked, we asked the client to test a phone. They reported back that the phones were still not able to sign in. So we then pulled the DHCP options and, lo and behold, they were pointing to the FE pool directly and not the F5. So all of this work on the F5, while important, was not the root cause of the phone issue. I immediately thought, well, if all of these companies and their devices don’t support RSASSA-PSS, then maybe the phones don’t either. Sure enough, the Polycom CX line of phones does NOT support it!

EDIT: I have also been informed that the VVX line running at least 5.3.1 also does not support RSASSA-PSS.

We knew at this point we were looking at having to change their PKI, because we needed a .local name on their FE pool’s internal cert and could not obtain that from GoDaddy. We opened up a ticket with Microsoft PSS and, while waiting on the call back, started looking deeper into how the PKI was set up. We noticed that their root CA and intermediate/issuing CA certificates were both using the sha1RSA signature algorithm and not RSASSA-PSS:


We thought this was strange: why was the intermediate CA issuing certs with a different signature algorithm than the one its own cert was using? We attempted cloning the web server template and selecting different cryptography providers; however, this also did not work. We did a bit more research and noticed that one of the common “resolutions” when rebuilding the PKI was to disable alternatesignaturealgorithm by setting its value in the registry to 0. Because the root and intermediate CAs were using sha1RSA, we decided to just try disabling it on the intermediate.

Solution:
We made the change using the following command followed by restarting the certificate service:

certutil -setreg csp\alternatesignaturealgorithm 0
net stop certsvc && net start certsvc

We then reissued the FE pool cert and lo and behold we finally had a certificate using an acceptable signature algorithm:



We immediately assigned the Lync FE pool to use this new cert and were able to confirm with the client that their phones were able to sign in!

Lesson Learned: Check your signature algorithm when migrating PKIs, and whatever algorithm you use, check that your devices and applications are compatible with it!
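
For anyone who wants to script that check, here is a minimal Python sketch, not something we used during this engagement. It assumes you have a text dump of the certificate (for example, output from a tool such as certutil -dump saved to a file) and only looks for well-known signature algorithm OIDs anywhere in that text, so it does not depend on the exact layout of the dump. The function names and the sample line are made up for illustration.

```python
import re

# Well-known PKCS #1 signature algorithm OIDs and the short names Windows shows for them.
SIGNATURE_OIDS = {
    "1.2.840.113549.1.1.5": "sha1RSA",
    "1.2.840.113549.1.1.10": "RSASSA-PSS",
    "1.2.840.113549.1.1.11": "sha256RSA",
}

# Matches dotted-decimal strings such as 1.2.840.113549.1.1.10.
OID_PATTERN = re.compile(r"\b\d+(?:\.\d+)+\b")


def signature_algorithms(dump_text):
    """Return the known signature algorithm names found in the text, in order, without duplicates."""
    found = []
    for oid in OID_PATTERN.findall(dump_text):
        name = SIGNATURE_OIDS.get(oid)
        if name and name not in found:
            found.append(name)
    return found


def uses_pss(dump_text):
    """True if the text mentions the RSASSA-PSS signature algorithm OID."""
    return "RSASSA-PSS" in signature_algorithms(dump_text)


if __name__ == "__main__":
    # Hypothetical sample line; a real dump will contain much more text.
    sample = "Algorithm ObjectId: 1.2.840.113549.1.1.10 RSASSA-PSS"
    print(signature_algorithms(sample), uses_pss(sample))
```

If uses_pss returns True for a cert that phones or other devices need to trust, that cert is worth a second look before you deploy it.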

A big thanks to Jeff Guillet, Rick Steele, and Scott Winslow for assisting in this effort!

Friday, August 14, 2015

Script: Pool Objects

After my issue this week with attempting to decommission a Lync 2013 pool due to a lingering object, I decided to write a simple script that takes the name of the pool you are attempting to remove, runs through all possible object types, and outputs any objects that are still attached to the pool.

Download OrphanedObjects.ps1

I have tested and confirmed that this works on Lync 2013 and S4B. It does NOT work on Lync 2010, but if I get enough interest in a 2010 version I could modify it to work.

Searching for Orphaned Objects

Recently when decommissioning a Lync 2013 pool, I ran into an issue where it said that by publishing the topology and removing the pool I would orphan existing users, endpoints, or devices:




Consult your Skype for Business Server documentation to learn how to move or disable objects still homed on the pool. To find those objects, execute the following cmdLets: Get-CsUser, Get-CsExUmContact, Get-CsCommonAreaPhone, Get-CsAnalogDevice, Get-CsRgsWorkflow, Get-CsDialInConferencingAccessNumber, Get-CsAudioTestServiceApplication, Get-CsTrustedApplicationEndpoint, Get-CsPersistentChatEndpoint.

Now I knew there were more objects that could trigger the orphaned objects failure, so I also ran Get-CsConferenceDirectory, Get-CsTrustedApplication, Get-CsTrustedApplicationComputer, Get-CsTrustedApplicationEndpoint, Get-CsTrustedApplicationPool. The only result I got was a CsAudioTestServiceApplication:



I attempted to remove this; however, there is no remove cmdlet for it, and there is no option to set its registrar pool with Set-CsAudioTestServiceApplication.

I confirmed with one of my co-workers that it was not possible to remove this, however it should not be an object that prevents the removal of a pool. So what was causing the hold? 

Solution:
I decided to go into the weeds and dumped all user objects from AD to a CSV via CSVDE ( csvde -f test.csv -r objectClass=user ), and then searched for all users whose msRTCSIP-PrimaryHomeServer value matched the Lync 2013 server. Lo and behold, there was a Lync Room System object that was still homed to the Lync 2013 pool. After moving this object to the new S4B pool, I was able to publish the topology.
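
The search through the CSV can also be scripted. Below is a minimal Python sketch of that filtering step, assuming the export has a header row with a DN column and an msRTCSIP-PrimaryHomeServer column. The function name, the sample rows, and the marker string are all hypothetical; the value that identifies your old pool is specific to your environment, so check a known user first to see what it looks like.

```python
import csv
import io

HOME_ATTR = "msRTCSIP-PrimaryHomeServer"

# Hypothetical export contents, for illustration only.
SAMPLE_EXPORT = (
    "DN,objectClass,msRTCSIP-PrimaryHomeServer\n"
    '"CN=Room1,OU=Rooms,DC=example,DC=com",user,'
    '"CN=Lc Services,CN=Microsoft,CN=1:1,CN=Pools,CN=RTC Service,DC=example,DC=com"\n'
    '"CN=Alice,OU=Users,DC=example,DC=com",user,'
    '"CN=Lc Services,CN=Microsoft,CN=1:2,CN=Pools,CN=RTC Service,DC=example,DC=com"\n'
    '"CN=Bob,OU=Users,DC=example,DC=com",user,\n'
)


def objects_homed_on(csv_file, pool_marker):
    """Yield (DN, home server value) for rows whose home server value contains pool_marker."""
    marker = pool_marker.lower()
    for row in csv.DictReader(csv_file):
        # Objects that are not enabled for Lync/S4B have an empty or missing value.
        home = row.get(HOME_ATTR) or ""
        if marker in home.lower():
            yield row.get("DN", ""), home


if __name__ == "__main__":
    # "CN=1:1," is an assumed marker for the old pool in this made-up data.
    for dn, home in objects_homed_on(io.StringIO(SAMPLE_EXPORT), "CN=1:1,"):
        print(dn)
```

To run it against a real export, pass an open file handle instead of the StringIO object (for example, open("test.csv", newline="")) and adjust the encoding to match how the file was written.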


Thursday, August 13, 2015

Lync 2013 to S4B CMS Migration Replication Issues

I recently moved the CMS from Lync 2013 to a new S4B pool for a project I was working on. I followed the normal procedure and re-ran the bootstrapper on all the nodes that make up the new pool hosting the CMS, as well as on the old Lync 2013 pool to remove the CMS role. I verified that the S4B Master Replicator Agent and File Transfer Agent services were running on all four of the new S4B nodes. I rebooted all four of the new S4B servers individually, and once that was complete I attempted to view the CMS replication status; however, it reported that nothing was updating. All entries showed UpToDate False, and all of them, except the node I ran the Move-CsManagementServer cmdlet from and the edge servers, showed a LastStatusReport from around the time I performed the move:




I verified that I was able to download the Topology; I could see that FE01 was the ActiveMaster of the CMS:



I ran a trace using the ManagementCore scenario and saw a couple of errors regarding the XDS-Replica folder:


This line stuck out particularly:

Query changes operation failed. Exception [System.UnauthorizedAccessException: Access to the path '\\S4BFENJ01.spscom.com\xds-replica\xds-master\xds-master\working\replication\tmp\0c112834-9f3e-49bc-ba01-fb0e4227e56e' is denied.

At this point I attempted to recreate the XDS-Replica folder by following Ken’s blog ( http://ucken.blogspot.com/2012/04/resetting-lync-cms-replication.htm ); however, this didn’t seem to solve it. I knew then that it had to be something to do with permissions/authentication. I checked and verified that all the servers in the new S4B pool were members of the RTCUniversalConfigReplicator group.

Solution:
I finally enabled CAPI2 logging and saw that an expired certificate was being passed. So I re-ran the deployment wizard and, sure enough, the OAuth certificate had expired. Renewing the certificate and restarting each server propagated the new OAuth certificate, and replication began to work.