Monday 24 February 2014

Do Lync 2010 clients dream of lyncdiscover records?

Lync client discovery should be a well-known subject, as it is widely documented. Most will know that the Lync 2010 client is (or was?) SRV record-oriented; that is, its preferred discovery methods are based on the following two records:

_sipinternaltls._tcp.sipdomain.com for internal Lync pool
_sip._tls.sipdomain.com for external Lync edge pool

whereas Lync 2013 family clients (with some differences in the Lync Windows Store app) are A record-oriented; that is, their preferred discovery methods are based on the following two records:

lyncdiscoverinternal.sipdomain.com for internal Lync pool
lyncdiscover.sipdomain.com for external Lync edge pool

It seems to be common knowledge that more recent Lync 2010 clients behave like Lync 2013 clients, supposedly because the ability to query lyncdiscoverinternal and lyncdiscover was added through an unspecified CU (or at least, I was unable to find specific information).

This seems to be confirmed by the following TechNet article and several other sources. Specifically:


"(...) Microsoft Lync 2010, Lync 2013, and Lync Mobile are similar in how the client finds and accesses services in Lync Server 2013 (...)" "(...) For all clients except for the Lync Windows Store app During DNS lookup, SRV records are queried and returned to the client in the following order:
  1. lyncdiscoverinternal.<domain>   A (host) record for the Autodiscover service on the internal Web services
  2. lyncdiscover.<domain>   A (host) record for the Autodiscover service on the external Web services
  3. _sipinternaltls._tcp.<domain>   SRV (service locator) record for internal TLS connections
  4. _sipinternal._tcp.<domain>   SRV (service locator) record for internal TCP connections (performed only if TCP is allowed)
  5. _sip._tls.<domain>   SRV (service locator) record for external TLS connections"
"(...) Cumulative updates to the desktop clients change the DNS location process from Lync Server 2010 (...)"

On a customer's site running Lync Server 2013, with an unusual mix of Lync 2010 and Lync 2013 desktop clients, however, a different behaviour was noted. The scenario was:

- Lyncdiscover and lyncdiscoverinternal records were deployed
- _sipinternaltls._tcp and _sip._tls records were NOT deployed

Lync 2010 clients were still able to perform server discovery and sign in both internally and externally, but it turned out that Lync 2010 discovery happened through the sip.domain.com A record (which existed on both sides of the split DNS) and not through lyncdiscover and lyncdiscoverinternal. This was discovered by chance: the sip.domain.com records were briefly removed during a SIP domain migration, and Lync 2010 client sign-in started failing.

A test seemed to further confirm this:
This is a Wireshark trace of a Lync 2013 client sign-in process, and it shows the expected and documented behaviour:



This is a trace of a Lync 2010 client (January 2014 update - 4.0.7577.4419): there is no sign of the client trying to query lyncdiscoverinternal and lyncdiscover; the SRV records are queried first, and the client then falls back to the sip.domain.com A record.



My takeaway was: ensure the _sipinternaltls._tcp and _sip._tls records are deployed whenever Lync 2010 clients are or may be part of the game :)
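To reason about which clients can sign in with a given set of DNS records, the lookup orders described in this post can be sketched as a small Python model. This is only an illustration: the Lync 2013 order is the one quoted from TechNet, the Lync 2010 order is simplified to the records named in this post (SRV records first, then the sip.<domain> A record), and the function names are hypothetical. It does not perform real DNS queries.

```python
from typing import List, Set, Tuple


def discovery_candidates(sip_domain: str, client: str) -> List[Tuple[str, str]]:
    """Return (record name, record type) pairs in the order described in this post."""
    if client == "2013":
        # Order quoted from the TechNet article.
        return [
            (f"lyncdiscoverinternal.{sip_domain}", "A"),
            (f"lyncdiscover.{sip_domain}", "A"),
            (f"_sipinternaltls._tcp.{sip_domain}", "SRV"),
            (f"_sipinternal._tcp.{sip_domain}", "SRV"),
            (f"_sip._tls.{sip_domain}", "SRV"),
        ]
    if client == "2010":
        # Simplified from the observed trace: SRV records first,
        # then fallback to the sip.<domain> A record.
        return [
            (f"_sipinternaltls._tcp.{sip_domain}", "SRV"),
            (f"_sip._tls.{sip_domain}", "SRV"),
            (f"sip.{sip_domain}", "A"),
        ]
    raise ValueError(f"unknown client version: {client!r}")


def usable_records(sip_domain: str, client: str, deployed: Set[str]) -> List[str]:
    """Return the candidate records that actually exist, in lookup order."""
    return [name for name, _ in discovery_candidates(sip_domain, client)
            if name in deployed]


# The customer scenario: only lyncdiscover* and sip.<domain> were deployed.
deployed = {
    "lyncdiscoverinternal.sipdomain.com",
    "lyncdiscover.sipdomain.com",
    "sip.sipdomain.com",
}
print(usable_records("sipdomain.com", "2010", deployed))
print(usable_records("sipdomain.com", "2010", deployed - {"sip.sipdomain.com"}))
```

In this model, removing sip.sipdomain.com leaves the Lync 2010 client with no usable record, which matches the sign-in failures seen during the SIP domain migration.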

Wednesday 19 February 2014

About the dreaded 504 timeout error with Lync Push Notifications (rings a bell?)

Like several others, I was pestered by an issue at a customer site where Lync 2013 push notifications to Apple and Windows Phone devices failed to work despite an apparently correct configuration. The error was a 504 timeout, returned by the push notification test cmdlet:

Test-CsMcxPushNotification -AccessEdgeFqdn

(where -AccessEdgeFQDN is the internal Edge Pool FQDN).


Alas, this is a frequent but (fortunately) well-documented issue, covered on a number of great blogs with several possible (and working) resolutions. Worth noting: proper Push Notification functionality is dependent on 
But what happens if everything else fails and you are still stuck with it?
After some further research and attempts, the only quick fix for me was to break the federation and re-create it from scratch, as follows:

1) Remove federation with push.lync.com
2) Remove the hosting provider for LyncOnline
3) Disable push notifications:
Set-CsPushNotificationConfiguration -EnableApplePushNotificationService $False -EnableMicrosoftPushNotificationService $False

4) Add back the Lync Online hosting provider:
New-CsHostingProvider -Identity "LyncOnline" -Enabled $True -ProxyFqdn "sipfed.online.lync.com" -VerificationLevel UseSourceVerification

5) Add back the allowed federated domain for Push Notifications:
New-CsAllowedDomain -Identity "push.lync.com"

6) Re-enable push notifications:
Set-CsPushNotificationConfiguration -EnableApplePushNotificationService $True -EnableMicrosoftPushNotificationService $True

7) Worth checking that Federation is enabled:
Set-CsAccessEdgeConfiguration -AllowFederatedUsers $True

Tested again, and, happy days :)

If that did not work for you, or possibly BEFORE you try, I suggest you check these great resources for alternative resolutions:


Tuesday 18 February 2014

Lync SIP trunks: internet or private?

A recurrent question during Lync technical workshops (and a key design decision in Lync Enterprise Voice deployments with PSTN SIP trunking) is identifying the most appropriate SIP trunk type. Internet or private?

SIP trunks can be deployed over the Internet or over a dedicated private WAN connection (typically MPLS), each with supposedly clear pros and cons. In reality, there is a bit of confusion around this topic and the decision is not necessarily straightforward.

There is a great deal of public literature on the subject if you want to dive deeper into each aspect. The purpose of this article is only to provide a quick reference table of the elements and criteria that should be considered and discussed to determine which circuit is most suitable.

| Feature | Internet-based SIP trunk: Pros | Internet-based SIP trunk: Cons | Internet-based SIP trunk: Mitigating factors | MPLS-based SIP trunk: Pros | MPLS-based SIP trunk: Cons | MPLS-based SIP trunk: Mitigating factors |
|---|---|---|---|---|---|---|
| PROVISIONING | Fast delivery times. Easy to set up | | | | Slower delivery times. More complex to set up. Likely to require a dedicated circuit to SIP trunk carrier | If core MPLS carrier is also a Telco, it might be able to provide SIP trunk on existing circuit |
| COST | Relatively cheap - lower TCO | | | | Expensive - higher TCO | |
| NETWORK READINESS | Practically 100% of businesses are already connected to the Internet and might potentially have capacity for SIP trunking. If not, in-place bandwidth upgrade is usually viable and requires no infrastructure upgrade | | | | Higher investment for small businesses that don't have MPLS circuits in place | |
| BANDWIDTH | Relatively abundant and cheap | Bandwidth is not guaranteed (best effort, no SLA). Asymmetric connections like ADSL provide limited upstream bandwidth | Ensure accurate capacity planning is carried out so that bandwidth is adequate for the expected number of PSTN sessions. Lync CAC may also help | Bandwidth is guaranteed with SLA | Significantly more expensive than Internet bandwidth | Appropriate capacity planning and QoS adoption would make significantly more efficient use of bandwidth |
| AVAILABILITY | | Uptime is not guaranteed (best effort, no SLA) | Redundant Internet connections to different carriers may reduce downtime | Uptime is usually guaranteed with SLA and significantly higher than Internet connections | | |
| NETWORK PERFORMANCE | | More network hops, more subject to packet loss, latency and jitter. Unpredictable performance irrespective of bandwidth | Geographically closer SIP trunk carriers might require fewer hops to their infrastructure | Network performance is more predictable, especially if QoS is implemented. MPLS less subject to packet loss, latency and jitter | | |
| CALL-CARRYING CAPACITY | Usually more abundant nominal bandwidth theoretically allows for a greater call-carrying capacity | Call-carrying capacity is influenced by many other factors and network conditions, which become more likely as sessions are added on the wire. Internet-based SIP trunking is generally advisable for low call volume requirements | | Call-carrying capacity is more easily predictable and voice traffic can be shaped. MPLS-based SIP trunking advisable for higher call volumes | | |
| QoS | | QoS cannot be implemented end-to-end. Traffic cannot be prioritised by type | Buying a dedicated additional Internet line for SIP trunk would help segregate data/web and voice traffic, yet proper QoS is not achievable | QoS can be implemented. Proper traffic prioritisation by type is viable | | |
| SECURITY | | Voice traffic flows through uncontrolled network and may potentially be intercepted | Use a SIP trunk provider that supports TLS (encrypts SIP signalling) and SRTP (encrypts media), or deploy SIP trunk through VPN (less recommended: adds further network overhead and may impact QoE) | Voice traffic flows through private and screened network | | TLS and SRTP are still advisable for enhanced security |
| NAT | | SIP protocol not NAT-friendly, several known issues when NAT is used | Prefer avoiding NAT when deploying SIP trunk. Alternatively, ensure firewalls are SIP-aware (SIP ALG) | NAT less likely to be deployed in MPLS network, but if so, same considerations apply | | Prefer avoiding NAT when deploying SIP trunk. Alternatively, ensure firewalls are SIP-aware (SIP ALG) |
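The capacity planning mentioned under BANDWIDTH can be approximated with simple arithmetic. The Python sketch below is a rough illustration, not a sizing tool: it assumes G.711 with 20 ms packetization (160 bytes of payload per packet, 50 packets per second) and 40 bytes of IP/UDP/RTP headers, and it ignores layer-2 overhead, SRTP, RTCP and SIP signalling. The 25% headroom and the function names are assumptions chosen for the example.

```python
import math


def per_call_bps(payload_bytes: int = 160, header_bytes: int = 40,
                 packets_per_second: int = 50) -> int:
    """IP-level bandwidth of one RTP stream in one direction, in bits per second.

    Defaults model G.711 at 20 ms packetization: 160 bytes of payload plus
    20 (IP) + 8 (UDP) + 12 (RTP) bytes of headers, 50 packets per second.
    """
    return (payload_bytes + header_bytes) * 8 * packets_per_second


def trunk_bandwidth_kbps(concurrent_calls: int, headroom: float = 0.25) -> float:
    """Bandwidth needed per direction for the given number of calls, with headroom."""
    return concurrent_calls * per_call_bps() * (1 + headroom) / 1000


def max_calls(link_kbps: float, headroom: float = 0.25) -> int:
    """Concurrent calls that fit in one direction of a link of the given size."""
    return math.floor(link_kbps * 1000 / (per_call_bps() * (1 + headroom)))


print(per_call_bps())            # bits per second for one call, one direction
print(trunk_bandwidth_kbps(30))  # kbps for 30 concurrent calls
print(max_calls(1000))           # calls that fit in a 1000 kbps upstream
```

On an asymmetric line, the upstream figure is the one to feed into max_calls, since it is the smaller of the two directions.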




Wednesday 8 January 2014

Personal experience and pain points migrating OCS 2007 R2 to Lync 2013

I was recently involved in an OCS 2007 R2 to Lync Server 2013 migration and thought it would be a good idea to share my experience, with a specific focus on what went wrong, was unexpected, or was undocumented.
I won’t get into the details on how to migrate. This is already widely documented on a number of blogs as well as TechNet (http://technet.microsoft.com/en-us/library/jj205375.aspx).
ENVIRONMENT: OCS 2007 R2. Two-node Enterprise pool, no external deployment (no Edge servers). Polycom CX-700 phones. Two Sonus (NET) VX 1200 gateways, each terminating an ISDN-30 trunk and used as hybrid gateways (that is, they served as Mediation Servers and no OCS Mediation Servers were present). OCS was the customer's main voice platform. No other PBX or phones were present.
Below are my pain points, in order of occurrence.

TOPOLOGY MERGE

Adding the Lync pool was quick and painless, up until the topology merge. The next steps, merging the topology and importing the legacy configuration into Lync, are where I suspected unpredictable results because of the hybrid gateway model in OCS 2007 R2. I struggled to find specific documentation about migrating such a scenario. The biggest question mark was: would the Lync pool be able to use media gateway-based mediation? I assumed the answer would be no.
Merging the topologies in Topology Builder was an apparently straightforward step, with the following entries in the log:
2013-09-25 14:04:56 INFORMATION :  No new Mediation Server added to the Office Communications Server 2007 / Office Communications Server 2007 R2 deployment.
Cannot find any Office Communications Server 2007 / Office Communications Server 2007 R2 "MediationServer" in the deployment
List of Office Communications Server 2007 / Office Communications Server 2007 R2 "Trusted application server" roles being migrated
Cluster fully qualified domain name (FQDN) "vx1.contoso.local"
Computer fully qualified domain name (FQDN) "vx1.contoso.local"
2013-09-25 14:04:56 INFORMATION:  UCMA application with the cluster fully qualified domain name (FQDN) "vx1.contoso.local" does not depend on a pool.
2013-09-25 14:04:56 INFORMATION:  UCMA application with the cluster fully qualified domain name (FQDN) "vx2.contoso.local" does not depend on a pool.
The result was media gateways being added as trusted application entries in the BackCompatSite.

IMPORT LEGACY CONFIGURATION

As no OCS 2007 R2 Mediation Servers were detected in the legacy topology, I wondered how the legacy configuration import would react, and I was ready to redo the voice configuration in Lync if required. I ran:
Import-CsLegacyConfiguration
I got the following warning for each route and gateway:
2013-09-25 14:10:23 WARNING:  Cannot find a Mediation Server with the fully qualified domain name (FQDN) "vx2.contoso.local". Run "Merge-CsLegacyTopology" cmdlet before using this cmdlet or make sure that a PSTN route is pointing to a valid Mediation Server. Skipping creation of a Lync Server 2013 PSTN route setting with the name "Emergency Services". numberPattern:  "^(\+999$)"
It seemed Lync was expecting to see “proper” domain-joined Mediation Servers in the topology, but that was not the case.
The result on Lync topology and configuration was the following:
  • Dial Plans (aka location profiles in OCS) were migrated fine
  • PSTN usages were migrated fine
  • Routes were migrated, despite the warning above saying “Skipping creation of a Lync Server 2013 PSTN route setting with the name (…)”. However, all had a null (empty) gateway.
  • Media gateways were imported in the legacy topology (BackCompatSite) as trusted application servers
  • No media gateways or trunks were migrated in the Lync topology
Rather than trying to have Lync 2013 use the mediation on the hybrid gateways, I re-added the VXs as PSTN gateways and created new trunks both in the Lync topology and on the media gateways, so that migrated users would immediately use the new Lync routes and Mediation Servers.
There was an additional warning in the legacy configuration import. This is a widely documented exception, as Lync 2013 does not accept certain characters in names.
2013-09-25 14:10:23 WARNING:  Policy/setting name  - "Service: Medium" has either ":" or "/". Import-CslegacyConfiguration is replacing them with "_" before migrating them. Office Communications Server 2007/Office Communications Server 2007 R2 policy/setting name  - "Service: Medium". Lync Server 2013 policy/setting name - "Service_ Medium"
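To know in advance which policy or setting names will change during the import, the substitution described in the warning above can be reproduced in a few lines of Python. This is a sketch based only on that log message (":" and "/" replaced with "_"); the function names are hypothetical and any other renaming rule the cmdlet may apply is not covered.

```python
from typing import Dict, Iterable


def migrated_policy_name(ocs_name: str) -> str:
    """Apply the substitution reported by the import warning: ':' and '/' become '_'."""
    return ocs_name.replace(":", "_").replace("/", "_")


def renamed_policies(ocs_names: Iterable[str]) -> Dict[str, str]:
    """Map each OCS name that would change to its expected Lync Server 2013 name."""
    return {name: migrated_policy_name(name)
            for name in ocs_names
            if migrated_policy_name(name) != name}


# "Service: Medium" is the name from the warning; "Default" is a made-up unchanged name.
print(renamed_policies(["Service: Medium", "Default"]))
```

Running this against the list of legacy names before the import gives a list of the names to look for on the Lync side afterwards.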
RESPONSE GROUP MIGRATION

By any measure the biggest pain point, although not an utter surprise: past experience and various publicly available literature suggest this is not a hassle-free step.
With this in mind, before migration I strongly suggest that you:
  1. Carefully document every low-level aspect of the existing response groups: queues, agents, groups, workflows, everything. The level of detail must allow you to recreate all response groups from scratch on Lync in case anything goes VERY wrong. It might take some time, but do yourself a favour and don’t skip this step. In my case, with around 80 objects among workflows, queues and groups, documenting took a while, but it was worth every second.
  2. Back up the response groups on OCS. Use the following command:
applicationsettingsexport.exe /backup /pool:ocspool.contoso.com /applicationID:Microsoft.RTC.Applications.Acd /file:ResponseGroupExport.xml
  3. Do a sanity check on the OCS response groups:
     • Remove all orphaned agents.
     • Ensure you have not renamed any of the agents in AD. There have been reports of RG migration failing because agents had certain user attributes (name) changed in AD. If in doubt, remove the agent and add it back to the agent lists as well as to any other group.
Once done, I ran the following to migrate response groups from OCS to Lync:
Move-CsRgsConfiguration -Source ocspool.contoso.com -Destination lyncpool01.contoso.com
Worth mentioning:
  1. This is a one-off step. You cannot migrate selected response groups or objects. It's all or nothing.
  2. The only RG resources actually "moved" to Lync are the contact objects representing the RG along with related sip uri. Lync now becomes the RG owner. All other objects (queues, workflows, agents) are simply “mirrored” to Lync, and a copy of everything is retained on OCS for rollback purposes (however, you cannot use the above command for that).  After the service has been migrated, all calls to a Response Group phone number will be handled by Lync 2013. Calls will no longer be handled by OCS.
  3. There are several other requirements before you can run the command. Check http://technet.microsoft.com/en-us/library/gg398782.aspx for more details.
I ran the command and apparently got no errors. As usual with PowerShell, no output is displayed on success (unless you use the -Verbose switch).
I then checked that all RG objects showed up on Lync, and they did. I tried a PSTN call to one response group, but it failed. However, the same response group could be called through a Lync call. A quick check led me to determine that the tel URI field in all workflows (around 20) was empty.
I was far from excited to realise that not all information had been migrated over, but as everything else seemed to have been copied fine, I assumed it was just a matter of repopulating the tel URI.
I tried with the first workflow, and the next bad surprise showed up:
Response Group Update Failure : An instance with ID "a0624626-2744-40c4-b2f7-b2e2a99c8a95" exists with a different OwnerPool. Changing OwnerPool on an existing object is not supported.
An apparently undocumented error, or, at least, unknown to Google :-)
Furthermore, strange entries showed up in the event log on the OCS pool servers:
Log Name: Office Communications Server
Source: OCS User Services
Date: 10/9/2013 5:46:11 PM
Event ID: 30951
Computer: OCS-FE1.contoso.local
Description:
Active Directory indicates that user is homed on a different server but user data exists on this server as well.
Active Directory Object with guid {C99F9636-53C6-4760-857D-36FFCD54F667} and SipUri rg1@contoso-int.com is listed as being homed on lyncpool01.contoso.com.
Cause: It is possible that the Active Directory attribute msRTCSIP-PrimaryHomeServer has been incorrectly modified or the user has been improperly re-homed using outdated administration tools.

Log Name: Office Communications Server
Source: OCS Response Group Service
Date: 10/9/2013 5:46:11 PM
Event ID: 31053
Computer: OCS-FE1.contoso.local
Description:
Office Communications Server 2007 R2, Response Group Service was not able to establish the application endpoint.
The following exception occurred when establishing application endpoint associated with 'sip:rg1@contoso-int.com': Microsoft.Rtc.Signaling.RegisterException - 482 - The endpoint was unable to register. See the ErrorCode for specific reason..
Cause: Failed to connect to Front End server or the Front End server is misconfigured.
Resolution: Check the Front End server for errors.

Log Name: Office Communications Server
Source: OCS Response Group Service
Date: 10/9/2013 5:46:11 PM
Event ID: 31189
Computer: OCS-FE1.contoso.local
Description:
Application endpoint has been terminated and Office Communications Server 2007 R2, Response Group Service has recreated it.
Application endpoint associated with 'sip:rg1@contoso.com' has been terminated and Office Communications Server 2007 R2, Response Group Service has recreated it.
Long story short, I was unable to change anything in the workflow. With little time available to determine the root cause, I deleted and recreated all workflows. Fortunately, that worked and the RGs were back in service in around one hour. I didn’t regret a single minute spent documenting :-)
Surprises were not over, however. The next morning, as the customer attempted to add an agent to a group, a now “familiar” exception came knocking again:
[Screenshot: Error1]
To recreate the groups or not? With functional RGs and enough time for a deeper analysis, I decided to investigate the root cause further. As a first step, a Lync trace confirmed the error:
[Screenshot: Error2]
I then turned to the SQL back end, specifically the rgsconfig database where the response group configuration is stored, looking for a possibly conflicting GUID in the OwnerPoolID field or a wrong OwnerPool.
The first check was in the dbo.OwnerPool table, where I took note of the Lync pool ID to determine whether other objects referenced the correct one:
[Screenshot: Error3]
I then looked at the OwnerPoolID field in the dbo.AgentGroups table. I was expecting a wrong value; surprisingly, all values were NULL.
I then manually copied the Lync pool ID (retrieved from the dbo.OwnerPool table) into the OwnerPoolID field for one of the groups, and tried again to add a member to the group. This time, it worked!
The resolution was then to populate the OwnerPoolID field of every object, including groups and queues, with the ID value from the dbo.OwnerPool table.
[Screenshot: Error4]
Not everyone will like the idea of fiddling with raw databases and manually modifying records one by one. If you ever run into the same issue, it all depends on how big and complex your RGs are. Alternatively, you might just want to delete and recreate the objects. In my case, newly created objects got the OwnerPoolID populated correctly.
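For illustration only, the logic of the fix (copy the pool ID into every NULL OwnerPoolID) can be sketched in Python against a throwaway in-memory SQLite database. This is not the real rgsconfig schema: the real database lives on SQL Server, and apart from the OwnerPool and AgentGroups table names and the OwnerPoolID field mentioned in this post, every table name, column name and value below is a hypothetical stand-in.

```python
import sqlite3
from typing import Dict, Tuple


def backfill_owner_pool(conn: sqlite3.Connection,
                        tables: Tuple[str, ...] = ("AgentGroups", "Queues")) -> Dict[str, int]:
    """Set OwnerPoolID to the single pool ID wherever it is NULL.

    Returns the number of rows changed per table. Table names come from a
    fixed tuple, never from user input, so interpolating them is acceptable here.
    """
    pool_ids = [row[0] for row in conn.execute("SELECT ID FROM OwnerPool")]
    if len(pool_ids) != 1:
        raise RuntimeError("expected exactly one owner pool, found %d" % len(pool_ids))
    changed = {}
    for table in tables:
        cursor = conn.execute(
            f"UPDATE {table} SET OwnerPoolID = ? WHERE OwnerPoolID IS NULL",
            (pool_ids[0],),
        )
        changed[table] = cursor.rowcount
    conn.commit()
    return changed


def demo() -> Tuple[Dict[str, int], int]:
    """Build a toy database with NULL OwnerPoolID values and repair it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE OwnerPool (ID TEXT PRIMARY KEY, Fqdn TEXT);
        CREATE TABLE AgentGroups (Name TEXT, OwnerPoolID TEXT);
        CREATE TABLE Queues (Name TEXT, OwnerPoolID TEXT);
        INSERT INTO OwnerPool VALUES ('pool-1', 'lyncpool01.contoso.com');
        INSERT INTO AgentGroups VALUES ('Helpdesk', NULL);
        INSERT INTO AgentGroups VALUES ('Sales', 'pool-1');
        INSERT INTO Queues VALUES ('Helpdesk queue', NULL);
    """)
    changed = backfill_owner_pool(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM AgentGroups WHERE OwnerPoolID IS NULL"
    ).fetchone()[0]
    conn.close()
    return changed, remaining


print(demo())
```

Only rows with a NULL OwnerPoolID are touched, so objects created after the migration (which already carry the correct ID) are left alone. Anyone attempting the equivalent on a production back end would want a database backup first.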

MIGRATE CONFERENCING DIRECTORIES

As part of the migration process, conferencing directories must be migrated from OCS to Lync. If you follow TechNet verbatim, you will do this before deactivating and decommissioning the OCS pool (check http://technet.microsoft.com/en-us/library/jj205300.aspx). However, if you do it at that point, you will later hit the following error while deactivating the Conferencing Attendant component:
A call to a subtask failed.: The call to subtask AppServer.GetAppState failed. Pool is not ready.
To resolve this issue, temporarily move the conferencing directory back from Lync to OCS 2007 R2 with the Move-CsConferenceDirectory cmdlet. The process is described in detail on Lee Desmond’s blog: http://www.leedesmond.com/weblog/?p=749.

OCS 2007 R2 to Lync 2013 conference attendant migration exception: cannot set region via Lync Control Panel

On a recent migration from OCS 2007 R2 to Lync Server 2013, I ran into an unexpected and apparently undocumented issue while migrating the OCS conference attendant.

I used the following command to migrate the only conferencing attendant, and the result seemed OK:

Move-CsApplicationEndpoint -Identity sip:Microsoft.Rtc.Applications.Caa-E55ADF4E-AFE5-4480-915C-694773F4CFA8@contoso.com -Target lyncpool.contoso.com

A detailed procedure is available in this excellent blog: https://www.simple-talk.com/sysadmin/unified-messaging/migrating-from-ocs-2007-r2-to-lync---part-3/  

To confirm that the number was migrated to the new pool, I checked the Lync Server Control Panel; the dial-in entry was there, but with a warning stating that no region was assigned to it.

When trying to assign the region via the Lync Server Control Panel, I got this error:

Set-CsDialInConferencingAccessNumber : The region mappings for this access number are inconsistent.
I was unable to find any public resource mentioning the error above.

The resolution was to assign the region manually via PowerShell, using the -ScopeToGlobal switch:

Set-CsDialInConferencingAccessNumber -Identity "sip:Microsoft.Rtc.Applications.Caa-E55ADF4E-AFE5-4480-915C-694773F4CFA8@contoso.com" -Regions "UK" -ScopeToGlobal

Note that, until a region is specified, dial-in might work but no dial-in access number is populated into the email template when creating a new Lync meeting through Outlook.