Wednesday 15 April 2015

Lync calls fail with long post-dial delay? Check the Edge!

I hope someone can benefit from the many hours I spent on this issue :)

Lync 2013 Redundant deployment. 3-node Enterprise pool. 2-node Edge servers. Public addresses on external interfaces. All OS are Windows Server 2012 R2. All Lync servers at latest CU as of February 2015. All infrastructure virtualised on VMware ESX 5.5.

Lync and PSTN calls suddenly could not be connected by external or internal endpoints. Clients received a call, call is answered, client hangs on "connecting..." state for some seconds, and then call is dropped.
Along with issue above, calls suddenly took a long time to be initiated (long post-dial delay). Whilst up to 2-3 "beeps" should be considered as normal, we experienced up to 8.

The sneaky nature of the issue was no apparent recurrence pattern. On average, we experienced the issue 4 times in around 3 weeks. Worth nothing saying, it was a hugely disruptive problem affecting about 10,000 users.

No related events logged on windows logs (checked on Front-End Servers, Edge, and Mediation)
  • Attempts to stop some Edge services (MediaRelaySvc.exe and MRASSvc.exe) resulted in services being stuck in “stopping” state indefinitely. And it was not possible to kill them
  • IM and presence still functional
  •  The only workaround to re-established functionality was rebooting both edge servers
  • Firewall, DNS and routing was thoroughly checked and the correct configuration was confirmed to be in place
Experienced issue was identical word-by-word, to the one described in this thread. All fixes suggested in the forum attempted without success.

Such failures can be usually narrowed-down to a few types of issues:
  1.  Firewall
  2.  Routing
  3. MRAS
After ensuring 1) and 2) were correct, we concentrated on MRAS; traces and Lync reports indeed provided evidence something was not quite right (candidates not exchanged, timeout on contacting MRAS resulting in endpoint being unable to obtain MRAS token).

After considerable digging, we found out the issue was triggered by two drivers: vShield Endpoint Thin Agent driver (vsepflt.sys) and vShield Endpoint TDI Manager driver (vnetflt.sys), both interacting at the network layer. Conclusive proof was provided by Microsoft PSS, by analysing a memory dump taken during a failure and Edge MRAS in hanging state (service stopping….).

VMware vShield Endpoint is required to manage anti-virus and anti-malware policies for virtualized environments. vShield Endpoint strengthens virtualization security with enhanced endpoint protection by offloading AV processing to a secure virtual appliance supplied by VMware partners. All servers in the deployment featured a file-level AV scanning, and the drivers were required as an agentless communication component between the virtual machines and VMware hosts.

such drivers were already known to cause stability issues, including BSOD (check this and this other post. Besides, they are not certified by Microsoft (at least, until the tested build).

Although we thought we were running a version fixing the issues described in the articles above, it seemed we hit a different type of bug which VMware fixed at a later date through an ad-hoc patch.
Our only other quick fix was to uninstall the drivers from the Lync servers completely. Simply disabling AV scanning or disabling the drivers did not help.

Low-level processes from third party applications can affect stability and reliability of Lync traffic. In our case it was even worse: Edge services were in hung state, causing Media Relay authentication to fail for all calls. Whilst a file-level antivirus scanner should be installed on any Lync server as a common security measure (with the correct exclusions), you should pay close attention to low-level additional components or third parties like:
  • Network-level inspection
  • IDS
  • Personal firewall add-ons
  • Network accelerators
  • Broadly speaking: any other network-level software may interfere with Lync traffic
Confirming their full compatibility will definitely save you some headaches.

I have experienced very similar issues on another deployment, this time, with McAfee antivirus. On that occasion, the trigger was the FireTDI driver (a host intrusion detection component).

1 comment:

  1. It’s very informative and you are obviously very knowledgeable in this area. You have opened my eyes to varying views on this topic with interesting and solid content. Actually I read it yesterday but I had some thoughts about it and today I wanted to read it again because it is very well written.