Thursday 30 April 2015

Skype for Business and Lync 2013 DDC - Detailed Design Calculator 5.0

Just in time for the Skype for Business Server release, my friend and co-author Alberto Nunes and I are very happy to announce Skype for Business DDC - Detailed Design Calculator 5.0. It adds a lot of long-awaited features (keep reading for details). Please grab it from here. Please take a moment to rate us. Thank you! :-)

DDC is a simple offline, Excel-based, low-level design calculator for Microsoft Skype for Business and Lync 2013 on-premises deployments. Fill in host names, IP addresses etc., and DDC will calculate DNS records, certificate names, firewall rules, deployment scripts and several other design elements to help speed up your deployment.
DDC is a continuously evolving project and you should expect frequent updates with new features added over time. It is and will always be free.
Bug reports, requests for improvements or new features, suggestions, criticism etc. are greatly appreciated.

Features

New in version 5.0.0:

  • Support for multiple SIP domains (up to 8), each with several configurable options (strict domain matching, SIP/XMPP federation, etc.)
  • DNS tables reorganised for multiple domain support (records are sorted by domain).
  • Added the ability to include/exclude external servers (Edge pool, Reverse Proxy) from the deployment
  • Added the ability to choose the AD or main SIP domain name for Pools and Web Services names (earlier DDC versions used the primary SIP domain name by default and did not allow this to be changed)
  • Option to deploy 1 or 2 network cards on dedicated Mediation servers (separate internal and PSTN IP addresses). We have not included this option for collocated Mediation (not recommended)
  • Option to use separate public IP addresses for external web-based applications (FE and Director pool web services, Office Web Apps)
  • Support for Office Web Apps farms (up to 6 nodes)
  • Additional scripts (setup accounts for synthetic transactions)
  • Required empty input cells displayed in red
Other:
  • Supports Standard and Enterprise pools (up to 12 nodes), with pure device-based load balancing (HLB) or a combination of DNS load balancing and device-based load balancing for web services (DNS LB);
  • Supports Edge, Director and Mediation pools (up to 12 nodes per role) with HLB or DNS LB;
  • Supports up to 4 PSTN gateways (can be media gateways, direct SIP trunks, etc.), with configurable SIP transport (TCP or TLS), media ports and media bypass;
  • Ability to specify custom media ports for clients and servers. DDC automatically applies consecutive non-overlapping ranges, and creates the appropriate commands on the Scripts sheet to apply these on to your deployment;
  • Calculates internal and external certificates CN and SAN; for the external certificate, it provides the option of separate or single certificate for Edge and Reverse proxy;
  • Calculates DNS entries for internal and external zones. In addition, DDC generates a script (in the Scripts sheet) which will automatically add the required records for either pinpoint or split-brain DNS.
  Calculates firewall rules for: 
  • Internal firewalls (internal-facing DMZ); 
  • External firewalls (external-facing DMZ);
  • PSTN: The PSTN firewall sheet calculates custom rules for firewalls behind PSTN gateways, if any;
  • Endpoints: the internal client Firewall sheet calculates custom rules for personal firewalls installed on clients, and rules required in scenarios with endpoints segregated by VLANs or other restrictions in place.
  Script section: new in version 4.x and still at an initial stage, but with plans to grow it over time. Currently it supports:
  • DHCP: We have included a modified version of DhcpConfigScript.bat (http://technet.microsoft.com/en-us/library/gg412988(v=ocs.14).aspx), with the correct hexadecimal values automatically calculated and included, based on your design inputs; this removes the requirement to use dhcputil to generate the script and makes it ready to run on x86 and x64 Windows DHCP servers;
  • DNS: Scripts to create necessary records based on dnscmd (with both pinpoint and split-brain, based on your selection);
  • Office Web Apps: scripts to automate certificate request (through certutil), installation and web farm creation;
  • Forward Proxy exceptions;
  • QoS: PowerShell commands to configure custom port ranges;
  • Setup accounts for synthetic transactions.
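
To give an idea of what the "hexadecimal values" in the DHCP item above look like, here is a minimal Python sketch of the domain-name encoding that RFC 3361 defines for DHCP option 120: an encoding byte of 0, followed by length-prefixed DNS labels and a terminating zero byte. It is only an illustration and not DDC's own implementation. The function name and the host name sip.example.com are made up for the example, and the exact layout expected by DhcpConfigScript.bat is not shown in this post.

```python
def encode_option_120(fqdn: str) -> str:
    """Return the RFC 3361 domain-name encoding of one FQDN as upper-case hex.

    Hypothetical helper for illustration; not taken from DDC.
    """
    labels = fqdn.rstrip(".").split(".")
    payload = bytearray([0])  # encoding byte 0 = the option carries domain names
    for label in labels:
        raw = label.encode("ascii")
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"invalid DNS label: {label!r}")
        payload.append(len(raw))  # each label is prefixed with its length
        payload.extend(raw)
    payload.append(0)  # the empty root label terminates the name
    return payload.hex().upper()


# Placeholder host name, not a real deployment value.
print(encode_option_120("sip.example.com"))
# prints 0003736970076578616D706C6503636F6D00
```

Reading the output left to right: 00 is the encoding byte, 03 736970 is "sip", 07 6578616D706C65 is "example", 03 636F6D is "com", and the final 00 closes the name.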

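The "consecutive non-overlapping ranges" idea for custom media ports in the feature list can be sketched in a few lines of Python. This is an assumption about the general approach, not DDC's actual Excel logic: the function name, the range names, the start port and the sizes below are all illustrative values.

```python
from typing import Dict, Tuple


def allocate_port_ranges(start: int, sizes: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Assign back-to-back port ranges, in the order given, starting at `start`.

    Hypothetical helper: returns {name: (first_port, last_port)}.
    """
    ranges: Dict[str, Tuple[int, int]] = {}
    next_port = start
    for name, count in sizes.items():
        if count <= 0:
            raise ValueError(f"range {name!r} needs at least one port")
        last_port = next_port + count - 1
        if last_port > 65535:
            raise ValueError(f"range {name!r} runs past port 65535")
        ranges[name] = (next_port, last_port)
        next_port = last_port + 1  # the next range starts right after this one
    return ranges


# Illustrative values only; use the ports from your own design.
for name, (first, last) in allocate_port_ranges(
    50000, {"audio": 20, "video": 20, "app_sharing": 20}
).items():
    print(f"{name}: {first}-{last}")
```

Because each range starts one port after the previous one ends, the ranges cannot overlap, and changing one size shifts all the following ranges automatically.
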
How to use 

The tool was tested on Microsoft Excel 2013 and 2010 for Windows desktop (Excel Online is not supported and we have not tested it on Office for Mac). Macros and active content must be enabled. Fill in all relevant fields in the Global Data, Resource Data and Other Data sheets. All input cells with dynamic data have a dark blue background and are already populated with sample data. Please change all values to reflect your actual design. Empty or invalid entries will be marked red.
Important: remember to press the Generate Data button (available in all sheets) when data input is complete (or when you change a value). This is required to refresh and resize calculations and views. Please do not manually resize, hide or unhide rows. This is done programmatically when you press the Generate Data button so that only the relevant content is displayed.

 

Known issues

  • When you press the Generate Data button, you may notice some screen flickering through the DDC sheets. This is due to the recalculations of data and views. On slower machines, it can take some time for refresh to complete;
  • An issue in the December 2014 Excel security update MS14-082 may break DDC functionality in some circumstances. You may notice pull-down menus not updating, the Generate Data button not working, etc. This is due to a problem described in the following article (see the section "Known issues with this security update"). Hotfixes have been released in the March 2015 updates for Office 2007, 2010 and 2013. Refer to the articles below for more details and any prerequisites: Microsoft Support and Microsoft Excel Support Team Blog.

Disclaimer

DDC is a third-party tool developed by independent Microsoft UC Solutions Architects. The authors are not affiliated with Microsoft. Skype for Business™, Skype™, Lync™, Office Communications Server™, Exchange™ and Excel™ are registered trademarks of the Microsoft Corporation™. Although we took every care in calculations and scripts, use at your own risk! Please read the extended disclaimer in the file.


Assumptions and limitations

Note the following assumptions or known limitations (some of which will be addressed in future versions): 
  • Very limited content validation and error catching: you will not be warned if you type 256.256.256.256 as an IP address :-) Make sure you type the correct data;
  • DDC currently has no sizing/capacity calculator features. It assumes you already have made your determinations on number and types of servers to implement;
  • Only IPv4 is supported;
  • Only a single Pool per role is supported;
  • Single Reverse Proxy;
  • Edge, Director and Mediation, even when 1 node is selected, are always configured in a Pool to allow for easier scalability and certificate management; 
  • When HLB (device-based load balancing) is selected for Front-End and Director Pools, we assume that internal web services host names will not be overridden. This is optional when HLB is used; override becomes mandatory with DNS LB;
  • We assume internal server resources will use an internal Certification authority. This includes Front-end, Mediation, Director and internal Edge and reverse proxy interfaces. Firewall rules are included to grant DMZ servers (Edge and reverse proxy) access to the CRL;
  • If the same domain name is used for Active Directory and SIP, in a multiple SIP domain deployment, we assume this will be the primary (default).
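
Since DDC does very little input validation and supports IPv4 only, it can be worth checking your address list before typing it in. The snippet below is a minimal sketch using Python's standard ipaddress module. The function name is made up for the example and none of this is part of DDC.

```python
import ipaddress


def is_valid_ipv4(text: str) -> bool:
    """Return True if `text` is a well-formed dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


# Sample inputs for illustration only.
for candidate in ("10.0.0.1", "256.256.256.256", "::1"):
    print(candidate, is_valid_ipv4(candidate))
```

This rejects both the out-of-range 256.256.256.256 example above and IPv6 addresses such as ::1.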

Credits

For beta testing, bug reports, suggestions, feedback and other valuable input: Corey McClain (@cdhtweetstech), Lasse Nordvik Wedø (@lawedo), Antonio Spirandelli (@spady7), Dino Caputo (@dinocaputo), Fabrizio Volpe (@fabriziovlp), MaxSanna (@MaxSanna), Igor Kravchenko, Korbyn, Lutenus, Mauro Rita (@jmrita), Thomas Juhl Olesen, Wilfried van Oosterhout, Pat Richard (@patrichard), Daniel Banfield, James Brewster.
 

Version history

Version 5.0.0 - 1st May, 2015
New features:
1) Support for multiple SIP domains (up to 8), each with several configurable options (strict domain matching, SIP/XMPP federation, etc.)
2) DNS tables reorganised for multiple domain support (records are sorted by domain).
3) Added the ability to include/exclude Edge pool in the deployment
4) Added the ability to choose the AD or main SIP domain name for Pools and Web Services names (earlier DDC versions used the primary SIP domain name by default and did not allow this to be changed)
5) Option to deploy 1 or 2 network cards on dedicated Mediation servers (separate internal and PSTN IP addresses). We have not included this option for collocated Mediation (not recommended)
6) Option to use separate public IP addresses for external web-based applications (FE and Director pool web services, Office Web Apps)
7) Support for Office Web Apps farms (up to 6 nodes)
8) Additional scripts (synthetic transactions)
9) Cells with empty or invalid entries are marked red
Bug fix:
1) Some naming inconsistencies
2) Visual improvements (smaller fonts for better readability on higher res) - refresh issues
3) Numerous optimisations on scripts and code (for many suggestions on scripts: thanks @PatRichard)
4) Some issues on firewall sheets (missing rules for Directors)

Version 4.3.1 - 13th April, 2015
New features: none
Bug fix: Lyncdiscover record was displayed in internal DNS in some instances. (thanks Fgarib)

Version 4.3 - 9th November, 2014
New features: none
Bug fix:
1) Missing rule in Firewall (external) for tcp/443 on A/V Edge server
2) Missing rule in Firewall (external) for tcp/80 on Reverse Proxy

Version 4.2 – 7th September, 2014
New features: visual improvements - added extended disclaimer
Bug fix:
1) incorrect implementation of RFC3361 (http://www.rfc-editor.org/rfc/rfc3361.txt) caused the DHCPUTIL script to generate an incorrect string for option 120 (row 11 in Scripts sheet). Thanks to Daniel Banfield for reporting the issue
2) bug in the DHCP script that prevented it from generating the correct entry for Lync internal web services depending on the load balancing method
3) various script optimisations and some typos (thanks @patrichard)

Version 4.1.1 – 4th September, 2014
New features: adds an entry for lyncdiscover in the DNS internal sheet (required in specific scenarios where Windows Phone 8.x devices are unable to sign in from a corporate WiFi; thanks to @patrichard for the input). More info at http://jackstromberg.com/2013/06/lync-2013-dns-settings/
Bug fix: on Firewall (internal) sheet, the edge pool FQDN was displayed in a rule (should have been the FE pool FQDN)

Version 4.1 – 29th August, 2014
New features: visual improvements
Bug fix: Internal Office Web Apps certificate missed physical server name in the SAN (without it, the farm always shows as unhealthy). Thanks to @patrichard for notifying
 
Version 4.0.4 – 2nd August, 2014
New features: none
Bug fix: naming conventions - typos - some inaccurate error catching

Version 4.0.3 – 19th April, 2014
New features: none
Bug fix: additional refresh issues. Thanks to Wilfried van Oosterhout for notifying.

Version 4.0.2 – 13th April, 2014
New features: none
Bug fix: some issues in the hide/show procedures causing some entries to be incorrectly displayed (Director Web services, PSTN gateways and other). Thanks to Wilfried van Oosterhout for notifying.

Version 4.0.1 – 22nd March, 2014
New features: none
Bug fix: Inconsistencies in Office Web Apps / Office Online naming conventions + improvements on scripts descriptions
 
Version 4.0.0 – 21st March, 2014
New features: see the Features section for a full overview of existing and new features
Bug fix: Several visual issues, optimisations

Version 3.0.2 - 13th February, 2014
New features: none
Bug fix: minor code bug causing resource data refresh issues when changing Mediation pool type

Version 3.0.1 - 11th February, 2014
New features: none
Bug fix: formula issue in PSTN firewall sheet caused some IP addresses to display incorrectly (thanks Lutenus for notifying)

Version 3.0 - 10th February, 2014
New features: Support for PSTN Gateways, Mediation and Director Pools
Bug fix: several bug fixes, code and visual improvements

Version 2.0.4 - 23rd December, 2013
New features: none
Bug fix:
1) fix an issue where formula was displayed in some cells instead of result
2) On a standard edition pool, first SAN entry was not correct (should have been a reiteration of CN)

Wednesday 15 April 2015

Lync calls fail with long post-dial delay? Check the Edge!

I hope someone can benefit from the many hours I spent on this issue :)

INFRASTRUCTURE
Lync 2013 Redundant deployment. 3-node Enterprise pool. 2-node Edge servers. Public addresses on external interfaces. All OS are Windows Server 2012 R2. All Lync servers at latest CU as of February 2015. All infrastructure virtualised on VMware ESX 5.5.

ISSUE DESCRIPTION
Lync and PSTN calls suddenly could not be connected by external or internal endpoints. A client would receive a call and answer it, hang in the "connecting..." state for some seconds, and then the call was dropped.
Along with the issue above, calls suddenly took a long time to be initiated (long post-dial delay). Whilst up to 2-3 "beeps" should be considered normal, we experienced up to 8.

The sneaky part was that the issue showed no apparent recurrence pattern. On average, we experienced it 4 times in around 3 weeks. Needless to say, it was a hugely disruptive problem affecting about 10,000 users.


OTHER INFORMATION AND THINGS CHECKED
  • No related events in the Windows event logs (checked on Front-End, Edge and Mediation servers)
  • Attempts to stop some Edge services (MediaRelaySvc.exe and MRASSvc.exe) left the services stuck in the “stopping” state indefinitely, and it was not possible to kill them
  • IM and presence were still functional
  • The only workaround to re-establish functionality was rebooting both Edge servers
  • Firewall, DNS and routing were thoroughly checked and the correct configuration was confirmed to be in place
The issue we experienced was identical, word for word, to the one described in this thread. All fixes suggested in the forum were attempted without success.

ANALYSIS
Such failures can usually be narrowed down to a few types of issues:
  1. Firewall
  2. Routing
  3. MRAS
After ensuring 1) and 2) were correct, we concentrated on MRAS; traces and Lync reports indeed provided evidence that something was not quite right (candidates not exchanged, timeouts when contacting MRAS resulting in endpoints being unable to obtain an MRAS token).




ROOT CAUSE
After considerable digging, we found out the issue was triggered by two drivers: vShield Endpoint Thin Agent driver (vsepflt.sys) and vShield Endpoint TDI Manager driver (vnetflt.sys), both interacting at the network layer. Conclusive proof was provided by Microsoft PSS, who analysed a memory dump taken during a failure while the Edge MRAS service was hung (service stopping...).

WHAT DO THE DRIVERS DO
VMware vShield Endpoint is required to manage anti-virus and anti-malware policies for virtualized environments. vShield Endpoint strengthens virtualization security with enhanced endpoint protection by offloading AV processing to a secure virtual appliance supplied by VMware partners. All servers in the deployment used file-level AV scanning, and the drivers were required as an agentless communication component between the virtual machines and VMware hosts.

RESOLUTION
Such drivers were already known to cause stability issues, including BSOD (check this and this other post). Besides, they are not certified by Microsoft (at least, not up to the build we tested).



Although we thought we were running a version fixing the issues described in the articles above, it seemed we hit a different type of bug which VMware fixed at a later date through an ad-hoc patch.
Our only other quick fix was to uninstall the drivers from the Lync servers completely. Simply disabling AV scanning or disabling the drivers did not help.
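
If you want a quick way to see whether these driver files are present on a server (for example before and after an uninstall), something like the following Python sketch could help. It only checks for the file names mentioned above in the usual Windows driver folder; the folder location and the helper name are assumptions, and a file being present does not prove the driver is actually loaded.

```python
import os
from pathlib import Path

# Driver file names taken from this post; extend the tuple as needed.
SUSPECT_DRIVERS = ("vsepflt.sys", "vnetflt.sys")


def find_drivers(driver_dir, names=SUSPECT_DRIVERS):
    """Return the names from `names` that exist in `driver_dir` (case-insensitive)."""
    directory = Path(driver_dir)
    if not directory.is_dir():
        return []
    present = {entry.name.lower() for entry in directory.iterdir()}
    return [name for name in names if name.lower() in present]


# Assumed default location of driver files on a Windows server.
default_dir = Path(os.environ.get("SystemRoot", r"C:\Windows")) / "System32" / "drivers"
found = find_drivers(default_dir)
print("Found:", ", ".join(found) if found else "none")
```

Run it locally on each Lync server; on a machine without that folder it simply reports that nothing was found.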

TAKEAWAY
Low-level processes from third-party applications can affect the stability and reliability of Lync traffic. In our case it was even worse: Edge services were in a hung state, causing Media Relay authentication to fail for all calls. Whilst a file-level antivirus scanner should be installed on any Lync server as a common security measure (with the correct exclusions), you should pay close attention to additional low-level or third-party components such as:
  • Network-level inspection
  • IDS
  • Personal firewall add-ons
  • Network accelerators
  • Broadly speaking: any other network-level software that may interfere with Lync traffic
Confirming their full compatibility will definitely save you some headaches.

OTHER
I have experienced very similar issues on another deployment, this time, with McAfee antivirus. On that occasion, the trigger was the FireTDI driver (a host intrusion detection component).