SharePoint 2013 Distributed Cache Issues

Installing, configuring, and troubleshooting the finicky Distributed Cache

Recently I worked on several farms where the Distributed Cache threw many errors and would not start correctly.  Here is what I did to resolve it.

Issue 1: Ports

Requirement: If you can, make sure the ports needed for the Distributed Cache Service (DCS) are open and working among the servers BEFORE you join a server to a farm or create the farm.

Why: If these ports are blocked, the DCS, which provisions the AppFabric Caching Service on the server, will not provision correctly.  The only fix afterwards is to unjoin the failing server(s) from the farm and re-add them.

Scenario:  The farm has 4 servers: 1 database server, 1 application server, and 2 web front ends (WFEs) with NLB.  The WFEs sit in a separate VLAN or protection domain, with a firewall appliance between them and the application and database servers.  The firewall blocks everything unless a rule allows the traffic, and each rule must define the direction of the traffic, the ports, and the zones (protection domains/VLANs) involved.  The DCS ports were blocked while the farm was being set up.  Later, errors appeared for the AppFabric service and the DCS, the service was stopped, and PowerShell commands against it failed.  The fix: open the ports, unjoin the WFEs from the farm, and add them back.  After that the service works.

Ports: Distributed cache requires that communication between the farm servers (not database server) occur on the following ports:

| Port Name | TCP/IP Port | Purpose |
| --- | --- | --- |
| cache port | 22233 | The cache port is used for transmitting data between the cache hosts and your cache-enabled application. |
| cluster port | 22234 | The cache hosts use the cluster port to communicate availability to each of their neighbors in the cluster. |
| arbitration port | 22235 | If a cache host fails, the arbitration port is used to make certain that the cache host is unavailable. |
| replication port | 22236 | The replication port is used to move data between hosts in the cache cluster. This supports features such as high availability and load balancing. |
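
Before joining a server, it can help to confirm that the four ports above are reachable from each farm server. The following is a minimal sketch and not part of the original troubleshooting: Test-TcpPort is a hypothetical helper built on the .NET TcpClient class, and the server names are placeholders. Keep in mind that nothing listens on these ports until the caching service is running, so an error such as "connection refused" may only mean there is no listener yet, while a timeout is more suggestive of a firewall silently dropping the traffic.

```powershell
# Hypothetical helper: attempts a TCP connection and reports the outcome.
function Test-TcpPort {
    param(
        [string] $ComputerName,
        [int]    $Port,
        [int]    $TimeoutMs = 1000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        # No answer within the timeout: the traffic may be filtered.
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) { return 'Timeout' }
        # Throws if the connection was refused or the name did not resolve.
        $client.EndConnect($async)
        return 'Open'
    } catch {
        return "Error: $($_.Exception.Message)"
    } finally {
        $client.Close()
    }
}

# Placeholder server names (assumptions); replace with your farm servers.
$servers = 'APP01', 'WFE01', 'WFE02'
$ports   = 22233, 22234, 22235, 22236

foreach ($server in $servers) {
    foreach ($port in $ports) {
        $result = Test-TcpPort -ComputerName $server -Port $port
        '{0}:{1} -> {2}' -f $server, $port, $result
    }
}
```

Run it from each farm server in turn, because firewall rules are directional and a port that is open from the application server to a WFE may still be blocked in the other direction.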


Issue 2: Permissions Missing

Requirement: The DCS config file tells the AppFabric service which provider and database connection string to use.  Its permissions must include a security account that allows the file to be read.  That account is only added when the ports are open and you add a server to a farm.  I have tried to add it later, but because it is a security account, it is not something you can select and add.

Config file for the DCS
This config file exists on every SharePoint farm server, in the directory shown in the image above.  Right-click it and select Properties, Security, Advanced security:
DCS Config File Permissions
In the scenario above, where the ports were blocked between the WFEs and the application server, the AppFabricCachingService principal shown above was missing.  The owner of the file was also not the Administrators group that you see above.  I determined that these permissions are only set when the server joins the farm.  Even opening the ports and running the commands to provision the DCS did not fix them.
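
The same check can be done from PowerShell instead of the Properties dialog. This is a minimal sketch under stated assumptions: Test-AclHasPrincipal is a hypothetical helper, the principal is matched by the trailing name AppFabricCachingService regardless of its prefix, and the path to the config file is left as a placeholder because it is only shown in the screenshot.

```powershell
# Hypothetical helper: returns $true when any access entry in the ACL
# belongs to an identity whose name ends with the given principal.
function Test-AclHasPrincipal {
    param(
        [Parameter(Mandatory=$true)] $Acl,
        [string] $Principal = 'AppFabricCachingService'
    )
    $hits = @($Acl.Access | Where-Object {
        "$($_.IdentityReference)" -like "*$Principal"
    })
    return ($hits.Count -gt 0)
}

# Usage on a farm server (placeholder path, fill in the real location):
# $configPath = '<path to the DCS config file>'
# $acl = Get-Acl -Path $configPath
# Test-AclHasPrincipal -Acl $acl     # $false suggests the principal is missing
# $acl.Owner                         # compare with the Administrators group
```

A result of $false on a WFE but $true on the application server would match the symptom described above.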

Note: If you need the provider name and connection string for some PowerShell commands, open this config file in Notepad.
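
As an alternative to Notepad, the values can be pulled out with PowerShell's XML support. This is only a sketch: Get-ClusterConnectionInfo is a hypothetical helper, and it assumes the provider and connection string are stored as attributes named provider and connectionString on the same element. It searches for any element carrying a connectionString attribute rather than assuming a particular element name.

```powershell
# Hypothetical helper: lists provider/connectionString attribute pairs
# found anywhere in the supplied XML document.
function Get-ClusterConnectionInfo {
    param(
        [Parameter(Mandatory=$true)] [xml] $Config
    )
    $Config.SelectNodes('//*[@connectionString]') | ForEach-Object {
        [pscustomobject]@{
            Provider         = $_.GetAttribute('provider')
            ConnectionString = $_.GetAttribute('connectionString')
        }
    }
}

# Usage on a farm server (placeholder path, fill in the real location):
# $configPath = '<path to the DCS config file>'
# Get-ClusterConnectionInfo -Config ([xml](Get-Content -Path $configPath -Raw))
```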

PowerShell Commands

If you have a cache cluster, then you have more than one server in the farm.  A single-server farm will simply have a cache host, and ports will not be a factor.  It is when you add more than one server that it becomes a cluster.  Operating on a cluster requires different commands.

In the scenario above I needed to shut down the DCS and remove servers from the farm.  The best way to do this is to gracefully stop the service and then remove it from the cluster.

Log in to the server you wish to remove, and open an elevated PowerShell session:

Stop DCS
Run this script to gracefully stop the service:
PowerShell Code
Stop-SPDistributedCacheServiceInstance -Graceful

Remove DCS
Run this command to remove the server from the cluster:
PowerShell Code
Remove-SPDistributedCacheServiceInstance

Remove Server From Farm
(if necessary to provision the permissions)
Option 1: Navigate to Central Administration, Servers.  Click the link to Remove Server.
Option 2: Run a PowerShell command:
PowerShell Code
Disconnect-SPConfigurationDatabase -Confirm:$false

Join Farm:
Option 1: Use SharePoint Products Configuration Wizard.
Option 2: Run a PowerShell command:
PowerShell Code
Connect-SPConfigurationDatabase -DatabaseServer "ServerName\InstanceName" -DatabaseName "SharePointConfigurationDatabaseName" -Passphrase (ConvertTo-SecureString "MyP@ssw0rd" -AsPlainText -Force)

Install-SPHelpCollection -All
Initialize-SPResourceSecurity
Install-SPService
Install-SPFeature -AllExistingFeatures -Force
Install-SPApplicationContent
Start-Service SPTimerv4

Start DCS
Running this command on every server is all you need to install, provision, and start the DCS for a cluster:
PowerShell Code
Add-SPDistributedCacheServiceInstance

Services
See the services running and their Status:
PowerShell Code
Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache"} | ft Server, Status, Id -AutoSize

If the service is disabled on any server, the cluster is not working.  The best fix is to stop and remove the service on all servers with the commands above, then restart the servers.  Re-run the Add command on each server, then re-run the services command to check that they are online.
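
On a larger farm it can be easier to list only the instances that need attention. The following is a small sketch: Select-NotOnlineInstance is a hypothetical filter that passes through any object whose Status is not Online, and it is meant to sit between the Get-SPServiceInstance filter and the table formatting used above.

```powershell
# Hypothetical filter: passes through only the instances whose Status
# is something other than Online (for example Disabled).
function Select-NotOnlineInstance {
    param(
        [Parameter(ValueFromPipeline=$true)] $Instance
    )
    process {
        if ("$($Instance.Status)" -ne 'Online') { $Instance }
    }
}

# Usage on a farm server (requires the SharePoint cmdlets to be loaded):
# Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache"} |
#     Select-NotOnlineInstance | ft Server, Status, Id -AutoSize
```

An empty result means every Distributed Cache instance reports Online.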

Other Commands

Start-CacheCluster
Starts the Caching Service on all cache hosts in the cluster. Lead hosts are started first.  If Add-SPDistributedCacheServiceInstance works, this shouldn't be necessary.
PowerShell Code
Use-CacheCluster
Start-CacheCluster


Get-CacheHostConfig
Returns the configuration details of the specified cache host.  Run it once for each server name in your farm to view each host.
PowerShell Code
Get-CacheHostConfig -ComputerName <replace> -CachePort 22233

Sample Output of Get-CacheHostConfig

Get-CacheClusterHealth
Returns health statistics for all of the named caches in the cache cluster. This includes those that haven't been allocated yet.
PowerShell Code
Get-CacheClusterHealth
Sample Output of Get-CacheClusterHealth

Get-CacheHost
Lists all cache host services that are members of the cache cluster.
PowerShell Code
Get-CacheHost
Sample Output of Get-CacheHost

Cache Host Cluster is Null Error
This error will prevent you from successfully running the Stop-SPDistributedCacheServiceInstance command.  In this case we need to delete the service more forcefully:
PowerShell Code
#Get the service IDs
Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache"} | ft Server, Status, Id -AutoSize

#Run this command for each GUID:
$s = Get-SPServiceInstance 73465f0c-4122-4bd7-a218-b7e52aaf8d6d

$s.delete()
Then run Add-SPDistributedCacheServiceInstance from above on each server to provision it again.  If the cache cluster is still null or other cluster commands fail, check whether the ports are blocked, or unjoin and rejoin the failing servers.


Get-Cache
Lists all caches and regions in the cluster, and the cache host where each region resides. Without any parameters, all the cluster caches and their host-region details are returned. With Hostname and CachePort parameters provided, caches and region details are returned only for the specified host.
PowerShell Code
get-cache | ft -autosize
Sample Output of Get-Cache
