SharePoint 2013 Distributed Cache Issues
Installing, configuring, and troubleshooting the finicky Distributed Cache
Recently I worked on several farms where the Distributed Cache threw many errors and would not start correctly. I want to share what I did to resolve this.
Issue 1: Ports
Requirement: If you can, make sure the ports needed for the Distributed Cache Service (DCS) are open between the farm servers BEFORE you create the farm or join a server to it.
Why: The DCS provisions the AppFabric Caching Service on each server. If these ports are blocked, that provisioning fails, and the only fix afterwards is to unjoin the failing server(s) from the farm and re-add them.
Scenario: The farm has 4 servers: 1 database, 1 application, and 2 web front ends (WFEs) with NLB. The WFEs sit in a separate VLAN or protection domain, with a firewall appliance separating them from the App tier and the database server. The firewalls block everything unless a rule allows the traffic, and each rule must define the direction of the traffic, the ports, and the zones (protection domains/VLANs) involved. The DCS ports were blocked while the farm was being set up. Afterwards, the AppFabric service and the DCS reported errors, the service was stopped, and PowerShell commands against it failed. After opening the ports, unjoining the WFEs from the farm, and adding them back, the service worked.
Ports: Distributed cache requires that communication between the farm servers (not database server) occur on the following ports:
Port Name | TCP/IP Port | Purpose |
---|---|---|
cache port | 22233 | The cache port is used for transmitting data between the cache hosts and your cache-enabled application. |
cluster port | 22234 | The cache hosts use the cluster port to communicate availability to each of their neighbors in the cluster. |
arbitration port | 22235 | If a cache host fails, the arbitration port is used to make certain that the cache host is unavailable. |
replication port | 22236 | The replication port is used to move data between hosts in the cache cluster. This supports features such as high availability and load balancing. |
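One way to check these ports from each farm server is a plain TCP connect test. The sketch below is not part of the original setup: the server names are hypothetical placeholders, and it uses the .NET TcpClient class so it does not depend on newer networking cmdlets. Keep in mind that a port only answers when something is listening on it, so before the DCS is provisioned a quick refusal can still mean traffic is getting through, while a timeout more often points to a firewall silently dropping it.

```powershell
# Hypothetical server names; replace with the WFE and App servers in your farm.
$servers   = 'WFE01', 'WFE02', 'APP01'
$ports     = 22233..22236
$timeoutMs = 3000

foreach ($server in $servers) {
    foreach ($port in $ports) {
        $client = New-Object System.Net.Sockets.TcpClient
        try {
            # Start an asynchronous connect and wait a limited time for it to finish.
            $async = $client.BeginConnect($server, $port, $null, $null)
            if ($async.AsyncWaitHandle.WaitOne($timeoutMs)) {
                # EndConnect throws if the connection was refused or the host is unreachable.
                $client.EndConnect($async)
                $result = 'Open'
            } else {
                $result = 'Timed out'
            }
        } catch {
            $result = 'Refused or unreachable'
        } finally {
            $client.Close()
        }
        [pscustomobject]@{ Server = $server; Port = $port; Result = $result }
    }
}
```

Run it from a server on each side of the firewall, since rules are defined per direction.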
Issue 2: Permissions Missing
Requirement: The DCS config file, which tells the AppFabric service which provider and database connection string to use, must grant read permission to a specific security account. That permission is only set when the ports are open and you add a server to a farm. I tried to add this account later, but it is a security account, so it is not something you can select and add.
Note: If you need the provider name and connection string for some PowerShell commands, open this config file in Notepad.
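As an alternative to Notepad, the same values can be read from PowerShell. This is a rough sketch under two assumptions that do not come from the text above: that the file sits at the default AppFabric 1.1 install path, and that the provider and connection string are attributes of a clusterConfig element. Adjust both if your file differs. The last line lists the accounts that currently have access to the file.

```powershell
# Assumed default location; adjust if AppFabric is installed elsewhere.
$configPath = 'C:\Program Files\AppFabric 1.1 for Windows Server\DistributedCacheService.exe.config'

# Look for a clusterConfig element anywhere in the file (assumed element name).
$node = (Select-Xml -Path $configPath -XPath '//clusterConfig').Node
Write-Host "Provider:          $($node.provider)"
Write-Host "Connection string: $($node.connectionString)"

# Show which accounts currently have permissions on the config file.
(Get-Acl -Path $configPath).Access | ft IdentityReference, FileSystemRights -AutoSize
```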
PowerShell Commands
If you have a cache cluster, then you have more than one server in the farm. A single-server farm will simply have a cache host, and obviously ports will not be a factor. It is when you add more than one server that it becomes a cluster, and operating on a cluster requires different commands.
In the scenario above I needed to shut down the DCS and remove servers from the farm. The best way to do this is to gracefully stop the service and remove it from the cluster:
Log in to the server you wish to remove, and open an elevated PowerShell session:
Stop DCS
Run this script to gracefully stop the service:
PowerShell Code
Stop-SPDistributedCacheServiceInstance -Graceful
Remove DCS
Run this command to remove the server from the cluster:
PowerShell Code
Remove-SPDistributedCacheServiceInstance
Remove Server From Farm
(if necessary to provision the permissions)
Option 1: Navigate to Central Administration, Servers. Click the link to Remove Server.
Option 2: Run a PowerShell command:
PowerShell Code
Disconnect-SPConfigurationDatabase -Confirm:$false
Join Farm:
Option 1: Use SharePoint Products Configuration Wizard.
Option 2: Run a PowerShell command:
PowerShell Code
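# Join the server to the existing farm (replace the server, database, and passphrase values)
Connect-SPConfigurationDatabase -DatabaseServer "ServerName\InstanceName" -DatabaseName "SharePointConfigurationDatabaseName" -Passphrase (ConvertTo-SecureString "MyP@ssw0rd" -AsPlainText -Force)
# Install help files, secure resources, and register services and features
Install-SPHelpCollection -All
Initialize-SPResourceSecurity
Install-SPService
Install-SPFeature -AllExistingFeatures -Force
Install-SPApplicationContent
# Start the SharePoint Timer service
Start-Service SPTimerv4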
Start DCS
Running this command on every server is all you need to install, provision, and start the DCS for a cluster:
PowerShell Code
Add-SPDistributedCacheServiceInstance
Services
See the services running and their Status:
PowerShell Code
Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache"} | ft Server, Status, Id -AutoSize
If the service on any server is disabled, the cluster is not working. The best way to fix this is to stop and remove the service on all servers with the commands above, then restart the servers. Re-run the Add command above on each server, then re-run the services command above to see if they are online.
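To spot a problem instance quickly after re-adding the service, the status listing can be narrowed to anything that is not Online. This is only a sketch: it assumes the SharePoint snap-in is available on the server and reuses the "Distributed Cache" type name from the command above.

```powershell
# Load the SharePoint cmdlets if this is a plain PowerShell session.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Collect all Distributed Cache service instances in the farm.
$instances = @(Get-SPServiceInstance | ? { $_.TypeName -eq 'Distributed Cache' })

# Keep only the instances whose status is anything other than Online.
$notOnline = @($instances | ? { $_.Status -ne 'Online' })

if ($notOnline.Count -gt 0) {
    Write-Warning "$($notOnline.Count) of $($instances.Count) Distributed Cache instance(s) are not Online."
    $notOnline | ft Server, Status, Id -AutoSize
} else {
    Write-Host "All $($instances.Count) Distributed Cache instance(s) report Online."
}
```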
Other Commands
Start-CacheCluster
Starts the Caching Service on all cache hosts in the cluster. Lead hosts are started first. If Add-SPDistributedCacheServiceInstance works, this shouldn't be necessary.
PowerShell Code
Use-CacheCluster
Start-CacheCluster
Get-CacheHostConfig
Returns the configuration details of the specified cache host. Run it once for each server name in your farm to view each host.
PowerShell Code
Get-CacheHostConfig -ComputerName <replace> -CachePort 22233
Get-CacheClusterHealth
Return health statistics for all of the named caches in the cache cluster. This includes those that haven't been allocated yet.
PowerShell Code
Get-CacheClusterHealth
Get-CacheHost
List all cache host services that are members of the cache cluster.
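No example is given for this cmdlet, so here is a minimal sketch. It assumes the session runs on a cache host where the AppFabric caching cmdlets are available (for example, the SharePoint 2013 Management Shell), and it calls Use-CacheCluster first, as in the Start-CacheCluster example, to connect the session to the cluster.

```powershell
# Connect this PowerShell session to the cache cluster.
Use-CacheCluster

# List every cache host in the cluster along with its service status.
Get-CacheHost
```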
Cache Host Cluster is Null Error
This error prevents the Stop-SPDistributedCacheServiceInstance command from running successfully. In this case we need to delete the service instance more forcefully:
PowerShell Code
# Get the service IDs
Get-SPServiceInstance | ? {$_.TypeName -eq "Distributed Cache"} | ft Server, Status, Id -AutoSize
# Run these commands for each GUID:
$s = Get-SPServiceInstance 73465f0c-4122-4bd7-a218-b7e52aaf8d6d
$s.Delete()
Get-Cache
Lists all caches and regions in the cluster, and the cache host where each region resides. Without any parameters, all the cluster caches and their host-region details are returned. With Hostname and CachePort parameters provided, caches and region details are returned only for the specified host.
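A short sketch of both forms described above, assuming a session on a cache host with the AppFabric caching cmdlets available. The server name WFE01 is a hypothetical placeholder, and 22233 is the cache port from the ports table.

```powershell
# Connect this PowerShell session to the cache cluster.
Use-CacheCluster

# All caches and their host-region details across the cluster.
Get-Cache

# Only the caches and regions on one host (hypothetical server name).
Get-Cache -HostName 'WFE01' -CachePort 22233
```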