Summary
A new (or rebuilt) node added to the cluster is unable to communicate with other nodes. In some instances, the node was previously part of the cluster and is still unable to gossip when added back in.
Symptoms
One of the tell-tale signs of this issue is that the node reports in the system.log that it is unable to determine the workload of other nodes in the cluster, for example:

WARN [main] 2015-10-08 00:15:56,370 Workload.java (line 100) Couldn't determine workload for /10.1.2.3 from value NULL
WARN [main] 2015-10-08 00:15:56,371 Workload.java (line 100) Couldn't determine workload for /10.1.2.4 from value NULL
WARN [main] 2015-10-08 00:15:56,372 Workload.java (line 100) Couldn't determine workload for /10.1.2.5 from value NULL
Other nodes are able to see the affected node as operational, but the affected node itself is unable to gossip with them. Here is sample output of nodetool gossipinfo:

/10.1.2.4
  generation:0
  heartbeat:0
/10.1.2.3
  generation:0
  heartbeat:0
/10.1.2.6
  generation:1444263348
  heartbeat:6232
  LOAD:2.0293227179E10
  INTERNAL_IP:10.26.81.97
  DC:DC1
  STATUS:NORMAL,-1041938454866204344
  HOST_ID:36fdcf57-0274-43b8-a501-c0e475e3e30b
  X_11_PADDING:{"workload":"Cassandra","active":"true"}
  RPC_ADDRESS:10.26.81.97
  RACK:RAC1
  SCHEMA:ce2a34e3-0967-34ea-ad55-10270b805218
  NET_VERSION:7
  RELEASE_VERSION:2.0.12.275
  SEVERITY:0.0
/10.1.2.5
  generation:0
  heartbeat:0

Note that the peers show generation:0 and heartbeat:0, which indicates the affected node holds no gossip state for them.
Another symptom is that the affected node sees all other nodes in the cluster as belonging to another DC, as shown in this sample nodetool status output:

Datacenter: r1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
DN  10.1.2.5  ?        256     9.0%   5279619a-550c-42b3-8150-61ad24f828f3  r1
DN  10.1.2.3  ?        256     9.1%   5d1fa459-cdac-4658-b68d-c6e0933afcee  r1
DN  10.1.2.4  ?        256     10.5%  a8f35c63-6a76-4e95-99f1-bef65d785366  r1
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
UN  10.1.2.6  18.9 GB  256     9.5%   36fdcf57-0274-43b8-a501-c0e475e3e30b  RAC1
Cause
Nodes use the gossip protocol to exchange state information within the cluster. Gossip issues usually stem from problems with either the snitch/topology configuration or the network layer.
In this case, the most common cause of the symptoms above is a misconfigured firewall or VLAN.
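On the topology side, a quick sanity check is the snitch configuration. As an illustration only (the values below are assumptions matching the example cluster, not taken from the affected node): with GossipingPropertyFileSnitch, each node advertises its datacenter and rack from conf/cassandra-rackdc.properties, and a node whose file disagrees with its peers will report them under an unexpected datacenter, as in the nodetool status output above.

```
# conf/cassandra-rackdc.properties (read by GossipingPropertyFileSnitch)
# Example values matching the sample cluster above -- verify against your own topology.
dc=DC1
rack=RAC1
```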
Solution
Use the following checklist to identify the cause of the issue:
- Check software firewalls such as iptables or firewalld for misconfiguration.
- Check for missed steps in your organisation's server provisioning process: were security settings inadvertently applied to the node?
- Check the ports on network devices for misconfiguration.
- Check network policies such as quality-of-service (QoS) or bandwidth-throttling rules for misconfiguration: do they apply to this environment?
NOTE - The standard gossip TCP port is 7000 (storage_port in cassandra.yaml), or 7001 (ssl_storage_port) when internode encryption is enabled.
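The firewall items in the checklist can be spot-checked from the affected node with a small script. This is a sketch under assumptions: the peer IPs are taken from the example output above (replace them with your own), and the cluster uses the default gossip port. It uses bash's built-in /dev/tcp redirection to test TCP reachability:

```shell
#!/usr/bin/env bash
# check_port HOST PORT -> prints OK or BLOCKED?, returns non-zero on failure
check_port() {
  local host="$1" port="$2"
  if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK       ${host}:${port}"
  else
    echo "BLOCKED? ${host}:${port}  (check firewall/VLAN/QoS rules)"
    return 1
  fi
}

# Peer addresses below are from the example cluster -- substitute the
# output of "nodetool status" for your own topology.
for host in 10.1.2.3 10.1.2.4 10.1.2.5; do
  check_port "$host" 7000 || true   # use 7001 instead if internode SSL is enabled
done
```

A "BLOCKED?" result does not distinguish a down node from a filtered port, so confirm from a second node that can gossip successfully before concluding the firewall is at fault.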
See also
DataStax doc - Configuring firewall port access