DataStax Help Center

New node in cluster unable to gossip, cannot determine workload of other nodes

Summary

A new (or rebuilt) node added to the cluster is unable to communicate with other nodes. In some instances, the node was previously part of the cluster and is still unable to gossip when added back in.

Symptoms

One of the tell-tale signs of this issue is that the node reports in the system.log that it is unable to determine the workload of other nodes in the cluster, for example:

 WARN [main] 2015-10-08 00:15:56,370 Workload.java (line 100) Couldn't determine workload for /10.1.2.3 from value NULL
 WARN [main] 2015-10-08 00:15:56,371 Workload.java (line 100) Couldn't determine workload for /10.1.2.4 from value NULL
 WARN [main] 2015-10-08 00:15:56,372 Workload.java (line 100) Couldn't determine workload for /10.1.2.5 from value NULL
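
To see which peers the affected node cannot classify, the warnings can be extracted from system.log. The following is a minimal Python sketch, assuming the log lines match the format of the sample above. The function name peers_with_unknown_workload and the regular expression are illustrative and are not part of any DataStax tooling.

```python
import re

# Pattern derived from the sample warnings above (an assumption);
# other versions may word the message differently.
WORKLOAD_WARNING = re.compile(
    r"Couldn't determine workload for /(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) from value NULL"
)


def peers_with_unknown_workload(log_lines):
    """Return the unique peer addresses named in workload warnings, in first-seen order."""
    peers = []
    for line in log_lines:
        match = WORKLOAD_WARNING.search(line)
        if match and match.group("ip") not in peers:
            peers.append(match.group("ip"))
    return peers
```

The function accepts any iterable of lines, so an open file handle for system.log can be passed directly. If the resulting list covers every other node in the cluster, the affected node is probably receiving no gossip state at all.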

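Other nodes see the affected node as operational, but the affected node itself is unable to gossip with them. In the sample nodetool gossipinfo output below, only one endpoint (/10.1.2.6, the affected node itself) carries full state. Every other endpoint shows generation:0 and heartbeat:0, which indicates that no gossip state has been received from it: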

/10.1.2.4
  generation:0
  heartbeat:0
/10.1.2.3
  generation:0
  heartbeat:0
/10.1.2.6
  generation:1444263348
  heartbeat:6232
  LOAD:2.0293227179E10
  INTERNAL_IP:10.26.81.97
  DC:DC1
  STATUS:NORMAL,-1041938454866204344
  HOST_ID:36fdcf57-0274-43b8-a501-c0e475e3e30b
  X_11_PADDING:{"workload":"Cassandra","active":"true"}
  RPC_ADDRESS:10.26.81.97
  RACK:RAC1
  SCHEMA:ce2a34e3-0967-34ea-ad55-10270b805218
  NET_VERSION:7
  RELEASE_VERSION:2.0.12.275
  SEVERITY:0.0
/10.1.2.5
  generation:0
  heartbeat:0
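
On a larger cluster, scanning this output by eye is tedious. The sketch below lists the endpoints for which no gossip state is known. It assumes the layout shown in the sample (an unindented endpoint line followed by indented key:value lines), and the function name endpoints_without_gossip_state is hypothetical.

```python
def endpoints_without_gossip_state(gossipinfo_text):
    """Return endpoints whose generation and heartbeat are both 0."""
    states = {}
    current = None
    for raw in gossipinfo_text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # Endpoint line such as "/10.1.2.4"; keep the part after the last "/".
            current = raw.strip().rsplit("/", 1)[-1]
            states[current] = {}
        elif current is not None and ":" in raw:
            # Indented "key:value" line; split on the first colon only.
            key, _, value = raw.strip().partition(":")
            states[current][key] = value
    return [
        endpoint
        for endpoint, fields in states.items()
        if fields.get("generation") == "0" and fields.get("heartbeat") == "0"
    ]
```

The input text could be captured with subprocess.run(["nodetool", "gossipinfo"], capture_output=True, text=True).stdout. For the sample above, the function would report 10.1.2.4, 10.1.2.3 and 10.1.2.5.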

Another symptom is that the affected node sees all other nodes in the cluster as belonging to a different DC and reports them as down (DN), as shown in this sample nodetool status output:

Datacenter: r1
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
DN  10.1.2.5  ?        256     9.0%   5279619a-550c-42b3-8150-61ad24f828f3  r1
DN  10.1.2.3  ?        256     9.1%   5d1fa459-cdac-4658-b68d-c6e0933afcee  r1
DN  10.1.2.4  ?        256     10.5%  a8f35c63-6a76-4e95-99f1-bef65d785366  r1
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load     Tokens  Owns   Host ID                               Rack
UN  10.1.2.6  18.9 GB  256     9.5%   36fdcf57-0274-43b8-a501-c0e475e3e30b  RAC1
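
The datacenter split can also be checked programmatically. The sketch below groups the nodes in nodetool status output by datacenter, assuming the layout of the sample above. The function name nodes_by_datacenter is illustrative only.

```python
import re

# A node row starts with a two-letter code: Up/Down, then Normal/Leaving/Joining/Moving.
NODE_LINE = re.compile(r"^(?P<state>[UD][NLJM])\s+(?P<address>\S+)")


def nodes_by_datacenter(status_text):
    """Map each datacenter name to a list of (state, address) tuples."""
    datacenters = {}
    current = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("Datacenter:"):
            current = line.split(":", 1)[1].strip()
            datacenters[current] = []
            continue
        match = NODE_LINE.match(line)
        if match and current is not None:
            datacenters[current].append((match.group("state"), match.group("address")))
    return datacenters
```

A result in which the local node is the only UN entry in its datacenter, while every other node is DN under a different datacenter name, matches the symptom described here.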

Cause

The gossip protocol is used by the nodes to communicate information within the cluster. Gossip issues are usually related to problems with either snitch/topology configuration or the network layer. 

In this case, the most common cause of the symptoms above is a misconfigured firewall or VLAN.

Solution

Use the following checklist to identify the cause of the issue:

  • Check software firewalls such as iptables or firewalld for misconfiguration.
  • Check for missed steps in your organisation's server provisioning process. Were security settings inadvertently applied to the node?
  • Check ports on network devices for misconfiguration.
  • Check network policies such as quality-of-service (QoS) or bandwidth throttling rules for misconfiguration. Do they apply to this environment?

NOTE - The standard gossip TCP port is 7000, or 7001 for SSL-secured clusters.
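
A quick way to test whether the gossip port is reachable is to attempt a plain TCP connection to each peer. The sketch below uses only the Python standard library. The helper names can_connect and check_gossip_port are hypothetical, and the default port of 7000 is taken from the note above (use 7001 for SSL-secured clusters, or whatever port your configuration specifies).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_gossip_port(peers, port=7000, timeout=3.0):
    """Map each peer address to whether its gossip port accepted a TCP connection."""
    return {peer: can_connect(peer, port, timeout) for peer in peers}
```

For example, check_gossip_port(["10.1.2.3", "10.1.2.4", "10.1.2.5"]) run on the affected node shows which peers it can reach. Since firewall rules are often asymmetric, it is worth running the same check from a healthy node towards the affected node as well. A successful connection only shows that the port is open; it does not prove that gossip itself is working.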

See also

DataStax doc - Configuring firewall port access
