Summary
Nodes are intermittently unable to gossip properly when configured with the GossipingPropertyFileSnitch.
Symptoms
One symptom is nodes randomly going up and down for no apparent reason.
INFO [GossipTasks:1] 2016-04-29 02:47:32,559 Gossiper.java:1001 - InetAddress /10.1.2.3 is now DOWN
INFO [GossipTasks:1] 2016-04-29 02:50:47,123 Gossiper.java:1001 - InetAddress /10.1.2.4 is now DOWN
INFO [GossipTasks:1] 2016-04-29 02:54:59,640 Gossiper.java:1001 - InetAddress /10.1.2.5 is now DOWN
INFO [SharedPool-Worker-2] 2016-04-29 03:01:23,828 Gossiper.java:987 - InetAddress /10.1.2.4 is now UP
INFO [SharedPool-Worker-1] 2016-04-29 03:01:59,432 Gossiper.java:987 - InetAddress /10.1.2.5 is now UP
INFO [SharedPool-Worker-7] 2016-04-29 03:02:01,839 Gossiper.java:987 - InetAddress /10.1.2.3 is now UP
Similarly, different nodes appear to be down in the nodetool status output depending on which node the command was run on, for example:
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load      Tokens  Owns  Host ID                               Rack
DN  10.1.2.3  8.97 GB   256     ?     a50dfef5-229d-4d15-89d9-971bec01094b  rack1
UN  10.1.2.5  8.9 GB    256     ?     a16b71a2-9b95-4669-a6bd-d7326bd279e2  rack1
DN  10.1.2.4  9.09 GB   256     ?     ac01b6f9-3cb9-47ff-83c6-0404836386eb  rack1
UN  10.1.2.6  10.65 GB  256     ?     9c0ef3a2-aad7-4d06-b015-f32ddccac750  rack1
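To gauge how often nodes are flapping, the Gossiper entries shown earlier in this section can be tallied per address. The following is a minimal sketch, not a supported tool: the function name count_state_changes is hypothetical, and the pattern only assumes the "InetAddress /<address> is now UP|DOWN" wording seen in the sample log.

```python
import re
from collections import Counter

# Matches the Gossiper state-change wording from the sample log entries.
STATE_CHANGE = re.compile(r"InetAddress /(\S+) is now (UP|DOWN)")


def count_state_changes(lines):
    """Return {address: Counter({'UP': n, 'DOWN': m})} for the given log lines.

    Hypothetical helper for illustration; it ignores lines that do not
    contain a Gossiper UP/DOWN transition.
    """
    changes = {}
    for line in lines:
        # findall also copes with several entries run together on one line.
        for address, state in STATE_CHANGE.findall(line):
            changes.setdefault(address, Counter())[state] += 1
    return changes
```

The function accepts any iterable of strings, so an open log file can be passed directly. The location of the log file depends on the installation and is not assumed here.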
Cause
The problem has only been seen in a very small number of clusters and is still under investigation. However, it has been identified that the problem occurs when the cassandra-topology.properties file exists while nodes are configured with GossipingPropertyFileSnitch. In that case the snitch logs that it has loaded the file:
INFO [main] 2016-04-29 15:31:26,039 GossipingPropertyFileSnitch.java:71 - Loaded cassandra-topology.properties for compatibility
It is important to note that the issue is highly intermittent, and not all of the conditions that trigger it are known yet.
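A node's log can be checked for the compatibility message quoted above to see whether the snitch picked up the legacy file. This is an illustrative sketch only; the function name loaded_legacy_topology is hypothetical, and the match relies solely on the message text shown in this article.

```python
# Message text taken from the GossipingPropertyFileSnitch log entry above.
COMPAT_MESSAGE = "Loaded cassandra-topology.properties for compatibility"


def loaded_legacy_topology(log_lines):
    """Return True if any line reports that the legacy topology file was loaded.

    Hypothetical helper for illustration; accepts any iterable of strings.
    """
    return any(COMPAT_MESSAGE in line for line in log_lines)
```

A True result means only that the file was present and loaded; as noted, the presence of the file does not always produce the gossip symptoms.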
Workaround
By design, the GossipingPropertyFileSnitch falls back on the PropertyFileSnitch's cassandra-topology.properties as a means to allow clusters to be migrated to GossipingPropertyFileSnitch.
If the cluster is already on GossipingPropertyFileSnitch, ensure that cassandra-topology.properties has been removed or does not exist on the nodes. Do this even if the nodes currently show no issues, so that the cluster does not encounter the problem in the future.
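The check described above can be scripted per node. The sketch below is one possible approach under stated assumptions: the function name retire_topology_file is hypothetical, the configuration directory must be supplied by the caller because its location varies by installation, and the file is renamed with a .bak suffix rather than deleted so it can be restored if needed.

```python
from pathlib import Path

TOPOLOGY_FILE = "cassandra-topology.properties"


def retire_topology_file(conf_dir):
    """Rename the legacy topology file in conf_dir so it no longer exists
    under its original name.

    Hypothetical helper for illustration. Returns the backup path, or None
    if the file was not present. conf_dir is an assumption supplied by the
    caller; this sketch does not guess where the configuration lives.
    """
    topology = Path(conf_dir) / TOPOLOGY_FILE
    if not topology.exists():
        return None
    # Keep a copy under a different name instead of deleting outright.
    backup = topology.with_name(TOPOLOGY_FILE + ".bak")
    topology.rename(backup)
    return backup
```

Calling the function a second time on the same directory returns None, so it is safe to run on nodes where the file has already been removed.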
See also
DataStax doc - Snitch - GossipingPropertyFileSnitch
Cassandra JIRA - CASSANDRA-11508 GPFS property file should more clearly explain the relationship with PFS