This article discusses an issue where nodes are unable to gossip with other nodes which have been running for a long time.
A cluster with long-running DSE processes reports nodes as down (e.g. in nodetool status output) after DSE has been restarted. The following sample warning is logged:
WARN [GossipStage:1] 2017-03-01 13:28:19,589 Gossiper.java:1105 - received an invalid gossip generation for peer /10.1.2.3; local generation = 1455182503, received generation = 1488365632
In CASSANDRA-8113, gossip with generation numbers set too far into the future (i.e. corrupted gossip from a node) is ignored to prevent the corruption from bringing down the rest of the cluster. This enhancement is present in Apache Cassandra 2.1.1 and later.
This behaviour inadvertently prevented long-running nodes from gossiping with nodes which have just been restarted or have just joined the cluster. Specifically, the problem occurs when the difference between a node's local gossip generation and the generation received from another node exceeds the one-year threshold, as identified in CASSANDRA-10969.
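The arithmetic behind the sample warning can be checked with a short sketch. This is an illustration only, not Cassandra's source code: the function names and the constant name are hypothetical, the threshold is assumed to be 365 days expressed in seconds, and the generations are assumed to be seconds since the Unix epoch.

```python
from datetime import datetime, timezone

# Assumption: the "one-year threshold" is 365 days, expressed in seconds.
MAX_GENERATION_DIFFERENCE = 365 * 24 * 60 * 60


def is_generation_rejected(local_generation: int, received_generation: int) -> bool:
    """Return True when the received generation is more than one year ahead
    of the locally held generation (the condition described above)."""
    return received_generation > local_generation + MAX_GENERATION_DIFFERENCE


def as_utc(generation: int) -> str:
    """Render a generation (assumed to be epoch seconds) as a UTC timestamp."""
    return datetime.fromtimestamp(generation, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Values copied from the sample warning.
local_generation = 1455182503
received_generation = 1488365632

difference = received_generation - local_generation
print("local generation   :", as_utc(local_generation))
print("received generation:", as_utc(received_generation))
print("difference (days)  :", round(difference / 86400, 1))
print("rejected           :", is_generation_rejected(local_generation, received_generation))
```

With the values from the warning, the two generations are roughly 384 days apart, which is past the assumed 365-day limit, so the received generation is treated as invalid.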
Perform a rolling restart of all nodes in all data centres to force the nodes' gossip generation to reset to a lower value.
NOTE - In some situations, newly restarted nodes will already have gossiped with nodes that still hold very old generations, and those old generations end up "contaminating" the gossip pool before the remaining nodes are restarted. In that case, it will be necessary to perform several rolling restarts until the old generations are purged from gossip.
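One way to judge whether another rolling restart is needed is to check whether the warning still appears in the logs written after the last restart. The sketch below is a hypothetical helper, not part of DSE or Cassandra. It assumes the warning text matches the sample shown earlier and that the relevant log lines have already been read into a list of strings.

```python
import re

# Pattern modelled on the sample warning; the exact wording is assumed
# to be the same in the version being checked.
INVALID_GENERATION = re.compile(
    r"received an invalid gossip generation for peer /(?P<peer>[^;\s]+); "
    r"local generation = (?P<local>\d+), received generation = (?P<received>\d+)"
)


def rejected_peers(log_lines):
    """Map each peer named in an invalid-generation warning to the most recent
    (local generation, received generation) pair seen for it."""
    peers = {}
    for line in log_lines:
        match = INVALID_GENERATION.search(line)
        if match:
            peers[match.group("peer")] = (
                int(match.group("local")),
                int(match.group("received")),
            )
    return peers


sample_lines = [
    "WARN [GossipStage:1] 2017-03-01 13:28:19,589 Gossiper.java:1105 - "
    "received an invalid gossip generation for peer /10.1.2.3; "
    "local generation = 1455182503, received generation = 1488365632",
    "an unrelated log line",
]

for peer, (local, received) in rejected_peers(sample_lines).items():
    print(f"{peer}: local={local} received={received}")
```

If the function returns an empty mapping for the logs written since the most recent rolling restart, no peer is being rejected any more according to those logs. A non-empty result suggests old generations are still circulating.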
Cassandra JIRA - CASSANDRA-8113 Gossip should ignore generation numbers too far in the future
Cassandra JIRA - CASSANDRA-10969 Fix bad gossip generation seen in long-running clusters