DataStax Help Center

Nodes repeatedly reporting FailureDetector errors of "unknown endpoint" for decommissioned nodes

Summary

This article discusses FailureDetector "unknown endpoint" errors which reference nodes that are no longer part of the cluster.

Symptoms

In the affected nodes' system.log, the following errors are repeatedly reported:

ERROR [pool-10-thread-1] 2016-04-28 21:36:38,592  FailureDetector.java:223 - unknown endpoint /10.1.2.3
ERROR [pool-10-thread-1] 2016-04-28 21:36:43,660  FailureDetector.java:223 - unknown endpoint /10.1.2.3
ERROR [pool-10-thread-1] 2016-04-28 21:36:48,750  FailureDetector.java:223 - unknown endpoint /10.1.2.3
ERROR [pool-10-thread-1] 2016-04-28 21:36:53,815  FailureDetector.java:223 - unknown endpoint /10.1.2.3
ERROR [pool-10-thread-1] 2016-04-28 21:36:58,903  FailureDetector.java:223 - unknown endpoint /10.1.2.3
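To gauge how often these errors occur and which endpoints they reference, a short script such as the following can be run over the log. This is a minimal sketch: the regular expression is based only on the sample lines above, and reading the actual system.log from disk (rather than the embedded sample) is left to the operator.

```python
import re
from collections import Counter

# Sample lines copied from the Symptoms section above; in practice,
# read the lines from the node's system.log instead.
sample_log = """\
ERROR [pool-10-thread-1] 2016-04-28 21:36:38,592  FailureDetector.java:223 - unknown endpoint /10.1.2.3
ERROR [pool-10-thread-1] 2016-04-28 21:36:43,660  FailureDetector.java:223 - unknown endpoint /10.1.2.3
"""

# Match the IP address that FailureDetector reports as an unknown endpoint.
pattern = re.compile(r"FailureDetector\.java:\d+ - unknown endpoint /(\S+)")

# Count occurrences per endpoint IP.
counts = Counter(
    m.group(1)
    for line in sample_log.splitlines()
    if (m := pattern.search(line))
)
print(counts)  # e.g. Counter({'10.1.2.3': 2})
```

The resulting counts identify the ghost IP addresses to look for in the dse_system tables in the Workaround below.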

Cause

These errors reference nodes which were previously running in Analytics mode but are no longer part of the cluster (internal bug ID DSP-9647).

For nodes running in DSE Analytics mode, LeaderManagerWatcher.java checks for leader updates every 5 seconds as part of the DSE Analytics high-availability feature. As part of this check, it calls FailureDetector.java to verify that all candidates (Analytics nodes) are alive.

The issue arises when an Analytics node which used to be a leader is no longer part of the cluster and the data centre it belonged to no longer exists. When the node (and later its associated data centre) is decommissioned, its entries are not cleaned out of the dse_system tables.

NOTE - This error has no adverse effect on the performance of the cluster; it simply generates unnecessary entries in the logs.

To illustrate, consider the following cluster with 2 data centres:

Datacenter: Primary
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.240.0.26  129.23 KB  1       ?       997f9c62-a014-4d05-8c0b-dd9707e52643  RAC1
UN  10.240.0.28  115.86 KB  1       ?       663f66a7-0f0d-490b-b2fa-c133db479ad6  RAC1
Datacenter: Wrong
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.240.0.29  100.62 KB  1       ?       0842d456-fb42-42df-b492-0fa8fa2f02f0  RAC1

The LeaderManager tables get populated as follows:

$ bin/cqlsh -e "SELECT * FROM dse_system.real_leaders"

 army     | datacenter | address
----------+------------+-------------
 HadoopJT |    Primary | 10.240.0.26
 HadoopJT |      Wrong | 10.240.0.29
$ bin/cqlsh -e "SELECT * FROM dse_system.registered_leaders"

 army     | datacenter | candidates | required
----------+------------+------------+----------
 HadoopJT |    Primary |       null |      50%
 HadoopJT |      Wrong |       null |      50%

If the node 10.240.0.29 is then decommissioned:

Datacenter: Primary
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load       Tokens  Owns    Host ID                               Rack
UN  10.240.0.26  129.23 KB  1       ?       997f9c62-a014-4d05-8c0b-dd9707e52643  RAC1
UN  10.240.0.28  115.86 KB  1       ?       663f66a7-0f0d-490b-b2fa-c133db479ad6  RAC1

The LeaderManager tables still contain entries for the non-existent node and data centre:

$ bin/cqlsh -e "SELECT * FROM dse_system.real_leaders"

 army     | datacenter | address
----------+------------+-------------
 HadoopJT |    Primary | 10.240.0.26
 HadoopJT |      Wrong | 10.240.0.29
$ bin/cqlsh -e "SELECT * FROM dse_system.registered_leaders"

 army     | datacenter | candidates | required
----------+------------+------------+----------
 HadoopJT |    Primary |       null |      50%
 HadoopJT |      Wrong |       null |      50%

Workaround

Manually clean out the offending node's records from the tables in the dse_system keyspace as follows:

Step 1 - Check for the existence of the offending IP and DC.

cqlsh> SELECT * FROM dse_system.real_leaders ;
cqlsh> SELECT * FROM dse_system.registered_leaders ;

Step 2 - Delete records containing the IP address and DC.

cqlsh> DELETE FROM dse_system.real_leaders WHERE army = 'HadoopJT' AND datacenter = '<ghost_DC>' AND address = '<ghost_IP>' ;
cqlsh> DELETE FROM dse_system.registered_leaders WHERE army = 'HadoopJT' AND datacenter = '<ghost_DC>' ;
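Step 3 - (Optional) Re-run the queries from Step 1 to confirm that the rows for the ghost data centre are gone; the "unknown endpoint" errors should then stop appearing in system.log.

cqlsh> SELECT * FROM dse_system.real_leaders ;
cqlsh> SELECT * FROM dse_system.registered_leaders ;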

Solution

DSP-9647 has been fixed in DataStax Enterprise versions 4.7.9 and 4.8.8. Upgrade to one of these versions (or later) to obtain the fix.

See also

DataStax doc - Highly available Spark in DSE Analytics
