Summary
This article discusses an issue where hints replay prevents DataStax Enterprise from starting on a node.
Applies to
- DataStax Enterprise 6.7
- DataStax Enterprise 6.0
- DataStax Enterprise 5.1
- DataStax Enterprise 5.0
Symptom
Attempts to restart DSE on a node fail, and the logs report errors reading a hints file for replay. Here is a sample entry in the system.log of a DSE 5.0.10 node:
WARN [HintsDispatcher:1] 2019-04-03 02:07:50,165 HintsReader.java:329 - Failed to read a hint for cbe51351-f538-46a0-9f33-ddb8568a702e - digest mismatch for hint at position 52425457 in file cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints
ERROR [HintsDispatcher:1] 2019-04-03 02:07:50,166 HintsDispatchExecutor.java:231 - Failed to dispatch hints file cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints: file is corrupted ({})
org.apache.cassandra.io.FSReadError: java.io.IOException: Digest mismatch exception
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:296) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:261) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.sendHints(HintsDispatcher.java:157) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.sendHintsAndAwait(HintsDispatcher.java:138) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.dispatch(HintsDispatcher.java:123) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.dispatch(HintsDispatcher.java:95) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    ...
Caused by: java.io.IOException: Digest mismatch exception
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNextInternal(HintsReader.java:313) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:287) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    ...
As a result of the failure, startup fails and DSE shuts down:
ERROR [HintsDispatcher:1] 2019-04-03 02:07:50,167 StorageService.java:424 - Stopping gossiper
WARN [HintsDispatcher:1] 2019-04-03 02:07:50,168 StorageService.java:315 - Stopping gossip by operator request
INFO [HintsDispatcher:1] 2019-04-03 02:07:50,168 Gossiper.java:1538 - Announcing shutdown
INFO [HintsDispatcher:1] 2019-04-03 02:07:50,168 StorageService.java:2181 - Node /10.1.2.3 state jump to shutdown
ERROR [HintsDispatcher:1] 2019-04-03 02:07:52,169 StorageService.java:429 - Stopping RPC server
INFO [HintsDispatcher:1] 2019-04-03 02:07:52,169 ThriftServer.java:142 - Stop listening to thrift clients
ERROR [HintsDispatcher:1] 2019-04-03 02:07:52,169 StorageService.java:434 - Stopping native transport
INFO [HintsDispatcher:1] 2019-04-03 02:07:52,172 Server.java:180 - Stop listening for CQL clients
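When the log is long, the names of the affected hints files can be pulled out with a small filter. The following is a minimal sketch, assuming the log messages match the sample entry above ("Failed to dispatch hints file ... file is corrupted"). The function name corrupt_hints_files is hypothetical, and the default log path is the one used elsewhere in this article.

```shell
# List each hints file that the log reports as corrupted, once per file.
# Assumes the message wording shown in the sample log entry above.
corrupt_hints_files() {
    grep 'Failed to dispatch hints file' "${1:-/var/log/cassandra/system.log}" \
        | grep 'file is corrupted' \
        | grep -oE '[0-9a-f-]{36}-[0-9]+-[0-9]+\.hints' \
        | sort -u
}
```

For example, running corrupt_hints_files /var/log/cassandra/system.log prints one hints file name per line.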
Cause
The hints storage implementation was rewritten in Cassandra 3.0 (CASSANDRA-6230). In DSE 5.0 and later releases, hints data that used to be stored in the system.hints table is now stored in files.
There are several reasons for hints files to get corrupted. The most common causes are:
- Hardware failure
- Bad disk region preventing reads
- DSE process getting terminated with a kill signal
- Linux oom-killer
- File system running out of disk space
In all the scenarios above except for bad disks, the hints file is not completely flushed ("fsynced") to disk because the file system write was interrupted. As a result, the digest (a checksum) of the hints file does not match the expected value in the corresponding CRC file (a Cyclic Redundancy Check checksum file used to verify data integrity).
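To narrow down which of these causes applies, the kernel log and the file system holding the hints directory are reasonable first places to look. The sketch below is illustrative only: both function names are hypothetical, the default path is the default hints directory mentioned in this article, and the oom-killer patterns are assumptions about typical Linux kernel messages that can vary between kernel versions.

```shell
# Print kernel log lines (read from stdin) that suggest the oom-killer ran.
# The patterns are assumed common kernel wording, not an exhaustive list.
find_oom_events() {
    grep -iE 'out of memory|oom-killer|killed process'
}

# Print the percentage of space used on the file system holding a directory.
# df -P prints one POSIX-format line per file system; column 5 is the capacity.
hints_fs_used_percent() {
    df -P "${1:-/var/lib/cassandra/hints}" \
        | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Typical use would be to pipe the kernel log into the first function (dmesg | find_oom_events, which may need elevated privileges) and to call hints_fs_used_percent with the configured hints directory.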
Workaround
A corrupt hints file cannot be recovered directly. Its contents (mutations that did not reach one or more replicas) can instead be sourced from other replicas in the cluster by running nodetool repair.
Step 1 - Shut down DSE on the affected node to ensure that no orphaned DSE threads are still running.
Step 2 - Check that no Java process is still bound to the DSE ports using the following command:
$ sudo lsof -i -n -P | grep LISTEN | grep java
NOTE - If the OpsCenter agent is installed, a Java process will show up as being bound to port 61621 (default).
NOTE - The ports used by DSE vary based on the major release version. For example, the ports used by DSE 5.0 are documented in Configuring firewall port access. Check the document specific to the DSE version installed on the node.
Step 3 - Delete the problematic hints file in the hints_directory. By default, hints files are stored in /var/lib/cassandra/hints.
Using the hints file identified in the example error message above, delete the offending hints file and its corresponding CRC file as follows:
$ cd /var/lib/cassandra/hints
$ rm cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints
$ rm cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.crc32
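Where disk space allows, a more cautious variant is to move the two files out of the hints directory instead of deleting them, so that they remain available for later inspection. This is a substitute for the rm commands above, not part of the documented procedure. The function name and the quarantine directory are hypothetical, the quarantine directory should sit outside the hints directory, and the checksum file is assumed to share the base name of the hints file with an extension beginning with .crc.

```shell
# Move a hints file and its checksum companion out of the hints directory.
# Usage: quarantine_hints_file <hints_dir> <file.hints> <quarantine_dir>
quarantine_hints_file() {
    hints_dir="$1"
    hints_file="$2"
    quarantine_dir="$3"
    base="${hints_file%.hints}"

    # Refuse to continue if the named hints file is not where we expect it.
    if [ ! -f "$hints_dir/$hints_file" ]; then
        echo "not found: $hints_dir/$hints_file" >&2
        return 1
    fi

    mkdir -p "$quarantine_dir" || return 1
    mv -- "$hints_dir/$hints_file" "$quarantine_dir/" || return 1

    # The checksum file shares the base name; match any extension starting with .crc
    for crc in "$hints_dir/$base".crc*; do
        if [ -e "$crc" ]; then
            mv -- "$crc" "$quarantine_dir/" || return 1
        fi
    done
    return 0
}
```

With DSE stopped, a call might look like quarantine_hints_file /var/lib/cassandra/hints cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints /var/tmp/hints-quarantine, where the last path is only an example.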
Step 4 - Start DSE on the node.
Step 5 - Monitor the progress of the startup as follows:
$ tail -f /var/log/cassandra/system.log
POST - After the node is back online, you must synchronize all the replicas by running nodetool repair -pr, one node at a time, until all nodes in all data centers have been repaired.
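Because the repairs run one node at a time, a simple sequential loop is enough to drive them. The following sketch assumes SSH access to every node and that nodetool is on the remote PATH. The function name, the host names in the usage note, and the REPAIR_RUNNER variable are hypothetical; the variable exists only so the loop can be dry-run.

```shell
# Run a primary-range repair on each node in turn, stopping at the first failure.
# REPAIR_RUNNER is a hypothetical hook: it defaults to ssh, and setting it to
# "echo" prints the commands instead of running them.
repair_nodes_sequentially() {
    for node in "$@"; do
        echo "Repairing $node" >&2
        # Each call blocks until the repair on that node returns.
        ${REPAIR_RUNNER:-ssh} "$node" nodetool repair -pr || {
            echo "Repair failed on $node; stopping" >&2
            return 1
        }
    done
}
```

A dry run such as REPAIR_RUNNER=echo repair_nodes_sequentially node1.example.com node2.example.com shows the commands that would be issued before any repair is started.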
See also
DataStax Docs - How is data written?
DataStax Docs - Selecting hardware for DataStax Enterprise implementations: Disk Space
DataStax Docs - Hinted Handoff: repair during write path
KB article - Hints file with unknown CFID can cause nodes to fail