ERROR Message Example
WARN [HintsDispatcher:1] 2019-04-03 02:07:50,165 HintsReader.java:329 - Failed to read a hint for cbe51351-f538-46a0-9f33-ddb8568a702e - digest mismatch for hint at position 52425457 in file cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints
ERROR [HintsDispatcher:1] 2019-04-03 02:07:50,166 HintsDispatchExecutor.java:231 - Failed to dispatch hints file cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints: file is corrupted ({})
org.apache.cassandra.io.FSReadError: java.io.IOException: Digest mismatch exception
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:296) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:261) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.sendHints(HintsDispatcher.java:157) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.sendHintsAndAwait(HintsDispatcher.java:138) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.dispatch(HintsDispatcher.java:123) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsDispatcher.dispatch(HintsDispatcher.java:95) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    ...
Caused by: java.io.IOException: Digest mismatch exception
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNextInternal(HintsReader.java:313) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    at org.apache.cassandra.hints.HintsReader$BuffersIterator.computeNext(HintsReader.java:287) ~[cassandra-all-3.0.14.1862.jar:3.0.14.1862]
    ...
As a result of the failure, startup fails and DSE shuts down:
ERROR [HintsDispatcher:1] 2019-04-03 02:07:50,167 StorageService.java:424 - Stopping gossiper
WARN [HintsDispatcher:1] 2019-04-03 02:07:50,168 StorageService.java:315 - Stopping gossip by operator request
INFO [HintsDispatcher:1] 2019-04-03 02:07:50,168 Gossiper.java:1538 - Announcing shutdown
INFO [HintsDispatcher:1] 2019-04-03 02:07:50,168 StorageService.java:2181 - Node /10.1.2.3 state jump to shutdown
ERROR [HintsDispatcher:1] 2019-04-03 02:07:52,169 StorageService.java:429 - Stopping RPC server
INFO [HintsDispatcher:1] 2019-04-03 02:07:52,169 ThriftServer.java:142 - Stop listening to thrift clients
ERROR [HintsDispatcher:1] 2019-04-03 02:07:52,169 StorageService.java:434 - Stopping native transport
INFO [HintsDispatcher:1] 2019-04-03 02:07:52,172 Server.java:180 - Stop listening for CQL clients
What does this ERROR message mean?
This error means that a file holding hinted handoff messages (hints) is corrupt and cannot be read.
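To see which hints files a node has reported as corrupt, the log can be searched for the "Failed to dispatch hints file" message shown in the example above. The following is a minimal sketch; the function name corrupt_hints_files is hypothetical, and it assumes the message format matches the example log.

```shell
# Print the unique names of hints files reported as corrupted.
# Reads log text from the files given as arguments, or from stdin if none are given.
# Assumes the log message format shown in the example above.
corrupt_hints_files() {
    grep -h 'Failed to dispatch hints file' "$@" \
        | grep -o '[^ /]*\.hints' \
        | sort -u
}
```

For example, running corrupt_hints_files /var/log/cassandra/system.log against the example log would print the single file name that appears in the ERROR line.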
Why does this ERROR occur?
The implementation of Cassandra hints storage was rewritten in Cassandra 3.0 (CASSANDRA-6230), which moved the hint data from the system.hints table to on-disk files.
There are several reasons for hints files to get corrupted. The most common causes are:
- Hardware failure
- Bad disk region preventing reads
- DSE process getting terminated with a kill signal
- Linux oom-killer
- File system running out of disk space
In all of the scenarios above except bad disks, the hints file is not completely flushed ("fsynced") to disk because the file system write was interrupted. As a result, the digest (checksum) computed from the hints file does not match the expected value in the corresponding CRC file, a Cyclic Redundancy Check checksum file used to verify data integrity.
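Two of the causes listed above, a full file system and the Linux oom-killer, can usually be confirmed from the shell. The helpers below are a rough sketch, not a complete diagnosis; the function names are hypothetical, and the exact wording of kernel OOM messages varies between kernel versions.

```shell
# Print the usage percentage (without the % sign) of the file system
# that holds the given directory.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Filter kernel log text on stdin for OOM-killer events that hit a Java process.
# Assumes the kernel logs a "Killed process <pid> (java)" style message.
oom_killed_java() {
    grep -i 'killed process' | grep -i 'java'
}
```

Typical usage would be fs_usage_pct /var/lib/cassandra/hints to check free space on the hints volume, and dmesg | oom_killed_java to look for a recent OOM kill (reading the kernel log may require root privileges).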
How do you fix this ERROR?
A corrupt hints file cannot be recovered directly. Its contents (mutations which did not make it to one or more replicas) can be "sourced" from other replicas in the cluster by running nodetool repair.
Step 1 - Shut down DSE to ensure there are no orphaned DSE threads still running on the affected node.
Step 2 - Check that no Java process is still bound to the DSE ports, using the following command:
$ sudo lsof -i -n -P | grep LISTEN | grep java
NOTE - If the OpsCenter agent is installed, a Java process will show up as bound to port 61621 (default).
NOTE - The ports used by DSE vary based on the major release version. For example, the ports used by DSE 5.0 are documented in Configuring firewall port access. Check the document specific to the DSE version installed on the node.
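To separate DSE listeners from other Java processes such as the OpsCenter agent, the lsof output from Step 2 can be filtered by port. This is a sketch only: the function name dse_listeners is hypothetical, and the port list assumes the common Cassandra defaults (7000, 7199, 9042, 9160), which should be adjusted to match the installed DSE version.

```shell
# Ports to look for. ASSUMPTION: common Cassandra defaults
# (7000 inter-node, 7199 JMX, 9042 CQL native transport, 9160 Thrift).
# Override DSE_PORTS to match the ports documented for your DSE version.
DSE_PORTS="${DSE_PORTS:-7000 7199 9042 9160}"

# Read `lsof -i -n -P` output on stdin and print only the java LISTEN
# lines that are bound to one of the ports in DSE_PORTS.
dse_listeners() {
    awk -v ports="$DSE_PORTS" '
        BEGIN { n = split(ports, p, " ") }
        $1 == "java" && /LISTEN/ {
            for (i = 1; i <= n; i++)
                if ($0 ~ (":" p[i] " ")) { print; break }
        }'
}
```

With this in place, sudo lsof -i -n -P | dse_listeners should print nothing once DSE has fully stopped, even if the OpsCenter agent is still listening on port 61621.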
Step 3 - Delete the problematic hints file in the hints_directory. By default, hints files are stored in /var/lib/cassandra/hints.
Using the hints file identified in the example error message above, delete the offending hints file and its corresponding CRC file as follows:
$ cd /var/lib/cassandra/hints
$ rm cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints
$ rm cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.crc
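A more cautious variant of this step is to move the files out of the hints directory instead of deleting them, so they can be inspected later. The sketch below is an alternative to the rm commands, not part of the documented procedure; the function name quarantine_hints and the backup location are hypothetical. It moves every file that shares the hints file's base name, so the checksum file is picked up whatever its exact extension.

```shell
# Move a hints file, and any files sharing its base name, into a backup directory.
# Usage: quarantine_hints <hints_dir> <hints_file_name> <backup_dir>
# The backup directory should be outside the hints directory.
quarantine_hints() {
    hints_dir=$1
    base=${2%.hints}          # strip the .hints suffix to get the shared base name
    backup_dir=$3
    mkdir -p "$backup_dir" || return 1
    for f in "$hints_dir/$base".*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "$backup_dir"/ || return 1
    done
}
```

For the example above, this could be invoked as quarantine_hints /var/lib/cassandra/hints cbe51351-f538-46a0-9f33-ddb8568a702e-1554123941969-1.hints /var/tmp/corrupt-hints, where /var/tmp/corrupt-hints is an arbitrary location chosen for illustration.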
Step 4 - Start DSE on the node.
Step 5 - Monitor the progress of the startup as follows:
$ tail -f /var/log/cassandra/system.log
POST - After the node is back online, you must synchronize all the replicas by running nodetool repair -pr, one node at a time until all nodes in all data centers have been repaired.
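The one-node-at-a-time repair can be scripted as a simple sequential loop. The following is a minimal sketch under stated assumptions: it assumes passwordless ssh access to every node and that nodetool is on the remote PATH; the function names run_on_node and repair_all_nodes are hypothetical.

```shell
# Run a command on a node.
# ASSUMPTION: passwordless ssh to each node; redefine this function to suit
# your environment.
run_on_node() {
    host=$1
    shift
    ssh "$host" "$@"
}

# Run `nodetool repair -pr` on each given host in turn, waiting for each
# repair to finish and stopping at the first failure.
repair_all_nodes() {
    for node in "$@"; do
        echo "Repairing $node"
        run_on_node "$node" nodetool repair -pr || {
            echo "Repair failed on $node" >&2
            return 1
        }
    done
}
```

It would be called with the addresses of every node in every data center, for example repair_all_nodes 10.1.2.3 10.1.2.4 10.1.2.5 (the last two addresses are placeholders). Stopping at the first failure keeps a failed repair from going unnoticed before the remaining nodes are processed.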