Full ERROR Message Example
ERROR [ValidationExecutor:36363] 2020-10-10 13:14:15,678 Validator.java:288 - Failed to move local merkle tree for [repair #6c456001-d19d-11ea-95ac-f555c96a5a0f on table/schema_updates, [(106602477818869041,118992989813163204], (6208714737812037719,6209511195928266782], (6213123279160086627,6213279401331079729]]] off heap
What does this ERROR message mean?
This error can occur during certain steps of the repair process, when Cassandra attempts to move the Merkle trees off heap and invokes the message handler. If that operation fails, the error above is logged and treated as a disk I/O exception; the associated stack trace may or may not be printed after the error message.
Why does this ERROR occur?
This error occurs when disk I/O problems prevent the Merkle trees from being written to disk.
How do you fix this ERROR?
The underlying disk I/O issue may be a real disk problem, or a false positive caused by slow disk responses due to I/O overload or a low number of available IOPS. First, check that the disk has no problems (using a utility such as "fsck"). Then check that the node/disk was not overloaded at the time and that enough IOPS are available (see the minimum IOPS guidelines for the release and platform you are using).
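As a rough sketch of the first step, you can scan the kernel log for disk I/O errors before committing to a full fsck of the filesystem. The log path and the sample lines below are illustrative only; on a real node you would inspect `dmesg` or /var/log/syslog:

```shell
# Hypothetical kernel log excerpt; on a real node inspect `dmesg` or /var/log/syslog.
cat > /tmp/sample_kern.log <<'EOF'
kernel: blk_update_request: I/O error, dev sda, sector 123456
kernel: EXT4-fs error (device sda1): ext4_find_entry:1455: inode #2: comm cassandra
kernel: sda: recovered media error
EOF

# Count lines that indicate a disk I/O error; anything greater than 0 warrants
# an fsck of the affected (unmounted) filesystem.
grep -c 'I/O error' /tmp/sample_kern.log   # → 1
```

A non-zero count here points to a genuine hardware or filesystem problem rather than simple I/O overload.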
Once the disk I/O problem is identified and addressed, restart the repair. Before doing so, make sure that previous repair operations are no longer running; you can check the status of repair operations in several ways.
Log Entries method
You can check the status of repairs by verifying that the repair commands that have started also log messages of completion. A typical start to the repair process will provide an entry as follows:
INFO [Repair-Task-2] 2020-08-12 18:09:47,337 RepairRunnable.java:171 - Starting repair command #1 (01b36c80-dcc7-11ea-aa1a-a781ea849d47), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], runAntiCompaction: false, # of ranges: 3, pull repair: false)
Verify that the repair command also has a finish statement. You can check by using the UUID of the repair (in this example: 01b36c80-dcc7-11ea-aa1a-a781ea849d47) or the repair # (#1). The two messages will print out in sequence. The following log snippet verifies that the repair #1 operation with a UUID of 01b36c80-dcc7-11ea-aa1a-a781ea849d47 has completed:
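A quick way to pair the start and finish entries is to grep the system log for the repair command number (or UUID). The snippet below runs against a trimmed sample log; on a real node the log is typically /var/log/cassandra/system.log (path may vary by installation):

```shell
# Trimmed sample standing in for /var/log/cassandra/system.log
cat > /tmp/system.log <<'EOF'
INFO  [Repair-Task-2] 2020-08-12 18:09:47,337 RepairRunnable.java:171 - Starting repair command #1 (01b36c80-dcc7-11ea-aa1a-a781ea849d47), repairing keyspace keyspace1
INFO  [RepairJobTask:4] 2020-08-12 18:12:17,730 RepairRunnable.java:373 - Repair command #1 finished in 2 minutes 30 seconds
EOF

# A "Starting" line plus a "finished" line for the same command number
# means the repair completed.
grep -c 'repair command #1'          /tmp/system.log   # start entries  → 1
grep -c 'Repair command #1 finished' /tmp/system.log   # finish entries → 1
```

If the finish count is zero, the repair either is still running or was interrupted, and you should check the thread pools as described below.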
INFO [RepairJobTask:4] 2020-08-12 18:12:17,658 RepairSession.java:283 - [repair #01cd8430-dcc7-11ea-aa1a-a781ea849d47] Session completed successfully
INFO [RepairJobTask:4] 2020-08-12 18:12:17,669 RepairRunnable.java:286 - Repair session 01cd8430-dcc7-11ea-aa1a-a781ea849d47 for range [(3074457345618258602,-9223372036854775808], (-9223372036854775808,-3074457345618258603], (-3074457345618258603,3074457345618258602]] finished
INFO [RepairJobTask:4] 2020-08-12 18:12:17,700 ActiveRepairService.java:478 - [repair #01b36c80-dcc7-11ea-aa1a-a781ea849d47] Not a global repair, will not do anticompaction
INFO [RepairJobTask:4] 2020-08-12 18:12:17,730 RepairRunnable.java:373 - Repair command #1 finished in 2 minutes 30 seconds
Nodetool methods
A simple check on nodetool tpstats can quickly identify repair operations in progress. In the example below, the Repair#1 thread pool is currently active as well as the ValidationExecutor thread pool.
automaton@ip-10-101-34-118:~$ nodetool tpstats
Pool Name                        Active   Pending   Completed   Blocked   All time blocked
ReadStage                             0         0         227         0                  0
ContinuousPagingStage                 0         0           0         0                  0
Repair#1                              1         1           1         0                  0
MiscStage                             0         0           0         0                  0
CompactionExecutor                    0         0        2131         0                  0
MutationStage                         0         0    12729319         0                  0
GossipStage                           0         0       15031         0                  0
RequestResponseStage                  0         0     4474270         0                  0
ReadRepairStage                       0         0          28         0                  0
CounterMutationStage                  0         0           0         0                  0
MemtablePostFlush                     0         0         388         0                  0
ValidationExecutor                    2         2           1         0                  0
MemtableFlushWriter                   0         0         111         0                  0
ViewMutationStage                     0         0           0         0                  0
CacheCleanupExecutor                  0         0           0         0                  0
MemtableReclaimMemory                 0         0         111         0                  0
PendingRangeCalculator                0         0           4         0                  0
AntiCompactionExecutor                0         0           0         0                  0
SecondaryIndexManagement              0         0           0         0                  0
HintsDispatcher                       0         0          32         0                  0
Native-Transport-Requests             0         0     2228195         0                301
MigrationStage                        0         0          23         0                  0
PerDiskMemtableFlushWriter_0          0         0         110         0                  0
Sampler                               0         0           0         0                  0
InternalResponseStage                 0         0     2428882         0                  0
AntiEntropyStage                      0         0          10         0                  0

Message type           Dropped   Latency waiting in queue (micros)
                                       50%       95%       99%        Max
READ                         0        0.00      0.00      0.00       0.00
RANGE_SLICE                  0         N/A       N/A       N/A        N/A
_TRACE                       0         N/A       N/A       N/A        N/A
HINT                      3067        0.00      0.00  10090.81  322381.14
MUTATION               2177721        0.00      0.00      0.00       0.00
COUNTER_MUTATION             0         N/A       N/A       N/A        N/A
BATCH_STORE                  0         N/A       N/A       N/A        N/A
BATCH_REMOVE                 0         N/A       N/A       N/A        N/A
REQUEST_RESPONSE             0        0.00      0.00      0.00       0.00
PAGED_RANGE                  0         N/A       N/A       N/A        N/A
READ_REPAIR                  0         N/A       N/A       N/A        N/A
Both of these pools should show Active and Pending values of 0 if no repair is running. However, if repairs are running for other keyspaces or tables, you may still see positive values that do not affect the repair operation you wish to run. Treat 0 as the safe baseline; any other value should be investigated further.
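One way to automate this baseline check is to pipe the tpstats output through awk and sum the Active and Pending columns for the repair-related pools. The snippet below uses a captured sample rather than a live node, so the pool names and numbers are taken from the example output above:

```shell
# Captured sample of `nodetool tpstats` output (only a few pools shown).
cat > /tmp/tpstats.txt <<'EOF'
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0            227         0                 0
Repair#1                          1         1              1         0                 0
ValidationExecutor                2         2              1         0                 0
EOF

# Sum Active ($2) + Pending ($3) for repair-related pools; a non-zero total
# means a repair is still in flight on this node.
awk '/^Repair#|^ValidationExecutor|^AntiEntropyStage/ {sum += $2 + $3} END {print sum}' /tmp/tpstats.txt   # → 6
```

On a live node you would replace the sample file with `nodetool tpstats |` piped straight into the same awk filter.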
To check whether a repair operation is occurring on the specific keyspace/table combination, use the nodetool compactionstats command to check whether a Validation compaction (Merkle Tree Generation) is currently in progress:
$ nodetool compactionstats
pending tasks: 2
- keyspace1.standard1: 2

id                                   compaction type   keyspace    table       completed   total        unit    progress
03151f60-dcc7-11ea-aa1a-a781ea849d47 Validation        keyspace1   standard1   994280134   1820126750   bytes   54.63%
0aa353a0-dcc7-11ea-aa1a-a781ea849d47 Validation        keyspace1   standard1   873720508   1820126750   bytes   48.00%
Active compaction remaining time : 0h00m00s
If Validation compactions exist, the node is creating Merkle trees for comparison against other nodes, in order to determine whether data needs to be streamed between nodes. In the example above, Validation compactions are occurring for the keyspace1.standard1 table. Make sure no Validation compactions are in progress before starting a repair.
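A simple way to gate a repair script on this check is to count the Validation rows for the target table in the compactionstats output. The sample file below stands in for a live `nodetool compactionstats` call, with rows copied from the example above:

```shell
# Captured sample of `nodetool compactionstats` output.
cat > /tmp/compactionstats.txt <<'EOF'
pending tasks: 2
03151f60-dcc7-11ea-aa1a-a781ea849d47 Validation keyspace1 standard1 994280134 1820126750 bytes 54.63%
0aa353a0-dcc7-11ea-aa1a-a781ea849d47 Validation keyspace1 standard1 873720508 1820126750 bytes 48.00%
EOF

# Count in-progress Validation compactions for the target keyspace/table;
# wait until this reaches 0 before starting a new repair.
grep -c 'Validation.*keyspace1.*standard1' /tmp/compactionstats.txt   # → 2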
Finally, the Merkle tree comparison may already have occurred and the node may be past the Validation compaction stage. To make sure that no streaming operations are in progress, use the nodetool netstats command. The following output shows active data streams related to the keyspace1.standard1 repair:
$ nodetool netstats
Mode: NORMAL
Repair 028763b0-cc1e-11e4-a20c-a1d01a3fbf30
    /54.174.19.98
        Receiving 6 files, 117949006 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-162-Data.db 851792/17950738 bytes(4%) received from /54.174.19.98
        Sending 2 files, 47709526 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db 3786324/46561942 bytes(8%) sent to /54.174.19.98
Repair 020ed850-cc1e-11e4-a20c-a1d01a3fbf30
    /54.174.245.247
        Receiving 4 files, 93304584 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-161-Data.db 6094594/46561942 bytes(13%) received from /54.174.245.247
        Sending 2 files, 47709526 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db 34195028/46561942 bytes(73%) sent to /54.174.245.247
Repair 018c88f0-cc1e-11e4-a20c-a1d01a3fbf30
    /54.153.39.203 (using /172.31.10.65)
        Receiving 3 files, 49959102 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-160-Data.db 9371380/46561942 bytes(20%) received from /54.153.39.203
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-159-Data.db 2533414/2533414 bytes(100%) received from /54.153.39.203
        Sending 2 files, 47709526 bytes total
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-158-Data.db 1147584/1147584 bytes(100%) sent to /54.153.39.203
            /var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db 46561942/46561942 bytes(100%) sent to /54.153.39.203
Read Repair Statistics:
Attempted: 12
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         1              2         1
Small messages                  n/a         0       12786143     36337
Gossip messages                 n/a         0          15708        52
You should see output similar to the following if no repair streams are in progress:
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 11
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         1              2         1
Small messages                  n/a         0       12786105     36337
Gossip messages                 n/a         0          15025        52
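To script this final check, you can test the netstats output for the "Not sending any streams" marker and, more conservatively, for the absence of Receiving/Sending lines. The sample input below mirrors the idle output above; on a live node you would capture `nodetool netstats` output instead:

```shell
# Idle sample standing in for live `nodetool netstats` output.
cat > /tmp/netstats.txt <<'EOF'
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 11
EOF

# Prints "idle" only when no repair streams are active; otherwise "streaming".
if grep -q 'Not sending any streams' /tmp/netstats.txt \
   && ! grep -qE 'Receiving|Sending [0-9]' /tmp/netstats.txt; then
  echo idle
else
  echo streaming
fi
```

Once this reports idle, the tpstats pools are quiet, and no Validation compactions remain, it is safe to restart the repair.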