Example
ERROR [AntiEntropyStage:1] 2017-10-25 11:57:52,725 RepairMessageVerbHandler.java:170 - Got error, removing parent repair session
It is typically followed by an additional exception like the following:
ERROR [AntiEntropyStage:1] 2017-10-25 11:57:52,727 CassandraDaemon.java:207 - Exception in thread Thread[AntiEntropyStage:1,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Parent repair session with id = b149a8c0-b97b-11e7-93f9-65e9a70ec973 has failed.
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:173) ~[cassandra-all-3.0.12.1656.jar:3.0.12.1656]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67) ~[cassandra-all-3.0.12.1656.jar:3.0.12.1656]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_65]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_65]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_65]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$148(NamedThreadFactory.java:79) [cassandra-all-3.0.12.1656.jar:3.0.12.1656]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_65]
Caused by: java.lang.RuntimeException: Parent repair session with id = b149a8c0-b97b-11e7-93f9-65e9a70ec973 has failed.
at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:400) ~[cassandra-all-3.0.12.1656.jar:3.0.12.1656]
at org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:434) ~[cassandra-all-3.0.12.1656.jar:3.0.12.1656]
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145) ~[cassandra-all-3.0.12.1656.jar:3.0.12.1656]
... 7 common frames omitted
What does this ERROR message mean?
The removing parent repair session error means that an exception occurred during one of the stages of the repair process, causing Cassandra to abort the repair altogether. The error is thrown from a try/catch block in which the incoming repair message is dispatched through a switch statement to the code path for the specific operation. The failing operation could be any of the following:
- Prepare phase
- Snapshotting of existing tables
- Merkle tree generation
- Streaming
- Cleanup
The follow-up ERROR and stack trace may provide additional detail about which phase the ERROR occurred in.
Why does this ERROR occur?
This error typically occurs when trying to start a new repair session on the same range of data before a previous session has completed. You should avoid running multiple repair operations on the same range of data concurrently.
How do you fix this ERROR?
When this ERROR occurs, the repair operation is no longer running; restarting the repair is the only option. To avoid the error, make sure that no previous repair operations are still running. You can check the status of repair operations in several ways.
Log Entries method
You can check the status of repairs by verifying that the repair commands that have started also log messages of completion. A typical start to the repair process will provide an entry as follows:
INFO [Repair-Task-2] 2020-08-12 18:09:47,337 RepairRunnable.java:171 - Starting repair command #1 (01b36c80-dcc7-11ea-aa1a-a781ea849d47), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], runAntiCompaction: false, # of ranges: 3, pull repair: false)
Verify that the repair command also has a finish statement. You can search using either the UUID of the repair (in this example, 01b36c80-dcc7-11ea-aa1a-a781ea849d47) or the repair command number (#1). The start and finish messages print out in sequence. The following log snippet verifies that repair #1, with a UUID of 01b36c80-dcc7-11ea-aa1a-a781ea849d47, has completed:
INFO [RepairJobTask:4] 2020-08-12 18:12:17,658 RepairSession.java:283 - [repair #01cd8430-dcc7-11ea-aa1a-a781ea849d47] Session completed successfully
INFO [RepairJobTask:4] 2020-08-12 18:12:17,669 RepairRunnable.java:286 - Repair session 01cd8430-dcc7-11ea-aa1a-a781ea849d47 for range [(3074457345618258602,-9223372036854775808], (-9223372036854775808,-3074457345618258603], (-3074457345618258603,3074457345618258602]] finished
INFO [RepairJobTask:4] 2020-08-12 18:12:17,700 ActiveRepairService.java:478 - [repair #01b36c80-dcc7-11ea-aa1a-a781ea849d47] Not a global repair, will not do anticompaction
INFO [RepairJobTask:4] 2020-08-12 18:12:17,730 RepairRunnable.java:373 - Repair command #1 finished in 2 minutes 30 seconds
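This check can also be scripted. The following is a minimal sketch that pairs "Starting repair command" entries with their "finished" entries by command number; the function name and log samples are illustrative, assuming the system.log format shown above:

```python
import re

# Matches e.g. "Starting repair command #1 (01b36c80-...), repairing keyspace ..."
START_RE = re.compile(r"Starting repair command #(\d+) \(([0-9a-f-]+)\)")
# Matches e.g. "Repair command #1 finished in 2 minutes 30 seconds"
FINISH_RE = re.compile(r"Repair command #(\d+) finished")

def unfinished_repairs(log_lines):
    """Return {command number: UUID} for repairs that started but never finished."""
    started, finished = {}, set()
    for line in log_lines:
        m = START_RE.search(line)
        if m:
            started[m.group(1)] = m.group(2)
        m = FINISH_RE.search(line)
        if m:
            finished.add(m.group(1))
    return {num: uuid for num, uuid in started.items() if num not in finished}

log = [
    "INFO [Repair-Task-2] 2020-08-12 18:09:47,337 RepairRunnable.java:171 - "
    "Starting repair command #1 (01b36c80-dcc7-11ea-aa1a-a781ea849d47), repairing keyspace keyspace1",
    "INFO [RepairJobTask:4] 2020-08-12 18:12:17,730 RepairRunnable.java:373 - "
    "Repair command #1 finished in 2 minutes 30 seconds",
]
print(unfinished_repairs(log))  # -> {} : repair #1 completed
```

An empty result means every repair command that started also logged completion; any remaining entries identify repairs that may still be running (or that failed without finishing).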
Nodetool methods
nodetool tpstats
A quick check with nodetool tpstats can identify repair operations in progress. In the example below, both the Repair#1 and ValidationExecutor thread pools are currently active.
automaton@ip-10-101-34-118:~$ nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
ReadStage 0 0 227 0 0
ContinuousPagingStage 0 0 0 0 0
Repair#1 1 1 1 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 2131 0 0
MutationStage 0 0 12729319 0 0
GossipStage 0 0 15031 0 0
RequestResponseStage 0 0 4474270 0 0
ReadRepairStage 0 0 28 0 0
CounterMutationStage 0 0 0 0 0
MemtablePostFlush 0 0 388 0 0
ValidationExecutor 2 2 1 0 0
. . .
AntiEntropyStage 0 0 10 0 0
. . .
Both of these pools should show 0 Active and 0 Pending tasks if no repair is running. Note that if repairs are running for different keyspaces or tables, you may still see positive values that would not affect the repair operation you wish to run. Use this as a safe starting point: anything other than 0 should be investigated further.
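As a sketch of that check, the snippet below scans tpstats-style output for the repair-related pools with non-zero Active or Pending counts. The helper name is illustrative, and it assumes the whitespace-separated column layout shown above:

```python
# Pools whose Active/Pending counts indicate repair work in flight.
REPAIR_POOLS = ("Repair", "ValidationExecutor", "AntiEntropyStage")

def busy_repair_pools(tpstats_output):
    """Return {pool: (active, pending)} for repair-related pools with work in flight."""
    busy = {}
    for line in tpstats_output.splitlines():
        parts = line.split()
        # Expect: <PoolName> <Active> <Pending> <Completed> <Blocked> <All time blocked>
        if len(parts) >= 3 and parts[0].startswith(REPAIR_POOLS):
            try:
                active, pending = int(parts[1]), int(parts[2])
            except ValueError:
                continue  # header or malformed line
            if active or pending:
                busy[parts[0]] = (active, pending)
    return busy

sample = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0            227         0                 0
Repair#1                          1         1              1         0                 0
ValidationExecutor                2         2              1         0                 0
AntiEntropyStage                  0         0             10         0                 0
"""
print(busy_repair_pools(sample))  # -> {'Repair#1': (1, 1), 'ValidationExecutor': (2, 2)}
```

An empty dictionary suggests no repair activity on the node; a non-empty one warrants further investigation before starting a new repair.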
nodetool compactionstats
To check whether a repair operation is occurring on the specific keyspace/table combination, use the nodetool compactionstats command to check whether a Validation compaction (Merkle Tree Generation) is currently in progress:
$ nodetool compactionstats
pending tasks: 2
- keyspace1.standard1: 2
id compaction type keyspace table completed total unit progress
03151f60-dcc7-11ea-aa1a-a781ea849d47 Validation keyspace1 standard1 994280134 1820126750 bytes 54.63%
0aa353a0-dcc7-11ea-aa1a-a781ea849d47 Validation keyspace1 standard1 873720508 1820126750 bytes 48.00%
Active compaction remaining time : 0h00m00s
If Validation compactions exist, the node is creating Merkle Trees for comparison against other nodes in order to determine if data needs to be streamed between nodes. In the example above, Validation compactions are occurring for the keyspace1.standard1 table. Make sure no Validation compactions are in process before starting a repair.
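The same check can be automated with a small parser over compactionstats-style output. This is a sketch assuming the column layout shown above; the function name is illustrative:

```python
def validation_compactions(compactionstats_output):
    """Return (keyspace, table, progress) tuples for in-flight Validation compactions."""
    rows = []
    for line in compactionstats_output.splitlines():
        parts = line.split()
        # Expect: <id> <compaction type> <keyspace> <table> <completed> <total> <unit> <progress>
        if len(parts) >= 8 and parts[1] == "Validation":
            rows.append((parts[2], parts[3], parts[7]))
    return rows

sample = """\
pending tasks: 2
- keyspace1.standard1: 2

id                                   compaction type keyspace  table     completed total      unit  progress
03151f60-dcc7-11ea-aa1a-a781ea849d47 Validation      keyspace1 standard1 994280134 1820126750 bytes 54.63%
"""
print(validation_compactions(sample))  # -> [('keyspace1', 'standard1', '54.63%')]
```

An empty list means no Merkle tree generation is currently running for any table on the node.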
nodetool netstats
Finally, the Merkle tree comparison may already have completed, with the node past the Validation compaction stage and streaming data. To make sure that no streaming operations are in progress, use the nodetool netstats command. The following output shows active data streams related to the keyspace1.standard1 repair:
$ nodetool netstats
Mode: NORMAL
Repair 028763b0-cc1e-11e4-a20c-a1d01a3fbf30
/54.174.19.98
Receiving 6 files, 117949006 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-162-Data.db
851792/17950738 bytes(4%) received from /54.174.19.98
Sending 2 files, 47709526 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db
3786324/46561942 bytes(8%) sent to /54.174.19.98
Repair 020ed850-cc1e-11e4-a20c-a1d01a3fbf30
/54.174.245.247
Receiving 4 files, 93304584 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-161-Data.db
6094594/46561942 bytes(13%) received from /54.174.245.247
Sending 2 files, 47709526 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db
34195028/46561942 bytes(73%) sent to /54.174.245.247
Repair 018c88f0-cc1e-11e4-a20c-a1d01a3fbf30
/54.153.39.203 (using /172.31.10.65)
Receiving 3 files, 49959102 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-160-Data.db
9371380/46561942 bytes(20%) received from /54.153.39.203
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-tmp-jb-159-Data.db
2533414/2533414 bytes(100%) received from /54.153.39.203
Sending 2 files, 47709526 bytes total
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-158-Data.db
1147584/1147584 bytes(100%) sent to /54.153.39.203
/var/lib/cassandra/data/Keyspace1/Standard1/Keyspace1-Standard1-jb-157-Data.db
46561942/46561942 bytes(100%) sent to /54.153.39.203
Read Repair Statistics:
Attempted: 12
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 1 2 1
Small messages n/a 0 12786143 36337
Gossip messages n/a 0 15708 52
You should see output similar to the following if no repair streams are in progress:
$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 11
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 1 2 1
Small messages n/a 0 12786105 36337
Gossip messages n/a 0 15025 52
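A simple programmatic version of this check looks for active Repair stream sessions in netstats-style output. This is a sketch assuming the formats shown above; the function name and trimmed samples are illustrative:

```python
def repair_streams_active(netstats_output):
    """True if the netstats output lists any active Repair stream sessions."""
    # Stream sessions appear as lines like "Repair 028763b0-cc1e-11e4-...";
    # "Read Repair Statistics:" does not match because it starts with "Read".
    return any(line.strip().startswith("Repair ")
               for line in netstats_output.splitlines())

busy = "Mode: NORMAL\nRepair 028763b0-cc1e-11e4-a20c-a1d01a3fbf30\n    /54.174.19.98"
idle = "Mode: NORMAL\nNot sending any streams.\nRead Repair Statistics:\nAttempted: 11"
print(repair_streams_active(busy), repair_streams_active(idle))  # -> True False
```

If this returns False on every replica involved, no repair streaming is in flight and it should be safe to start a new repair on that range.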
Using the --trace option with nodetool repair
If repairs continue to fail, use the --trace option when running the repair. This runs the repair with tracing enabled, printing debug-level output to STDOUT and recording the trace in the system_traces keyspace. With this information, you should be able to identify the specific portion of the repair operation where it failed. This information, along with the logs from the various replicas, is valuable to include if you need to file a support ticket.