ERROR Message Example
ERROR [Repair-Task:1] 2019-06-21 06:40:44,895 SystemDistributedKeyspace.java:406 - Error executing query INSERT INTO system_distributed.parent_repair_history (parent_id, keyspace_name, columnfamily_names, requested_ranges, started_at, options) VALUES (11111111-0000-0000-0000-888888888888, 'system_auth', { 'roles','role_permissions','role_members' }, { '(1607483561684771030,1656713833712314075]' }, toTimestamp(now()), { 'trace': 'false','forceRepair': 'false','hosts': '','parallelism': 'parallel','dataCenters': '','previewKind': 'NONE','incremental': 'false','pullRepair': 'false','primaryRange': 'false','jobThreads': '1' })
The error message typically produces a stack trace similar to the following (the stack trace will differ based on version):
org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.AbstractWriteHandler$1.lambda$subscribeActual$0(AbstractWriteHandler.java:158)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:221)
    at org.apache.cassandra.service.AbstractWriteHandler$TimeoutAction.accept(AbstractWriteHandler.java:216)
    at org.apache.cassandra.concurrent.TPCTimeoutTask.run(TPCTimeoutTask.java:43)
    at org.apache.cassandra.concurrent.TPCHashedWheelTimer.lambda$onTimeout$0(TPCHashedWheelTimer.java:43)
    at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:498)
    at org.apache.cassandra.utils.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:573)
    at org.apache.cassandra.utils.HashedWheelTimer$Worker.run(HashedWheelTimer.java:329)
    at org.apache.cassandra.concurrent.TPCRunnable.run(TPCRunnable.java:68)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.process(EpollTPCEventLoopGroup.java:920)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processTasks(EpollTPCEventLoopGroup.java:892)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.runScheduledTasks(EpollTPCEventLoopGroup.java:980)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.processEvents(EpollTPCEventLoopGroup.java:774)
    at org.apache.cassandra.concurrent.EpollTPCEventLoopGroup$SingleCoreEventLoop.run(EpollTPCEventLoopGroup.java:441)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
What does this ERROR message mean?
This error is generated by repair tasks. Repair tasks keep track of the repair session status in the system_distributed.parent_repair_history and system_distributed.repair_history tables. The repair tasks write the repair session information to these two tables with a consistency level of ONE (CL=ONE). The error indicates that an update or insert query against these two tables from a repair task failed because the consistency level (CL) could not be met.
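As a quick check, the repair session records written to these tables can be inspected directly with cqlsh. For example (only the columns that appear in the INSERT statement above are selected):
cqlsh -e "SELECT parent_id, keyspace_name, columnfamily_names, started_at FROM system_distributed.parent_repair_history LIMIT 10;"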
Why does this ERROR occur?
The error typically occurs due to the following reasons:
- Overloaded nodes
- Network communication issues between the nodes
When nodes become unresponsive due to load or communication issues, the update or insert queries against these two tables will fail because the consistency level (CL) cannot be met.
How do you fix this ERROR?
When this error occurs, it generally indicates that the nodes in the cluster are not responsive. Users may also observe slow or failing application queries.
Overloaded nodes
Examine the system.log for signs that the nodes in the cluster were overloaded around the time the error started to occur.
These signs can include dropped messages, long GC pauses, and so on. For example:
INFO [ScheduledTasks:1] 2020-05-23 14:09:20,509 MessagingService.java:1273 - READ messages were dropped in last 5000 ms: 2300 internal and 136 cross node. Mean internal dropped latency: 5430 ms and Mean cross-node dropped latency: 5960 ms
WARN [Service Thread] 2020-05-23 14:09:15,508 GCInspector.java:282 - G1 Young Generation GC in 5170ms. G1 Eden Space: 18035507200 -> 0; G1 Old Gen: 12280584520 -> 26468408336; G1 Survivor Space: 1132462080 -> 662700032;
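A quick way to find entries like these is to grep the log for the relevant keywords, assuming the default log location of /var/log/cassandra:
grep -E "dropped|GCInspector" /var/log/cassandra/system.log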
If the nodes in the cluster are overloaded, it is necessary to throttle the workload, review the access patterns (for example, whether expensive queries are being run), or add resources/nodes to better suit the cluster's needs.
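To help confirm whether a node is overloaded, nodetool tpstats can be run on each node; it reports pending and blocked thread pool tasks as well as a per-message-type dropped count at the end of its output:
nodetool tpstats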
Network issues
Check the output of nodetool status from all the nodes to see if any node shows DN status, for example:
--  Address         Load      Tokens  Owns  Host ID                               Rack
DN  10.100.100.100  4.38 GiB  64      ?     fdfc950d-6381-4c43-9bfc-ec567b06f360  rack1
Examine the system.log or debug.log for any gossip issues, for example:
INFO [GossipTasks:1] 2020-01-04 03:55:49,320 Gossiper.java:1205 - InetAddress /10.100.100.101 is now DOWN
DEBUG [InternalResponseStage:13] 2020-07-02 05:24:15,203 Gossiper.java:1213 - Failed to receive echo reply from /10.100.100.101
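These gossip events can be located with a grep using the message fragments shown above, again assuming the default log location of /var/log/cassandra:
grep -E "is now DOWN|Failed to receive echo reply" /var/log/cassandra/system.log /var/log/cassandra/debug.log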
To rule out network issues, run the following tests between the nodes to verify connectivity:
ping
ping <ip-address of the down node>
telnet
telnet <ip-address of the down node> 7000
OR
telnet <ip-address of the down node> 7001
(if node-to-node encryption is enabled)
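If telnet is not installed on the nodes, netcat can be used for the same port check (using ports 7000/7001 as above):
nc -zv <ip-address of the down node> 7000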
If either of the above commands fails, further investigation at the network layer will be required.