This article discusses a workaround to reduce the impact of streaming errors or failures on the cluster. This workaround is not intended as a replacement for good capacity planning and reliable network connections.
When performing maintenance operations such as repairs, bootstrapping or decommissions, streaming sessions can fail under certain conditions. Here is a sample error reported in the
ERROR [STREAM-OUT-/10.1.2.3] 2015-10-21 13:36:43,665 StreamSession.java:502 - [Stream #830b0380-77f8-11e5-9db5-950bfc64178a] Streaming error occurred
java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.7.0_51]
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:433) ~[na:1.7.0_51]
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:565) ~[na:1.7.0_51]
    ...
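To check how often this error occurs and which peers are involved, the log can be scanned for the first line of the sample above. The following is a minimal sketch, not an official tool: the pattern is modelled only on that one sample line, the function name find_stream_errors is hypothetical, and the STREAM-IN alternative is an assumption.

```python
import re

# Pattern modelled on the single sample log line above.
# Assumption: inbound sessions log with a STREAM-IN prefix in the same layout.
STREAM_ERROR = re.compile(
    r"ERROR \[STREAM-(?P<direction>IN|OUT)-/(?P<peer>[^\]]+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"\[Stream #(?P<stream_id>[0-9a-f-]+)\] Streaming error occurred"
)


def find_stream_errors(lines):
    """Return one dict per log line that reports a streaming error."""
    errors = []
    for line in lines:
        match = STREAM_ERROR.search(line)
        if match:
            errors.append(match.groupdict())
    return errors


sample = (
    "ERROR [STREAM-OUT-/10.1.2.3] 2015-10-21 13:36:43,665 StreamSession.java:502 - "
    "[Stream #830b0380-77f8-11e5-9db5-950bfc64178a] Streaming error occurred"
)
for error in find_stream_errors([sample]):
    print(error["peer"], error["stream_id"])
```

Grouping the results by peer can help show whether failures cluster around one unstable link or one overloaded node.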
This article does NOT apply in the following situations:
- failed streaming due to incorrect SSL configuration
- failed streaming due to corrupt SSTables
- failed streaming due to schema disagreement
Streaming failures reported as above are most commonly caused by network interruptions from unstable connections between source and target nodes.
Another cause is the source or target node becoming unresponsive because it is under load, e.g. JVM pauses from constant garbage-collection pressure.
This is not a fix for the underlying problems outlined above. It is simply an attempt to minimise the impact of streaming failures on cluster operations.
Step 1 - For the first node in the cluster, set the following property in
NOTE - In earlier versions of Cassandra, the default value for this property was 0. This results in one of two things: (a) when a stream hangs, it never times out, requiring a restart to clear the session, or (b) when a stream is interrupted, it is immediately marked as failed and is never retried.
CASSANDRA-8611 changed the default behaviour in Cassandra 2.1.10 (included in DataStax Enterprise 4.7.4 and 4.8.1) to a timeout value of 3600000 ms (1 hour). This forces the stream to time out and be retried 3 times before finally being marked as failed.
CASSANDRA-11840 suggests using a larger timeout of 48 hours (172800000 ms) for Cassandra 2.1.15. This has proved successful in situations where there are larger amounts of data to stream.
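Because the timeout is expressed in milliseconds, it is easy to drop or add a zero when choosing a value. The sketch below simply reproduces the arithmetic behind the two values quoted above; the helper names are hypothetical.

```python
MS_PER_HOUR = 60 * 60 * 1000  # 60 minutes x 60 seconds x 1000 ms


def hours_to_ms(hours):
    """Convert a timeout in hours to milliseconds."""
    return int(hours * MS_PER_HOUR)


def ms_to_hours(ms):
    """Convert a timeout in milliseconds back to hours."""
    return ms / MS_PER_HOUR


print(hours_to_ms(1))          # 3600000, the default since CASSANDRA-8611
print(hours_to_ms(48))         # 172800000, the value suggested in CASSANDRA-11840
print(ms_to_hours(172800000))  # 48.0
```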
Step 2 - Restart DSE on the node.
Step 3 - Repeat steps 1 and 2 one node at a time until all nodes in the cluster have been reconfigured.
Step 4 - Attempt to run the operation (e.g. bootstrap, repair, decommission) again.
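Step 1 can be scripted so that the same edit is applied consistently on every node. The following is a minimal sketch that works on the text of a flat YAML file and does not use a YAML parser. The function name set_yaml_property is hypothetical, and the property name streaming_socket_timeout_in_ms used in the example is an assumption inferred from the CASSANDRA-8611 reference; confirm the name and the file location for your version before using anything like this.

```python
import re


def set_yaml_property(text, key, value):
    """Return text with a top-level `key: value` line set.

    An active line is replaced first; failing that, a commented-out line;
    failing that, the setting is appended at the end.
    """
    new_line = f"{key}: {value}"
    active = re.compile(r"^" + re.escape(key) + r":.*$", re.MULTILINE)
    commented = re.compile(r"^#[ \t]*" + re.escape(key) + r":.*$", re.MULTILINE)
    for pattern in (active, commented):
        if pattern.search(text):
            # A callable replacement avoids backslash interpretation in new_line.
            return pattern.sub(lambda _m: new_line, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + new_line + "\n"


# Assumed property name; the sample file content is made up for illustration.
original = "cluster_name: test\n# streaming_socket_timeout_in_ms: 0\n"
print(set_yaml_property(original, "streaming_socket_timeout_in_ms", 3600000))
```

The function only transforms a string, so the result can be reviewed or diffed before it is written back and the node is restarted as in Step 2.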
As stated above, the workaround provided is not intended as a fix. For optimal operation of the cluster, ensure the following:
- underlying network is stable and reliable
- cluster capacity is correctly sized to minimise occurrences of unresponsive nodes
- node density is kept as close as possible to 500GB per node
Cassandra configuration - The cassandra.yaml configuration file