DataStax Help Center

FAQ - How to reduce the impact of streaming errors or failures

Summary

This article discusses a workaround to reduce the impact of streaming errors or failures on the cluster. This workaround is not intended as a replacement for good capacity planning and reliable network connections.

Symptoms

When performing maintenance operations such as repairs, bootstraps or decommissions, streaming sessions can sometimes fail under certain conditions. Here is a sample error reported in the system.log:

ERROR [STREAM-OUT-/10.1.2.3] 2015-10-21 13:36:43,665  StreamSession.java:502 - [Stream #830b0380-77f8-11e5-9db5-950bfc64178a] Streaming error occurred
java.io.IOException: Broken pipe
    at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.7.0_51]
    at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:433) ~[na:1.7.0_51]
    at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:565) ~[na:1.7.0_51]
...

This article does NOT apply in the following situations:

  • failed streaming due to incorrect SSL configuration
  • failed streaming due to corrupt SSTables
  • failed streaming due to schema disagreement

Cause

Streaming failures reported as above are most commonly caused by network interruptions from unstable connections between source and target nodes.

Another cause is when either the source or target node becomes unresponsive because it is under load, e.g. during JVM pauses caused by constant garbage-collection pressure.
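To gauge how widespread the failures are, search the system.log on each node for the error shown above. The following is a minimal sketch; it is demonstrated here against a captured log excerpt, and on a live node LOG should point at the real system.log (commonly /var/log/cassandra/system.log on package installs, though the path depends on the installation):

```shell
# Create a sample log excerpt to demonstrate against; on a real node,
# set LOG to the actual system.log path instead.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
ERROR [STREAM-OUT-/10.1.2.3] 2015-10-21 13:36:43,665  StreamSession.java:502 - [Stream #830b0380-77f8-11e5-9db5-950bfc64178a] Streaming error occurred
EOF

# Number of streaming errors logged.
errors=$(grep -c 'Streaming error occurred' "$LOG")
# Distinct stream session IDs involved, to see how many sessions failed.
sessions=$(grep -o 'Stream #[0-9a-f-]*' "$LOG" | sort -u)
echo "errors: $errors"
echo "$sessions"
rm -f "$LOG"
```

Repeating this on both ends of a failed session (the IP in the thread name, e.g. STREAM-OUT-/10.1.2.3, identifies the peer) helps distinguish a single flaky link from a cluster-wide problem.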

Workaround

This is not a fix for the underlying problems outlined above. It is simply an attempt to minimise the impact of streaming failures on cluster operations.

Step 1 - For the first node in the cluster, set the following property in cassandra.yaml:

streaming_socket_timeout_in_ms: <value>

NOTE - In earlier versions of Cassandra, the default value for this property was 0. This results in one of two things: (a) when a stream hangs, it never times out, requiring a restart to clear the session, or (b) when a stream is interrupted, it is immediately marked as failed and never retried.

CASSANDRA-8611 changed the default behaviour in Cassandra 2.1.10 (included in DataStax Enterprise 4.7.4 and 4.8.1) with a timeout value of 3600000 ms (1 hour). This forces the stream to time out and be retried up to 3 times before finally being marked as failed.

CASSANDRA-11840 suggests using a larger timeout of 48 hours (172800000 ms) for Cassandra 2.1.15. This has proved successful in situations where larger amounts of data need to be streamed.
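Applied with sed, Step 1 might look like the following sketch. It is demonstrated here on a temporary copy of the file; on a real node substitute the actual cassandra.yaml path (e.g. /etc/cassandra/cassandra.yaml or /etc/dse/cassandra/cassandra.yaml, depending on the install type), and note that the in-place -i flag below is GNU sed syntax:

```shell
# Demonstrate the edit on a temporary copy of cassandra.yaml; on a
# real node, set YAML to the actual configuration file path.
YAML=$(mktemp)
printf 'streaming_socket_timeout_in_ms: 0\n' > "$YAML"

# 48 hours expressed in milliseconds, per CASSANDRA-11840.
TIMEOUT_MS=$((48 * 60 * 60 * 1000))   # 172800000

# Rewrite the existing property line with the new value.
sed -i "s/^streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: ${TIMEOUT_MS}/" "$YAML"

result=$(cat "$YAML")
echo "$result"
rm -f "$YAML"
```

If the property is commented out or absent in the file, append it instead of relying on the substitution matching an existing line.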

Step 2 - Restart DSE on the node.

Step 3 - Repeat steps 1 and 2 one node at a time until all nodes in the cluster have been reconfigured.

Step 4 - Attempt to run the operation (e.g. bootstrap, repair, decommission) again.
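Once the operation is running again, stream progress can be checked with nodetool netstats. The following sketch wraps a simple check around its output; the parsing assumes the 2.1-era output format (idle nodes report "Not sending any streams.", active streams report files being sent or received), which may vary between versions. It is demonstrated here against captured idle output:

```shell
# Return success (0) if netstats output mentions files being sent or
# received, failure (1) otherwise. Parsing is based on the 2.1-era
# output format and may need adjusting for other versions.
netstats_has_active_streams() {
  grep -Eq 'Receiving [0-9]+ files|Sending [0-9]+ files'
}

# Demonstrated against captured idle output; on a live node, pipe the
# real output instead:  nodetool netstats | netstats_has_active_streams
if printf 'Mode: NORMAL\nNot sending any streams.\n' | netstats_has_active_streams; then
  status="streams still running"
else
  status="no active streams"
fi
echo "$status"
```

Running this periodically on the affected nodes shows whether a retried session is making progress rather than waiting for another failure in system.log.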

Solution

As stated above, the workaround provided is not intended as a fix. For optimal operation of the cluster, ensure the following:

  • underlying network is stable and reliable
  • cluster capacity is correctly sized to minimise occurrences of unresponsive nodes
  • keep node density as close to 500GB per node as possible

See also

Cassandra configuration - The cassandra.yaml configuration file

Cassandra JIRA - CASSANDRA-8611 Give streaming_socket_timeout_in_ms a non-zero default

Cassandra JIRA - CASSANDRA-11840 Set a more conservative default to streaming_socket_timeout_in_ms
