Full ERROR Message Example
ERROR [main] 2020-01-23 15:30:01,384 StorageService.java:1527 - Error while waiting on bootstrap to complete. Bootstrap will have to be restarted.
This message is typically followed by an exception that provides more detail about why the error occurred:
java.util.concurrent.ExecutionException: org.apache.cassandra.streaming.StreamException: Stream failed
at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[guava-18.0.jar:na]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[guava-18.0.jar:na]
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1522) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:984) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:692) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:623) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:393) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:465) [dse-core-5.1.2.jar:5.1.2]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:599) [cassandra-all-3.11.0.1758.jar:3.11.0.1758]
at com.datastax.bdp.DseModule.main(DseModule.java:93) [dse-core-5.1.2.jar:5.1.2]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
What does this error mean?
This error means that, while a new node was bootstrapping into the cluster, the streaming of data from other replicas to the new node failed.
Why does this error occur?
Typically, this error is caused by network issues, most commonly TCP connection timeouts on long-running streaming connections.
How do you fix this error?
To make bootstrap streaming more robust, application-layer keep-alives were added to the streaming protocol. They prevent idle incoming connections from timing out and failing the stream session, and they make it possible to detect streams that have hung for a long time.
The cassandra.yaml file contains settings that let you tune the timeouts associated with streaming connections. These settings are as follows:
streaming_keep_alive_period_in_secs
Interval to send keep-alive messages. The stream session fails when a keep-alive message is not received for 2 keep-alive cycles. When not set, the default is 300 seconds (5 minutes) so that a stalled stream times out in 10 minutes.
Default: commented out (300)
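As a rough illustration of the rule above, the following shell sketch reads the keep-alive period from a cassandra.yaml file and computes how long a stalled stream survives before the session fails. The function names are hypothetical helpers written for this example, and the file path is passed in as an argument because its location varies by installation.

```shell
# Print the effective keep-alive period in seconds from a cassandra.yaml file.
# Falls back to 300 when the setting is absent or commented out.
keep_alive_period() (
  value=$(sed -n 's/^[[:space:]]*streaming_keep_alive_period_in_secs:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1)
  echo "${value:-300}"
)

# A stream session fails after 2 keep-alive cycles without a message,
# so the stall timeout is twice the keep-alive period.
stall_timeout_secs() {
  echo $(( $1 * 2 ))
}
```

For example, `stall_timeout_secs "$(keep_alive_period /path/to/cassandra.yaml)"` prints 600 when the setting is left commented out, which matches the 10 minutes described above.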
You can also tune the following throughput settings:
stream_throughput_outbound_megabits_per_sec
Default: 200. Throttle for the throughput of all outbound streaming file transfers on a node. The database does mostly sequential I/O when streaming data during bootstrap or repair. This can saturate the network connection and degrade client (RPC) performance.
inter_dc_stream_throughput_outbound_megabits_per_sec
Default: 200. Throttle for all streaming file transfers between datacenters, and for network stream traffic as configured with stream_throughput_outbound_megabits_per_sec.
Note: Should be set to a value less than or equal to stream_throughput_outbound_megabits_per_sec since it is a subset of total throughput.
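The two throughput limits can also be changed on a running node with nodetool, which avoids a restart; values set this way are not written to cassandra.yaml and do not survive a restart. The sketch below is a dry run only: the helper name is hypothetical, and it checks the constraint from the note above before printing the nodetool commands rather than executing them.

```shell
# Dry run: validate the two limits (megabits per second) and print the
# nodetool commands that would apply them. Nothing is executed.
#   $1 = total outbound stream throughput
#   $2 = inter-datacenter stream throughput (must not exceed $1)
stream_throughput_cmds() {
  if [ "$2" -gt "$1" ]; then
    echo "inter-DC limit ($2) exceeds total stream limit ($1)" >&2
    return 1
  fi
  echo "nodetool setstreamthroughput $1"
  echo "nodetool setinterdcstreamthroughput $2"
}
```

Running `stream_throughput_cmds 200 100` prints the two commands for review. Pipe the output to `sh` on the node only after confirming the values are appropriate for your network.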
Additionally, DataStax recommends tuning network kernel settings to help mitigate the TCP connection timeouts associated with these network issues.
During low traffic intervals, a firewall configured with an idle connection timeout can close connections to local nodes and nodes in other data centers. To prevent connections between nodes from timing out, set the following network kernel settings:
sudo sysctl -w \
net.ipv4.tcp_keepalive_time=60 \
net.ipv4.tcp_keepalive_probes=3 \
net.ipv4.tcp_keepalive_intvl=10
These values start TCP keepalive probing after 60 seconds of idle time and send up to 3 probes at 10-second intervals, so a dead TCP connection is detected after 90 seconds (60 + 10 + 10 + 10). The additional traffic is negligible, and leaving these settings in place permanently is not an issue.
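Values set with sysctl -w apply only until the next reboot. One way to keep them, sketched below, is to write the same three values to a file under /etc/sysctl.d/, which most modern Linux distributions read at boot. The helper name and the file name are assumptions made for this example; adjust them to your own conventions.

```shell
# Print sysctl configuration lines for the keepalive values used above.
keepalive_conf() {
  cat <<'EOF'
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_intvl = 10
EOF
}

# To persist the settings (hypothetical file name), run on each node:
#   keepalive_conf | sudo tee /etc/sysctl.d/99-cassandra-keepalive.conf
#   sudo sysctl --system    # reload all sysctl configuration files
```

Afterwards, `sysctl net.ipv4.tcp_keepalive_time` should report 60 if the file was picked up.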