Summary
During DSE startup or during busy periods, Native Transport Requests (NTRs) are blocked and client connections fail.
Applies to
- DSE 5.0
Scenario and Symptoms
When starting a node or when experiencing high workload volumes, new application connection requests are blocked and result in multiple errors in the system.log, such as:
INFO [epollEventLoopGroup-x-x] Message.java:627 - Unexpected exception during request; channel = [id: 0x095866ac, L:/<local IP>:9042 ! R:/<remote IP>:<Port>]
java.nio.channels.ClosedChannelException: null
An increase in the 'All time blocked' count for Native-Transport-Requests shows in the output of the nodetool tpstats command. Your application may also report multiple failures when trying to connect to the database.
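The blocked count can be extracted directly from the thread pool statistics. The tpstats output below is a fabricated sample with illustrative values so the filter can be demonstrated; on a live node you would run nodetool tpstats itself:

```shell
# Extract the 'All time blocked' count for the Native-Transport-Requests pool.
# On a live node, run:
#   nodetool tpstats | awk '/Native-Transport-Requests/ {print $NF}'
# Sample output (illustrative values only):
tpstats_sample='Pool Name                  Active  Pending  Completed  Blocked  All time blocked
Native-Transport-Requests     128        0   15891574        0              3261'

# $NF is the last field on the matching line, i.e. the all-time blocked count.
echo "$tpstats_sample" | awk '/Native-Transport-Requests/ {print $NF}'
```

If this number keeps growing between successive runs, NTR requests are still being rejected.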
CPU load increases, and the output of the netstat -an command shows multiple connections in the CLOSE_WAIT state on port 9042:
tcp 10 0 192.168.1.101:9042 192.168.1.102:33004 CLOSE_WAIT keepalive (185.97/0/0)
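A quick way to gauge how many sockets are stuck is to summarise the TCP state column of the netstat output. The sample below is fabricated for illustration; on a live node you would pipe the real netstat output through the same filter:

```shell
# Count native-transport (port 9042) connections per TCP state.
# On a live node, run:
#   netstat -an | grep ':9042' | awk '{print $6}' | sort | uniq -c
# Fabricated sample piped in here to demonstrate the pipeline:
netstat_sample='tcp 10 0 192.168.1.101:9042 192.168.1.102:33004 CLOSE_WAIT
tcp  0 0 192.168.1.101:9042 192.168.1.103:33010 ESTABLISHED
tcp 12 0 192.168.1.101:9042 192.168.1.104:33021 CLOSE_WAIT'

# Field 6 is the TCP state; sort + uniq -c yields a per-state count.
echo "$netstat_sample" | awk '{print $6}' | sort | uniq -c
```

A high and growing CLOSE_WAIT count relative to ESTABLISHED connections is consistent with the symptom described above.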
Cause
Applications can flood a node with new connection requests, especially during busy periods. By default, Cassandra is configured to accept 1024 queued NTR connect requests (CASSANDRA-11363). Any additional requests after this queue limit has been reached are rejected, resulting in a blocked NTR and a failed application connect call.
This problem can be exacerbated if the application has been configured to retry when the original connect call is rejected: rejected connections are retried while further new connections are also being rejected, effectively creating a retry storm that compounds the load.
Each socket in CLOSE_WAIT holds its port until the connection times out while awaiting a close confirmation from the application, leaving fewer free ports for new connections and adding load.
Solution
Raise the queue limit by defining max_queued_native_transport_requests in cassandra-env.sh (the JVM_OPTS shell syntax shown below belongs in that file). Set it to a level suitable for your workload, for example:
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"
Restart DSE for the change to take effect.
Note: If ClosedChannelException errors continue to increase rapidly when the problem occurs, the next step is to check the client's Cluster.builder() configuration, specifically any properties concerning reconnection, e.g. ConstantReconnectionPolicy.