Summary
This article discusses an issue where the JVM can run out of native threads as a result of a high number of connections to the cluster.
Symptoms
In situations where there is a high number of client connections to the Cassandra client (Thrift) port (default 9160), a node under heavy load can start to report a large number of connection errors in the system.log, similar to the following:
ERROR [Thrift:36300] 2017-04-21 08:06:37,938 CustomTThreadPoolServer.java:224 - Error occurred during processing of message.
java.lang.RuntimeException: Failed to open server transport: unknown
        at com.datastax.bdp.transport.server.TNegotiatingServerTransport$Factory.getTransport(TNegotiatingServerTransport.java:507) ~[dse-4.7.5.jar:4.7.5]
        at com.datastax.bdp.transport.server.TNegotiatingServerTransport$Factory.getTransport(TNegotiatingServerTransport.java:395) ~[dse-4.7.5.jar:4.7.5]
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:197) ~[cassandra-all-2.1.11.908.jar:4.7.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) ~[libthrift-0.9.3.jar:0.9.3]
        at com.datastax.bdp.transport.server.TPreviewableTransport.readUntilEof(TPreviewableTransport.java:66) ~[dse-4.7.5.jar:4.7.5]
        at com.datastax.bdp.transport.server.TPreviewableTransport.preview(TPreviewableTransport.java:42) ~[dse-4.7.5.jar:4.7.5]
        at com.datastax.bdp.transport.server.TNegotiatingServerTransport.open(TNegotiatingServerTransport.java:174) ~[dse-4.7.5.jar:4.7.5]
        at com.datastax.bdp.transport.server.TNegotiatingServerTransport$Factory.getTransport(TNegotiatingServerTransport.java:499) ~[dse-4.7.5.jar:4.7.5]
        ... 5 common frames omitted
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:209) ~[na:1.8.0_60]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_60]
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[na:1.8.0_60]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[na:1.8.0_60]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_60]
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ~[libthrift-0.9.3.jar:0.9.3]
        ... 9 common frames omitted
A DSE node can get overwhelmed and eventually stop processing requests. The node may appear down or unresponsive to other nodes and/or clients, with an OutOfMemoryError reported in the system.log:
ERROR [Thread-11] 2017-04-21 08:26:02,850 CassandraDaemon.java:227 - Exception in thread Thread[Thread-11,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method) ~[na:1.8.0_60]
        at java.lang.Thread.start(Thread.java:714) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) ~[na:1.8.0_60]
        at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113) ~[cassandra-all-2.1.11.908.jar:4.7.5]
        at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:137) ~[cassandra-all-2.1.11.908.jar:4.7.5]
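To confirm that a node is approaching the operating system thread limit, compare the number of native threads held by the DSE process with the limit it is actually running under. The sketch below is one way to do this on Linux; the pgrep pattern used to locate the DSE JVM is an assumption and may need adjusting for your environment:
$ DSE_PID=$(pgrep -f CassandraDaemon | head -n 1)
$ grep Threads /proc/$DSE_PID/status
$ grep "Max processes" /proc/$DSE_PID/limits
The first grep reports the JVM's current native thread count and the second reports the per-user process/thread limit it started with; the OutOfMemoryError above is commonly seen when the first number approaches the second.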
Cause
Each client connection results in a thread being allocated from the pool of available threads. When a request completes, the thread is released back to the pool to serve subsequent requests.
If the node is unable to keep up with incoming requests, a sustained high number of connections can eventually exhaust the available native threads, at which point the node becomes unresponsive to new requests.
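As a rough illustration of this relationship on a running node, the number of established connections on the Thrift port can be compared with the number of native threads held by the DSE JVM; the two tend to grow together. This is only a sketch, and the pgrep pattern used to locate the process is an assumption:
$ netstat -an | grep 9160 | grep -c ESTABLISHED
$ ps -o nlwp= -p $(pgrep -f CassandraDaemon | head -n 1)
The first command counts established client connections on port 9160 and the second reports the total number of threads (nlwp) in the DSE process.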
Solution
Ensure that the nodes are configured as per the Recommended production settings for Apache Cassandra. In particular, ensure that the maximum number of user processes (nproc) is set to 32768.
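One quick way to check this limit is to run the following as the user that starts DSE and confirm it reports at least 32768:
$ ulimit -u
If it does not, the limit is typically raised with an nproc entry in /etc/security/limits.conf (or a file under /etc/security/limits.d/), for example (the cassandra user name here is an assumption and depends on how DSE is run in your environment):
cassandra - nproc 32768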
Monitor the connections coming into each node using utilities such as netstat, for example:
$ netstat -an | grep 9160
Validate the source clients to confirm that the connections are expected, i.e. coming from known applications and not from a rogue process.
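To see where the connections are coming from, the established connections on the Thrift port can be grouped by source address, for example (this sketch assumes Linux netstat output and IPv4 client addresses):
$ netstat -an | grep 9160 | grep ESTABLISHED | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
Each line of the output shows a connection count followed by the client IP address, which makes it easy to spot an application server or rogue process holding an unusually large number of connections.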
Finally, if the number of client connections consistently exceeds what the existing nodes can serve, consider scaling out the cluster by adding more nodes to increase capacity.