Some customers have reported that nodes randomly freeze and become unresponsive for no apparent reason.
Unresponsive nodes have the following characteristics:
- no garbage collection activity in the logs
- no compactions in progress
- unable to run
- no response on native transport, Thrift or JMX ports
- low or close to zero CPU utilisation
- in other cases, high CPU utilisation which eventually leads to the node becoming unresponsive
In some instances where a customer is able to generate a thread dump, the jstack output is dominated by BLOCKED threads, for example:
Thread 104823: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
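To judge whether a dump really is dominated by BLOCKED threads, the thread states can be tallied. The following is a minimal Python sketch, not part of any DataStax tooling. It assumes the dump uses the "Thread <id>: (state = ...)" header format shown above. The function names and the file name threads.txt are hypothetical.

```python
import re
from collections import Counter

# Matches the "(state = BLOCKED)" marker that follows each thread header.
# Assumption: states are written in upper case with optional underscores.
STATE_RE = re.compile(r"\(state = ([A-Z_]+)\)")


def count_thread_states(dump_text):
    """Return a Counter mapping each thread state to its number of threads."""
    return Counter(STATE_RE.findall(dump_text))


def dominant_state(dump_text):
    """Return (state, share_of_threads) for the most common state, or None."""
    counts = count_thread_states(dump_text)
    if not counts:
        return None
    state, n = counts.most_common(1)[0]
    return state, n / sum(counts.values())
```

For example, after saving the jstack output to threads.txt, calling dominant_state(open("threads.txt").read()) returns the most common state and the fraction of threads in it. A high share of BLOCKED threads matches the symptom described here but does not by itself confirm the cause.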
The problem is due to a Linux futex_wait() bug which causes user processes to deadlock and hang. A futex_wait() call, and the process making it, can stay blocked forever. JVM synchronization calls such as unpark() make futex_wait() calls at some point and so fall victim to the bug.
The bug exists in RHEL 6.6, CentOS 6.6 and related Linux distributions. At the time of writing, RHEL 7.x/CentOS 7.x were also affected.
Earlier kernels in RHEL 6.5, CentOS 6.5 (and related distributions) are not affected by the bug.
Upgrade to a Linux kernel containing the get_futex_key_refs() fix, such as those in RHEL 6.6.z and CentOS 6.6.z.
The following example checks the running kernel on a RHEL server for the installed patches:
$ sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
- [kernel] futex: Mention key referencing differences between shared and private futexes (Larry Woodman)
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman)
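When many nodes need checking, the same test can be scripted. The Python sketch below is illustrative only. It assumes an RPM-based system where the running kernel's package is named kernel-<uname -r>, as in the command above. It also assumes the changelog entry is worded exactly as in the sample output. The function names are hypothetical.

```python
import platform
import subprocess

# Changelog subject of the fix, copied from the sample rpm output above.
# Assumption: a vendor backport may word this differently.
FIX_SUBJECT = "futex: Ensure get_futex_key_refs() always implies a barrier"


def has_futex_barrier_fix(changelog_text):
    """Return True if any changelog line mentions the futex barrier fix."""
    return any(FIX_SUBJECT in line for line in changelog_text.splitlines())


def running_kernel_changelog():
    """Return the RPM changelog of the running kernel package."""
    package = "kernel-" + platform.release()  # same value as `uname -r`
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a node, has_futex_barrier_fix(running_kernel_changelog()) reports whether the entry is present. A False result means only that this exact wording was not found, so the vendor's advisories remain the authoritative source.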
For further information on distributions which contain the fix, please consult the relevant vendor or distributor of the operating system.
Google Group post - Linux futex_wait() bug
InfoQ article - Serious Red Hat Linux Bug Affects Haswell-based Servers
GitHub commit - futex: Ensure get_futex_key_refs() always implies a barrier