DataStax Help Center

Nodes appear unresponsive due to a Linux futex_wait() kernel bug

Summary

Some customers have reported that nodes intermittently freeze and become unresponsive for no apparent reason.

Symptoms

Unresponsive nodes have the following characteristics:

  • no garbage collection activity in the logs
  • no compactions in progress
  • unable to run nodetool commands
  • no response on native transport, Thrift or JMX ports
  • either low (close to zero) CPU utilisation, or high CPU utilisation which eventually leads to the node becoming unresponsive

In some instances where a customer is able to generate a thread dump, the jstack output is dominated by BLOCKED threads, for example:

Thread 104823: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
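
To gauge how much of a saved thread dump is made up of BLOCKED threads, the state markers can be tallied. The following is a minimal sketch rather than a DataStax tool: the function name count_thread_states and the file name jstack.out are assumptions for illustration, and it expects the "(state = ...)" format shown in the example above.

```shell
# Tally thread states in a saved jstack dump (hypothetical helper).
# $1: path to a file containing jstack output in the format shown above.
# Prints one line per state, most frequent first, e.g. "  42 (state = BLOCKED)".
count_thread_states() {
    grep -o '(state = [A-Z_]*)' "$1" | sort | uniq -c | sort -rn
}

# Example usage, assuming the dump was saved to jstack.out:
# count_thread_states jstack.out
```

If BLOCKED is the first line of the output by a wide margin, the dump matches the symptom described here.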

Cause

The problem is due to a Linux futex_wait() bug which causes user processes to deadlock and hang. A thread that calls futex_wait() can stay blocked forever, and the process that owns it hangs with it. JVM synchronization primitives such as lock(), park() and unpark() all rely on futex calls at some point and fall victim to the bug.

The bug exists in RHEL 6.6, CentOS 6.6 and related Linux distributions. At the time of writing, RHEL 7.x/CentOS 7.x were also affected.

The earlier kernels shipped with RHEL 6.5, CentOS 6.5 (and related distributions) are not affected by the bug.
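
As a first triage step, the release string of a node can be mapped to the statuses listed above. This is a rough sketch only: the function name release_status and its output labels are assumptions for illustration, it covers just the versions named in this article, and it cannot tell whether a fixed kernel has already been installed on a 6.6 or 7.x host.

```shell
# Map a release string to the status described in this article (hypothetical helper).
# Only the versions named in the article are covered; anything else is "unknown".
release_status() {
    case "$1" in
        *"release 6.6"*) echo "possibly-affected" ;;
        *"release 7."*)  echo "possibly-affected" ;;
        *"release 6.5"*) echo "not-affected" ;;
        *)               echo "unknown" ;;
    esac
}

# Example usage on a RHEL or CentOS host:
# release_status "$(cat /etc/redhat-release)"
# uname -r   # the running kernel is what matters, so check it as well
```

A "possibly-affected" result means the installed kernel package should be checked for the fix before drawing any conclusion.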

Solution

Upgrade to a Linux kernel that contains the get_futex_key_refs() fix, such as the kernels shipped in RHEL 6.6.z and CentOS 6.6.z.

The following example checks whether the running kernel package on a RHEL server includes the patches:

$ sudo rpm -q --changelog "kernel-$(uname -r)" | grep futex | grep ref
- [kernel] futex: Mention key referencing differences between shared and private futexes (Larry Woodman) [1167405]
- [kernel] futex: Ensure get_futex_key_refs() always implies a barrier (Larry Woodman) [1167405]
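
The same check can be scripted so that it gives a yes/no answer across many nodes. The sketch below assumes the changelog entry is worded exactly as in the output above; the function name has_futex_fix is hypothetical, and it reads the changelog from standard input so it can be tried without rpm.

```shell
# Succeeds (exit status 0) if the changelog text on stdin mentions the
# barrier fix quoted in this article (hypothetical helper).
has_futex_fix() {
    grep -qF 'futex: Ensure get_futex_key_refs() always implies a barrier'
}

# Example usage on a RHEL or CentOS host:
# if rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix; then
#     echo "fix present in the running kernel package"
# else
#     echo "fix not found; consider upgrading the kernel"
# fi
```

A "fix not found" result only means that this exact changelog line is absent, so a distribution that words the entry differently would need a different search string.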

For further information on distributions which contain the fix, please consult the relevant vendor or distributor of the operating system.

See also

Google Group post - Linux futex_wait() bug

InfoQ article - Serious Red Hat Linux Bug Affects Haswell-based Servers

GitHub commit - futex: Ensure get_futex_key_refs() always implies a barrier
