Summary
This article describes how to identify CPU and IO bottlenecks to assist with diagnosing slow queries or read and write timeouts.
Applies to
Most Linux distributions. These examples were tested with the following versions:
- DSE 6.7, 6.0, 5.1, 5.0
- DDAC
- RHEL 7.5
- Ubuntu 16.04-18.04
Overview
When diagnosing slow queries or read and write timeouts, it helps to observe the state of the system resources. Standard monitoring systems often aggregate data too heavily to explain short-lived events. A widely available tool for inspecting a server directly is iostat, which is typically provided by the sysstat package for your Linux distribution.
Collecting metrics
Run iostat during a busy period when performance problems occur, or during peak load.
- Run iostat -x -c -d -t 1 360 > ~/iostat.txt (one sample per second, 360 samples in total)
- Run lscpu and read the 'Thread(s) per core' output
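The lscpu step above can be scripted so that the value is available to later checks. This is a minimal sketch and not part of the original procedure: threads_per_core is a hypothetical helper name, and it assumes lscpu prints a line that starts with "Thread(s) per core:".

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the "Thread(s) per core" value
# from lscpu output supplied on standard input.
threads_per_core() {
  awk -F: '/^Thread\(s\) per core/ {
    gsub(/[[:space:]]/, "", $2)   # strip the padding around the number
    print $2
  }'
}
```

For example, lscpu | threads_per_core prints 2 on a host that reports two threads per core.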
Finding high CPU usage
Use the information in the following scenarios to identify high CPU and IO bottlenecks.
If 'Thread(s) per core: 2'
Count the number of times that idle+iowait is under 50%. If this occurs more than 18 times, the node is CPU bottlenecked during that time window.
With grep and awk:
grep avg-cpu -A1 ~/iostat.txt | grep -v "avg-cpu" | grep -v "-" | awk '($6+$4)<50.0{printf("%5.1f\n", $6+$4)}' | wc -l
If 'Thread(s) per core: 1'
Count the number of times that idle+iowait is under 10%. If this occurs more than 18 times, the node is CPU bottlenecked during that time window.
With grep and awk:
grep avg-cpu -A1 ~/iostat.txt | grep -v "avg-cpu" | grep -v "-" | awk '($6+$4)<10.0{printf("%5.1f\n", $6+$4)}' | wc -l
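The two CPU checks above differ only in their threshold, so they can be folded into one parameterized helper. The following is a sketch under stated assumptions rather than a definitive implementation: count_cpu_bound and cpu_threshold are hypothetical names, and the awk program assumes the usual iostat layout in which the line after "avg-cpu:" holds %user, %nice, %system, %iowait, %steal and %idle in that order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the lscpu "Thread(s) per core" value to the
# idle+iowait threshold used in this article (2 -> 50, 1 -> 10).
cpu_threshold() {
  case "$1" in
    2) echo 50 ;;
    1) echo 10 ;;
    *) return 1 ;;   # the article gives no guidance for other values
  esac
}

# Hypothetical helper: count samples where %iowait + %idle is below a limit.
# Usage: count_cpu_bound <iostat log> <threshold percent>
count_cpu_bound() {
  awk -v limit="$2" '
    /^avg-cpu/ {
      # The values are on the line that follows the avg-cpu header.
      if ((getline) > 0 && NF >= 6 && ($4 + $6) < (limit + 0)) n++
    }
    END { print n + 0 }
  ' "$1"
}
```

A possible invocation is count_cpu_bound ~/iostat.txt "$(cpu_threshold 2)"; a result above 18 corresponds to the CPU bottleneck condition described above.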
Finding out if the IO is busy
Two metrics help evaluate disk bottlenecks: avgqu-sz (average queue size) and iowait% (the percentage of CPU time spent waiting on I/O).
Finding I/O bottlenecks via iowait
Count the number of times iowait% is over 5%. If this occurs more than 18 times in the log, you have identified a definite IO bottleneck during that time window.
With grep and awk:
grep avg-cpu -A1 ~/iostat.txt | grep -v "avg-cpu" | grep -v "-" | awk '$4>5.0{print $4}' | wc -l
Finding I/O bottlenecks in individual drives
An avgqu-sz over 1 indicates that the device is saturated. Although different IO devices behave differently when saturated, this reading is a good signal that issuing more IO will not improve the performance of the system.
Count the number of times the avgqu-sz of a drive is over 1.0. If this occurs more than 18 times in the log, that drive is persistently saturated.
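The per-drive count can also be done with awk. This is a minimal sketch, not a command from the original article: count_saturated is a hypothetical helper name, and it assumes each device report starts with a header line beginning with "Device" and ends with a blank line. Because the column position varies between sysstat versions, and some newer releases label the column aqu-sz instead of avgqu-sz, the sketch looks the column up by name in the header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each device, count the samples whose average
# queue size exceeds a limit (default 1.0).
# Usage: count_saturated <iostat log> [limit]
count_saturated() {
  awk -v limit="${2:-1.0}" '
    /^Device/ {
      # Locate the queue-size column by name in this report header.
      col = 0
      for (i = 1; i <= NF; i++)
        if ($i == "avgqu-sz" || $i == "aqu-sz") col = i
      in_dev = 1
      next
    }
    NF == 0 { in_dev = 0; next }          # a blank line ends the device block
    in_dev && col && ($col + 0) > (limit + 0) { hits[$1]++ }
    END { for (d in hits) print d, hits[d] }
  ' "$1" | sort
}
```

Running count_saturated ~/iostat.txt prints one line per drive that exceeded the limit at least once, with the device name followed by its count; any count above 18 matches the saturation condition described above.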
See also
As always, DataStax recommends searching the documentation for useful information, including: