Linux Performance Monitoring and Troubleshooting
When a Linux system feels slow, the instinct to blame "the application" or "the
hardware" is strong, but the truth usually hides in the data. Effective performance
analysis follows a structured methodology, uses the right tools to gather metrics across
CPU, memory, disk, and network, and reasons from evidence rather than guesswork. This
guide covers the essential command-line tools, the USE method for systematic analysis,
and the perf and strace utilities for deep-dive investigation.
Quick-Look Tools
top and htop
top ships with every Linux installation and gives a real-time, refreshing view of
processes sorted by resource consumption:
top # interactive process viewer
top -bn1 | head -20 # batch mode, one iteration (useful in scripts)
Key fields: %CPU, %MEM, VIRT (virtual memory), RES (resident/physical memory),
S (state: R=running, S=sleeping, D=uninterruptible sleep). Press 1 inside top to
show per-CPU utilisation; M to sort by memory; P by CPU.
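For scripted snapshots, the same ranking top provides interactively can be pulled straight from procfs; a minimal sketch, assuming a standard Linux /proc layout (no ps or top required):

```shell
# Rank processes by resident memory without ps: pull Name and VmRSS
# from each /proc/<pid>/status and print the five largest (RSS in KiB).
for p in /proc/[0-9]*/status; do
    awk '/^Name:/ {n=$2} /^VmRSS:/ {print $2, n}' "$p" 2>/dev/null
done | sort -rn | head -5
```

Kernel threads have no VmRSS line and drop out naturally; the 2>/dev/null absorbs races where a process exits between the glob and the read.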
htop is a more user-friendly replacement with colour, mouse support, tree view, and
easier filtering:
htop # interactive
htop -p 1234,5678 # monitor specific PIDs
vmstat
vmstat provides a concise summary of system-wide virtual memory, CPU, and I/O
activity:
vmstat 1 10 # print stats every 1 second, 10 iterations
Sample output:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 512340 81024 1024320 0 0 12 48 320 580 8 2 89 1 0
Key columns: r (runnable processes, i.e., CPU demand), b (blocked on I/O), si/so
(swap in/out -- should be zero on a healthy system), wa (I/O wait percentage), us
(user CPU), sy (system/kernel CPU).
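The si/so columns are derived from cumulative kernel counters that can also be read directly from /proc/vmstat (counter names as in mainline kernels); a growing delta between two samples means active swapping:

```shell
# Pages swapped in/out since boot; take two readings a second apart
# to get the current rate (vmstat's si/so are computed the same way).
awk '/^pswp(in|out) / {print $1, $2}' /proc/vmstat
```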
iostat
iostat from the sysstat package reports per-device I/O metrics:
iostat -xz 1 # extended stats, skip idle devices, every 1 sec
Critical columns: %util (fraction of time the device was busy; near 100% suggests
saturation for a single-spindle disk, though SSDs can serve requests in parallel
even at 100%), await (avg ms per I/O request including queue time),
r_await/w_await (read/write latency), aqu-sz (average queue depth).
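iostat's per-device numbers come from /proc/diskstats, which can be inspected directly; the field positions below follow the kernel's iostats documentation:

```shell
# Per-device I/O completions since boot: field 3 is the device name,
# field 4 reads completed, field 8 writes completed (loop/ram skipped).
awk '$3 !~ /^(loop|ram)/ {printf "%-10s reads=%s writes=%s\n", $3, $4, $8}' \
    /proc/diskstats
```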
free and /proc/meminfo
free -h # human-readable memory summary
total used free shared buff/cache available
Mem: 31Gi 8.2Gi 1.1Gi 320Mi 22Gi 22Gi
Swap: 8.0Gi 0B 8.0Gi
The available column is the best estimate of memory available for new applications
without swapping. For deeper detail:
head -15 /proc/meminfo
Key entries: MemTotal, MemAvailable, Buffers, Cached, SwapTotal, SwapFree,
Dirty, AnonPages, Slab.
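Since free simply reads /proc/meminfo, the headline number can be computed directly; a small sketch:

```shell
# MemAvailable as a percentage of MemTotal (both reported in KiB).
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END {printf "available: %.1f%% of %d KiB total\n", 100*a/t, t}' /proc/meminfo
```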
Understanding Load Average
uptime
# 14:23:07 up 42 days, load average: 2.15, 1.80, 1.45
The three numbers are the 1-minute, 5-minute, and 15-minute exponentially damped moving averages of the number of tasks in a runnable or uninterruptible state. On a system with 4 CPU cores, a load average of 4.0 means demand roughly matches capacity; values above the core count indicate tasks are waiting for a turn.
Important: on Linux, load average includes processes in uninterruptible I/O sleep (state D), so high load with low CPU can indicate I/O bottlenecks rather than CPU saturation.
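That rule of thumb is easy to script; this sketch compares the 1-minute load from /proc/loadavg against the online CPU count:

```shell
# Warn when the 1-minute load average exceeds the number of online CPUs.
load=$(awk '{print $1}' /proc/loadavg)
cores=$(nproc)
awk -v l="$load" -v c="$cores" 'BEGIN {
    if (l + 0 > c + 0) print "load", l, "exceeds", c, "cores: investigate"
    else               print "load", l, "within", c, "cores"
}'
```

Remember the D-state caveat above: a triggered warning says "investigate", not "add CPUs".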
sar: Historical Performance Data
The sar command (from sysstat) reads data collected by the sa1/sa2 jobs, which by
default run every 10 minutes via cron or a systemd timer:
sar -u 1 5 # CPU usage, 1-second interval, 5 samples
sar -r # memory utilisation for today
sar -d # disk I/O for today
sar -n DEV # network interface stats for today
sar -q # load average and run queue length
sar -f /var/log/sysstat/sa15 # read data from a specific day (the 15th)
sar is invaluable for answering "what happened at 3 AM?" without needing a full
monitoring stack.
Deep-Dive Tools
perf: Hardware Performance Counters
The perf tool interfaces with the kernel's performance counter subsystem to provide
CPU-level profiling:
# See which functions are hottest system-wide (like top for functions)
perf top
# Record a profile of a specific command
perf record -g -- ./my_program
perf report # interactive TUI to browse the profile
# Record a running process for 30 seconds
perf record -g -p 1234 -- sleep 30
perf report
# Count specific events
perf stat -e cache-misses,cache-references,instructions,cycles ./my_program
perf report shows a call-graph (with -g) that reveals exactly where CPU time is
spent, down to the function and even the source line if debug symbols are present.
strace: System Call Tracing
strace intercepts every system call a process makes, which is invaluable for
understanding why a program hangs, crashes, or behaves unexpectedly:
# Trace a running process
strace -p 1234
# Trace a new command and show timing
strace -T -o /tmp/trace.log ls -la /
# Trace only file-related syscalls
strace -e trace=file -p 1234
# Trace only network-related syscalls
strace -e trace=network -p 1234
# Summarise syscall counts and times
strace -c -p 1234
Example use case: a web application is slow. Run strace -c -p <pid> for a few seconds
and discover that 80% of the traced system-call time is spent in futex() calls --
indicating lock contention rather than I/O or CPU.
The USE Method
The USE method, developed by Brendan Gregg, provides a systematic checklist for performance analysis. For every resource (CPU, memory, disk, network interface), check three metrics:
| Metric | Meaning | Example Tool |
|---|---|---|
| Utilisation | Percentage of time the resource is busy | mpstat, iostat %util, sar -n DEV |
| Saturation | Degree to which extra work is queued | vmstat r, iostat aqu-sz, tc -s qdisc (queue drops) |
| Errors | Count of error events | dmesg, ifconfig (errors/drops), smartctl |
Walk through each resource methodically:
- CPU: utilisation via mpstat -P ALL 1, saturation via vmstat r or load average exceeding core count, errors via dmesg (MCE events).
- Memory: utilisation via free -h / sar -r, saturation via vmstat si/so (swap activity), errors via dmesg (OOM killer, ECC events).
- Disk: utilisation via iostat -xz 1 (%util), saturation via iostat aqu-sz, errors via smartctl -a /dev/sda or dmesg.
- Network: utilisation via sar -n DEV (compare to link speed), saturation via netstat -s (retransmits), errors via ip -s link (drops, errors).
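The checklist can be condensed into a one-shot snapshot; this sketch uses only procfs and coreutils, so the chosen signals (run queue, MemAvailable, cumulative swap-outs) are lightweight stand-ins for the fuller tool output above:

```shell
#!/bin/sh
# Minimal USE snapshot: CPU saturation, memory utilisation, swap activity.
printf 'cpu:    running=%s cores=%s\n' \
    "$(cut -d' ' -f4 /proc/loadavg | cut -d/ -f1)" "$(nproc)"
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END {printf "memory: %.0f%% available\n", 100*a/t}' /proc/meminfo
awk '/^pswpout / {print "swap:   pages swapped out since boot:", $2}' /proc/vmstat
```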
This method prevents the common mistake of fixating on one subsystem while the real bottleneck is elsewhere.
Systematic Troubleshooting Workflow
- Define the problem -- "slow" is not specific. Is it latency? Throughput? Error rate? For which users or endpoints?
- Gather baseline metrics -- uptime, dmesg -T | tail, vmstat 1 5, iostat -xz 1 5, free -h.
- Apply the USE method to each resource.
- Drill down with perf, strace, or application-level profiling on the identified bottleneck.
- Validate the fix by re-measuring the same metrics after the change.
# A quick one-liner health check script
echo "=== uptime ===" && uptime && \
echo "=== vmstat ===" && vmstat 1 3 && \
echo "=== iostat ===" && iostat -xz 1 3 && \
echo "=== free ===" && free -h && \
echo "=== dmesg tail ===" && dmesg -T | tail -10
For controlling and signalling the processes you discover during analysis, see Process Management. To tune the kernel parameters that govern the resources you are monitoring, continue to Kernel Tuning.