Linux Performance Monitoring and Troubleshooting

When a Linux system feels slow, the instinct to blame "the application" or "the hardware" is strong, but the truth usually hides in the data. Effective performance analysis follows a structured methodology, uses the right tools to gather metrics across CPU, memory, disk, and network, and reasons from evidence rather than guesswork. This guide covers the essential command-line tools, the USE method for systematic analysis, and the perf and strace utilities for deep-dive investigation.

Quick-Look Tools

top and htop

top ships with every Linux installation and gives a real-time, refreshing view of processes sorted by resource consumption:

top                            # interactive process viewer
top -bn1 | head -20            # batch mode, one iteration (useful in scripts)

Key fields: %CPU, %MEM, VIRT (virtual memory), RES (resident/physical memory), S (state: R=running, S=sleeping, D=uninterruptible sleep). Press 1 inside top to show per-CPU utilisation; M to sort by memory; P by CPU.
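For scripted checks, similar information can be pulled non-interactively with ps instead of top's batch mode; a minimal sketch (GNU ps options assumed):

```shell
# List the five busiest processes by CPU, non-interactively.
# Note: ps %cpu is a lifetime average, not the instantaneous figure top shows.
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -6
```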

htop is a more user-friendly replacement with colour, mouse support, tree view, and easier filtering:

htop                           # interactive
htop -p 1234,5678              # monitor specific PIDs

vmstat

vmstat provides a concise summary of system-wide virtual memory, CPU, and I/O activity:

vmstat 1 10                    # print stats every 1 second, 10 iterations

Sample output:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 512340  81024 1024320   0    0    12    48  320  580  8  2 89  1  0

Key columns: r (runnable processes, i.e., CPU demand), b (blocked on I/O), si/so (swap in/out -- should be zero on a healthy system), wa (I/O wait percentage), us (user CPU), sy (system/kernel CPU).
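The si/so columns come from the kernel's cumulative swap counters, which can also be read directly from /proc/vmstat; a small sketch that flags swap activity over a one-second window:

```shell
#!/bin/sh
# Sum the cumulative swap-in/out page counters, sample twice, and compare.
# A growing value means the system is actively swapping.
read_swaps() { awk '/^pswpin|^pswpout/ {s += $2} END {print s+0}' /proc/vmstat; }
before=$(read_swaps)
sleep 1
after=$(read_swaps)
if [ "$after" -gt "$before" ]; then
    echo "swapping: $((after - before)) pages in the last second"
else
    echo "no swap activity"
fi
```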

iostat

iostat from the sysstat package reports per-device I/O metrics:

iostat -xz 1                   # extended stats, skip idle devices, every 1 sec

Critical columns: %util (how busy the device is; 100% means saturated), await (avg ms per I/O request including queue time), r_await/w_await (read/write latency), aqu-sz (average queue depth).
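If sysstat is not installed, the raw counters behind iostat can be inspected in /proc/diskstats; a rough sketch, with field positions taken from the kernel's iostats documentation (field 13 is the milliseconds the device spent doing I/O, the basis of %util):

```shell
# Print read/write completions and busy time per real block device.
# Fields: $3 name, $4 reads completed, $8 writes completed, $13 ms doing I/O.
awk '$3 !~ /^(loop|ram)/ {printf "%-10s reads=%-10s writes=%-10s io_ms=%s\n", $3, $4, $8, $13}' \
    /proc/diskstats | head -5
```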

free and /proc/meminfo

free -h                        # human-readable memory summary
              total   used   free  shared  buff/cache  available
Mem:           31Gi   8.2Gi  1.1Gi  320Mi      22Gi      22Gi
Swap:         8.0Gi     0B   8.0Gi

The available column is the best estimate of memory available for new applications without swapping. For deeper detail:

head -15 /proc/meminfo

Key entries: MemTotal, MemAvailable, Buffers, Cached, SwapTotal, SwapFree, Dirty, AnonPages, Slab.
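The MemAvailable estimate (present since kernel 3.14) can be turned into a quick percentage with awk; a minimal sketch:

```shell
# Report how much of total RAM the kernel estimates is available
# for new workloads without swapping (MemAvailable, kernel 3.14+).
awk '/^MemTotal:/     {total = $2}
     /^MemAvailable:/ {avail = $2}
     END {printf "available: %.1f%% of %d MiB total\n", 100*avail/total, total/1024}' \
    /proc/meminfo
```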

Understanding Load Average

uptime
# 14:23:07 up 42 days, load average: 2.15, 1.80, 1.45

The three numbers are the 1-, 5-, and 15-minute exponentially damped moving averages of the number of tasks in a runnable or uninterruptible state. On a system with 4 CPU cores, a sustained load average of 4.0 means there are, on average, exactly enough tasks to keep every core busy; values above the core count indicate tasks are waiting for a turn.

Important: on Linux, load average includes processes in uninterruptible I/O sleep (state D), so high load with low CPU can indicate I/O bottlenecks rather than CPU saturation.
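The ratio of load to CPU count is the number that actually matters; a minimal check using /proc/loadavg and nproc:

```shell
# Compare the 1-minute load average to the number of online CPUs.
# A sustained ratio above 1.0 means tasks are queueing.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load1" -v c="$cores" \
    'BEGIN {printf "load %.2f on %d CPUs (ratio %.2f)\n", l, c, l/c}'
```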

sar: Historical Performance Data

The sar command (from sysstat) reads data collected in the background: the sa1 job samples system activity every 10 minutes by default, and sa2 writes a daily summary report:

sar -u 1 5                     # CPU usage, 1-second interval, 5 samples
sar -r                         # memory utilisation for today
sar -d                         # disk I/O for today
sar -n DEV                     # network interface stats for today
sar -q                         # load average and run queue length
sar -f /var/log/sysstat/sa15   # read a specific day (the 15th; the path is /var/log/sa on RHEL-family systems)

sar is invaluable for answering "what happened at 3 AM?" without needing a full monitoring stack.

Deep-Dive Tools

perf: Hardware Performance Counters

The perf tool interfaces with the kernel's performance counter subsystem to provide CPU-level profiling:

# See which functions are hottest system-wide (like top for functions)
perf top

# Record a profile of a specific command
perf record -g -- ./my_program
perf report                    # interactive TUI to browse the profile

# Record a running process for 30 seconds
perf record -g -p 1234 -- sleep 30
perf report

# Count specific events
perf stat -e cache-misses,cache-references,instructions,cycles ./my_program

perf report shows a call-graph (with -g) that reveals exactly where CPU time is spent, down to the function and even the source line if debug symbols are present.

strace: System Call Tracing

strace intercepts every system call a process makes, which is invaluable for understanding why a program hangs, crashes, or behaves unexpectedly:

# Trace a running process
strace -p 1234

# Trace a new command and show timing
strace -T -o /tmp/trace.log ls -la /

# Trace only file-related syscalls
strace -e trace=file -p 1234

# Trace only network-related syscalls
strace -e trace=network -p 1234

# Summarise syscall counts and times
strace -c -p 1234

Example use case: a web application is slow. Run strace -c -p <pid> for a few seconds and discover that 80% of the time spent in system calls goes to futex() -- indicating lock contention rather than I/O or CPU. Bear in mind that strace imposes significant overhead on the traced process, so avoid leaving it attached to production workloads longer than necessary.

The USE Method

The USE method, developed by Brendan Gregg, provides a systematic checklist for performance analysis. For every resource (CPU, memory, disk, network interface), check three metrics:

Metric        Meaning                                    Example Tools
Utilisation   Percentage of time the resource is busy    mpstat, iostat %util, sar -n DEV
Saturation    Degree to which extra work is queued       vmstat r, iostat aqu-sz, netstat -s (retransmits)
Errors        Count of error events                      dmesg, ip -s link (errors/drops), smartctl

Walk through each resource methodically:

  1. CPU: utilisation via mpstat -P ALL 1, saturation via vmstat r or load average exceeding core count, errors via dmesg (MCE events).
  2. Memory: utilisation via free -h / sar -r, saturation via vmstat si/so (swap activity), errors via dmesg (OOM killer, ECC events).
  3. Disk: utilisation via iostat -xz 1 (%util), saturation via iostat aqu-sz, errors via smartctl -a /dev/sda or dmesg.
  4. Network: utilisation via sar -n DEV (compare to link speed), saturation via netstat -s (retransmits), errors via ip -s link (drops, errors).

This method prevents the common mistake of fixating on one subsystem while the real bottleneck is elsewhere.
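The CPU row of the checklist can be computed from /proc alone when the usual tools are missing; a sketch assuming a Linux /proc layout:

```shell
#!/bin/sh
# CPU utilisation (USE "U") over a 1-second window, from /proc/stat.
# Busy = user+nice+system+irq+softirq+steal; idle = idle+iowait.
cpu_sample() { awk '/^cpu / {print $2+$3+$4+$7+$8+$9, $5+$6}' /proc/stat; }
set -- $(cpu_sample); busy1=$1; idle1=$2
sleep 1
set -- $(cpu_sample); busy2=$1; idle2=$2
busy=$((busy2 - busy1)); idle=$((idle2 - idle1))
echo "CPU utilisation: $(( 100 * busy / (busy + idle) ))%"
# Saturation (USE "S"): runnable tasks vs CPUs, field 4 of /proc/loadavg.
echo "runnable/total: $(cut -d' ' -f4 /proc/loadavg) on $(nproc) CPUs"
# Errors (USE "E"): check dmesg / ip -s link; these may require privileges.
```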

Systematic Troubleshooting Workflow

  1. Define the problem -- "slow" is not specific. Is it latency? Throughput? Error rate? For which users or endpoints?
  2. Gather baseline metrics -- uptime, dmesg -T | tail, vmstat 1 5, iostat -xz 1 5, free -h.
  3. Apply the USE method to each resource.
  4. Drill down with perf, strace, or application-level profiling on the identified bottleneck.
  5. Validate the fix by re-measuring the same metrics after the change.

# A quick health-check command chain to paste at the start of an investigation
echo "=== uptime ===" && uptime && \
echo "=== vmstat ===" && vmstat 1 3 && \
echo "=== iostat ===" && iostat -xz 1 3 && \
echo "=== free ===" && free -h && \
echo "=== dmesg tail ===" && dmesg -T | tail -10

For controlling and signalling the processes you discover during analysis, see Process Management. To tune the kernel parameters that govern the resources you are monitoring, continue to Kernel Tuning.

Back to the Linux overview.