A customer's Minecraft server was eating 100% IO last winter. iotop showed java doing "something" at 200 MB/s, which is the same as showing me nothing. strace -p on a JVM with 400 threads is a war crime. I gave up after ten minutes, ran one bpftrace one-liner, and inside thirty seconds I had the exact thread ID and the exact file path it was hammering — turned out to be a misconfigured world-backup plugin rewriting the same region file in a loop.

That was the moment bpftrace stopped being "that BPF thing the kernel devs talk about" and became the first tool I reach for. I want to write down the half-dozen one-liners I actually use before I forget them, because I will absolutely forget them next time I rebuild my workstation.

TL;DR

sudo apt install bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

That second line prints every file every process opens, system-wide, in real time. On a busy box it's a firehose — pipe it to grep or add a filter. We'll get to that.

Why bpftrace beats strace and tcpdump on a busy box

strace is built on ptrace. Every syscall the target makes stops it twice, on entry and on exit, and each stop costs a context switch into the tracer and back. On an idle test VM that's invisible. On a production box doing 50k syscalls a second per process, you'll watch your latency double the moment you attach. I've crashed services with strace -f more than once.

tcpdump is fine, but it captures packets — it doesn't tell you which process sent them, and on a TLS-encrypted box it tells you almost nothing about intent. You end up cross-referencing ss -tnp and praying the connection still exists.

bpftrace runs your trace logic inside the kernel via eBPF. The kernel filters and aggregates first; userspace only sees the result. No ptrace, no per-syscall trampoline. The overhead is roughly "one extra branch on the probe site," which is so cheap I leave histograms running for hours on production. That's the entire pitch. Everything else is syntax.

If you want to know what's traceable, run:

sudo bpftrace -l 'tracepoint:*' | head
sudo bpftrace -l 'kprobe:tcp_*'

Tracepoints are a stable kernel interface that rarely changes between kernel versions. kprobes hook arbitrary kernel functions and can break on any kernel upgrade, which matters if you're writing a script you want to keep.

The one-liners

1. Who opened which file

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%-16s %-6d %s\n", comm, pid, str(args->filename)); }'

This is the one I use most. Replaces strace -e openat and works on every process at once. Add /comm == "nginx"/ after the probe name to filter:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args->filename)); }'

When a config-reload "didn't take," this tells you within five seconds whether nginx actually re-read the file you edited. Pair this with Netdata for trend monitoring — bpftrace tells you what's happening now, Netdata tells you what happened last Tuesday at 3am.
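
If you run the filtered version often, it may be worth saving it as a script that takes the process name as an argument. A minimal sketch, assuming a made-up file name (who-opens-arg.bt); str($1) is how bpftrace reads a positional parameter as a string:

```shell
# Write a reusable script; the file name is only an example.
cat > who-opens-arg.bt <<'EOF'
#!/usr/bin/env bpftrace
// Usage: sudo ./who-opens-arg.bt nginx
// $1 is the first command-line argument; str($1) reads it as a string.
tracepoint:syscalls:sys_enter_openat
/comm == str($1)/
{
  printf("%-16s %-6d %s\n", comm, pid, str(args->filename));
}
EOF
chmod +x who-opens-arg.bt
```

Run it as sudo ./who-opens-arg.bt nginx. The field names after args-> come from the tracepoint definition; sudo bpftrace -lv tracepoint:syscalls:sys_enter_openat lists them.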

2. Slowest syscalls per process

sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @ns[comm] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'

Hit Ctrl-C and bpftrace prints a power-of-two latency histogram per process name. I use this when something "feels slow" but top doesn't show it. A process that's blocked in read() for 200ms shows up here loud and clear; on top it just looks idle.
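
Once the per-process histogram points at a suspect, the next question is usually which syscall is slow. A hedged sketch of one way to narrow it down: the same timing logic, limited to one process name and keyed by syscall number. The file name is hypothetical and the output is in microseconds.

```shell
cat > syscall-lat.bt <<'EOF'
#!/usr/bin/env bpftrace
// Usage: sudo ./syscall-lat.bt java
// Same timing as the one-liner, limited to one process name and
// keyed by syscall number so slow calls can be told apart.
tracepoint:raw_syscalls:sys_enter
/comm == str($1)/
{
  @start[tid] = nsecs;
}

tracepoint:raw_syscalls:sys_exit
/@start[tid]/
{
  @us[args->id] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}

// Drop the bookkeeping map so only the histograms print on exit.
END { clear(@start); }
EOF
chmod +x syscall-lat.bt
```

Syscall numbers are architecture-specific. If auditd's ausyscall happens to be installed, ausyscall --dump prints the number-to-name table for the current machine.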

3. TCP retransmits

sudo bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retransmit pid=%d comm=%s\n", pid, comm); }'

If your app is "randomly slow" and the network team swears everything's fine, run this for thirty seconds. If retransmits are happening you'll see them. Treat the comm/pid as a hint rather than proof: retransmits often fire from timer or softirq context, so they can name whatever happened to be on the CPU. Once you have a suspect, dig into the socket with ss -tin. This saved me from a flaky NIC on a Hetzner box that was passing mtr cleanly but dropping every 200th segment under load.
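
The kernel also has a tracepoint for the same event, which carries the connection's addresses and ports directly. A sketch under two assumptions: the kernel exposes tracepoint:tcp:tcp_retransmit_skb, and the traffic is IPv4. The file name is made up.

```shell
cat > retrans-tp.bt <<'EOF'
#!/usr/bin/env bpftrace
// Assumes tracepoint:tcp:tcp_retransmit_skb exists on this kernel
// and that the connections are IPv4 (saddr/daddr are 4-byte fields).
tracepoint:tcp:tcp_retransmit_skb
{
  time("%H:%M:%S ");
  printf("retransmit %s:%d -> %s:%d\n",
         ntop(args->saddr), args->sport,
         ntop(args->daddr), args->dport);
  // Totals per destination port, printed when the script exits.
  @retrans_by_dport[args->dport] = count();
}
EOF
```

Run it with sudo bpftrace retrans-tp.bt. With a destination address in hand, ss -tin dst followed by that address shows the per-socket retransmit counters.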

4. Who is deleting files

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_unlinkat { printf("%-16s %-6d unlink %s\n", comm, pid, str(args->pathname)); }'

"Something is deleting files in /var/lib/foo and I have no idea what." This finds it in one shot. I once caught a logrotate postrotate script killing a socket file every night at 06:25. There's also unlink (the older syscall) on ancient userspace; modern glibc uses unlinkat. Trace both if you want to be thorough. For the post-mortem side of "where did my disk go," see the disk-full forensics post.

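Tracing both unlink and unlinkat can be done with one probe body, since both tracepoints call the path argument pathname. The sketch below also prints the parent PID. The file name is hypothetical, the task_struct cast assumes BTF or kernel headers are available, and the unlink tracepoint does not exist on every architecture, so attaching may fail there.

```shell
cat > unlink-both.bt <<'EOF'
#!/usr/bin/env bpftrace
// Covers the legacy unlink syscall and unlinkat in one probe body.
// Both tracepoints name the path argument "pathname".
// The task_struct cast assumes BTF (or kernel headers) is available.
tracepoint:syscalls:sys_enter_unlink,
tracepoint:syscalls:sys_enter_unlinkat
{
  $task = (struct task_struct *)curtask;
  printf("%-16s pid=%-6d ppid=%-6d %s\n",
         comm, pid, $task->real_parent->tgid, str(args->pathname));
}
EOF
```

Run it with sudo bpftrace unlink-both.bt and grep the output for the directory you care about.
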
5. Block IO latency histogram

sudo bpftrace -e '
kprobe:blk_account_io_start { @s[arg0] = nsecs; }
kprobe:blk_account_io_done /@s[arg0]/ {
  @us = hist((nsecs - @s[arg0]) / 1000);
  delete(@s[arg0]);
}'

Per-request block-layer latency, in microseconds. When a customer says "the disk feels slow" this shows me whether it's actually slow (a bimodal histogram with a fat tail above 50ms is bad news) or whether they're just impatient. On NVMe most requests live in the 10–100µs bucket. On a tired SATA SSD doing background GC, you'll see a tail crawl up to tens of milliseconds.

On kernels 5.11+ some distros renamed these symbols. If the kprobe fails to attach, try tracepoint:block:block_rq_issue and tracepoint:block:block_rq_complete instead, which are stable.
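
A sketch of what the tracepoint fallback could look like. It pairs requests on (device, start sector), which is an approximation, and it times issue to completion, so it leaves out any time a request spent queued before being issued. The file name is made up.

```shell
cat > bio-tp.bt <<'EOF'
#!/usr/bin/env bpftrace
// Block IO latency from tracepoints instead of kprobes.
// Requests are matched on (device, start sector); this assumes no
// two in-flight requests share both values.
tracepoint:block:block_rq_issue
{
  @s[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@s[args->dev, args->sector]/
{
  @us = hist((nsecs - @s[args->dev, args->sector]) / 1000);
  delete(@s[args->dev, args->sector]);
}

// Drop the bookkeeping map so only the histogram prints on exit.
END { clear(@s); }
EOF
```

Run it with sudo bpftrace bio-tp.bt and hit Ctrl-C for the histogram.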

6. exec snoop — every new process

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%-6d %-16s %s\n", pid, comm, str(args->filename)); }'

Leave this running in a tmux pane on a misbehaving box. Every fork+exec shows up. Cron job firing at the wrong time? Visible. Mystery shell script spawning python? Visible. This is also a poor man's intrusion-detection canary; if something pops a reverse shell you'll usually see the chain in here.

7. TCP connect snoop

sudo bpftrace -e 'kprobe:tcp_connect { printf("%-6d %-16s connect\n", pid, comm); }'

Who's making outbound TCP connections, and to where if you parse the sock argument. The pre-baked version that ships with bpftrace (tcpconnect.bt, see below) does the address parsing properly — I rarely write this one from scratch. Useful when an app has a "phone home" you didn't expect, or when you're hunting which container is talking to a DB it shouldn't be.
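
For a feel of what "parse the sock argument" involves, here is an IPv4-only sketch. It assumes BTF or kernel headers so bpftrace knows struct sock, and a little-endian machine for the port byte swap. The file name is hypothetical; tcpconnect.bt remains the better tool.

```shell
cat > connect-v4.bt <<'EOF'
#!/usr/bin/env bpftrace
// IPv4-only sketch of parsing the sock argument of tcp_connect.
// Needs BTF or kernel headers so bpftrace knows struct sock.
kprobe:tcp_connect
{
  $sk = (struct sock *)arg0;
  // 2 is AF_INET; IPv6 sockets are skipped in this sketch.
  if ($sk->__sk_common.skc_family == 2) {
    $dport = $sk->__sk_common.skc_dport;
    // The port is stored in network byte order; swap the two bytes
    // (assumes a little-endian host).
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
    printf("%-6d %-16s -> %s:%d\n", pid, comm,
           ntop($sk->__sk_common.skc_daddr), $dport);
  }
}
EOF
```

Run it with sudo bpftrace connect-v4.bt.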

8. DNS query snoop

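sudo bpftrace -e 'kprobe:udp_sendmsg { $d = ((struct sock *)arg0)->__sk_common.skc_dport; $d = ($d >> 8) | (($d << 8) & 0xff00); if ($d == 53 || $d == 5353) { printf("%-16s %-6d dns query\n", comm, pid); } }'

Catch every process making a DNS lookup. udp_sendmsg takes the socket, the message header, and the payload length, so the destination port has to be read out of the socket in arg0 and byte-swapped, since it is stored in network order. The filter on port 53 (and 5353 for mDNS) is rough: it only sees connected UDP sockets, and it needs BTF or kernel headers for struct sock. For anything that sends with an explicit address, you need to dereference the sockaddr from the msghdr in arg1 instead. Still, even the rough version answers "what is hammering my resolver." I caught a Java app doing 4000 lookups/second because someone disabled the JVM DNS cache.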

/usr/share/bpftrace/tools/: the freebies

Don't write everything from scratch. The bpftrace package ships a /usr/share/bpftrace/tools/ directory full of pre-baked, tested scripts:

  • execsnoop.bt — production-quality version of #6
  • tcpconnect.bt / tcpaccept.bt — connect/accept with proper address printing
  • biolatency.bt — the histogram from #5, prettier
  • opensnoop.bt — #1 with arg parsing
  • dcsnoop.bt — directory cache lookups
  • killsnoop.bt — who is sending which signal to whom

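Run ls /usr/share/bpftrace/tools/ and read them; they're also a good way to learn the syntax. If the directory isn't there, dpkg -L bpftrace | grep '\.bt$' shows where your distro put the scripts.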

Gotchas

Kernel version. Most of these need 4.9+ for the basic eBPF verifier and 5.x to be pleasant. On anything older than Ubuntu 20.04 LTS, expect pain. Tracepoints work on older kernels than kprobes with arguments do. If you're on a CentOS 7 box, give up and use perf or strace with surgical filters.

CO-RE / BTF on older Ubuntu. Compile Once, Run Everywhere relies on BTF, the compact type information the kernel exports about itself. Ubuntu shipped this properly from 20.10 onwards; on 18.04/20.04 you may need linux-headers-$(uname -r), or BTF from the kernel debug package. If you see "ERROR: BTF: not found", install linux-image-$(uname -r)-dbgsym from the ddebs repo, or upgrade.
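
A quick way to check before installing anything: kernels built with BTF expose it as a file under /sys. A minimal sketch; the function name btf_status is made up.

```shell
# Report whether the running kernel exposes BTF type information.
btf_status() {
  if [ -r /sys/kernel/btf/vmlinux ]; then
    echo "BTF: available"
  else
    echo "BTF: missing (install headers or debug symbols, or upgrade)"
  fi
}
btf_status
```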

bpftrace itself eats CPU if printf is hot. A printf in a probe that fires a million times a second will saturate one core just formatting strings. Aggregate first (@count[comm] = count();) and let bpftrace print the summary on Ctrl-C, or add a filter. The kernel side is cheap; the userspace ring-buffer drain is not.
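
A sketch of the aggregate-first pattern, using the openat probe as a stand-in for any hot probe: count in the kernel, then print and reset the map on a timer. The 5-second interval and the file name are arbitrary choices.

```shell
cat > open-rate.bt <<'EOF'
#!/usr/bin/env bpftrace
// Count in the kernel; userspace only sees one summary per interval.
tracepoint:syscalls:sys_enter_openat
{
  @opens[comm] = count();
}

interval:s:5
{
  time("%H:%M:%S\n");
  print(@opens);
  clear(@opens);
}
EOF
```

Run it with sudo bpftrace open-rate.bt. Each interval prints one line per process name, however many times the probe fired.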

Honest disclaimers

  • You need root. eBPF needs CAP_BPF + CAP_PERFMON (or just CAP_SYS_ADMIN on older kernels). There's no unprivileged story for tracing.
  • You need a modern kernel. Already said it, saying it again.
  • It does not replace logs. bpftrace is a live tool. The session is gone when you Ctrl-C. If you need historical data, write to a file or feed aggregates into Prometheus via a separate exporter.
  • kprobes are not stable. Function names and signatures change between kernels. Tracepoints are the contract; kprobes are the YOLO.
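
For the "write to a file" route, a sketch of a capture that stops by itself: an interval probe calls exit() after 60 seconds, and bpftrace's -o flag sends the output to a file. The probe, the duration, and the file names are placeholders.

```shell
cat > capture-60s.bt <<'EOF'
#!/usr/bin/env bpftrace
// Count syscalls per process for 60 seconds, then exit on its own.
// Maps are printed automatically when bpftrace exits.
tracepoint:raw_syscalls:sys_enter
{
  @syscalls[comm] = count();
}

interval:s:60
{
  exit();
}
EOF
```

Run it as sudo bpftrace -o "syscalls-$(date +%F-%H%M%S).txt" capture-60s.bt to get a timestamped file per run.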

What I keep in ~/scripts/

The honest list of .bt files in my home dir, mostly stolen-and-modified from /usr/share/bpftrace/tools/:

  • who-opens.bt — #1 with a /comm/ filter set as a $1 arg
  • slow-syscalls.bt — #2 with a printf summary on exit
  • retrans.bt — #3 plus the destination address pulled from the sock struct
  • unlink-watch.bt — #4 with PID-of-parent printed too, because half the time the answer is "cron"
  • bio-hist.bt — #5, runs for 60s then exits, output piped to a file with timestamp
  • dns-loud.bt — #8 with a count() aggregator instead of per-query printf, so I can leave it running

That's it. Half a dozen scripts, one apt package, and most of the "what is this box doing?" questions answer themselves in under a minute. If you've been reaching for strace -f -p $(pidof something) on production, please stop. There's a better way and it's been shipping in your kernel for years.

