A customer's Minecraft server was eating 100% IO last winter. iotop showed java doing "something" at 200 MB/s, which is the same as showing me nothing. strace -p on a JVM with 400 threads is a war crime. I gave up after ten minutes, ran one bpftrace one-liner, and inside thirty seconds I had the exact thread ID and the exact file path it was hammering — turned out to be a misconfigured world-backup plugin rewriting the same region file in a loop.
That was the moment bpftrace stopped being "that BPF thing the kernel devs talk about" and became the first tool I reach for. I want to write down the half-dozen one-liners I actually use before I forget them, because I will absolutely forget them next time I rebuild my workstation.
TL;DR
sudo apt install bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
That second line prints every file every process opens, system-wide, in real time. On a busy box it's a firehose — pipe it to grep or add a filter. We'll get to that.
Why bpftrace beats strace and tcpdump on a busy box
strace is built on ptrace. Every syscall the target makes stops it twice, once on entry and once on exit, and each stop means a context switch into the tracer and back. On an idle test VM that's invisible. On a production box doing 50k syscalls a second per process, you'll watch your latency double (or worse) the moment you attach. I've crashed services with strace -f more than once.
tcpdump is fine, but it captures packets — it doesn't tell you which process sent them, and on a TLS-encrypted box it tells you almost nothing about intent. You end up cross-referencing ss -tnp and praying the connection still exists.
bpftrace runs your trace logic inside the kernel via eBPF. The kernel filters and aggregates first; userspace only sees the result. No ptrace, no per-syscall trampoline. The per-event cost is a short JIT-compiled program running at the probe site, which is cheap enough that I leave histograms running for hours on production. That's the entire pitch. Everything else is syntax.
If you want to know what's traceable, run:
sudo bpftrace -l 'tracepoint:*' | head
sudo bpftrace -l 'kprobe:tcp_*'
Tracepoints are meant to be a stable interface: their names and arguments rarely change between kernel versions. kprobes hook arbitrary kernel functions and can break on any kernel upgrade, which matters if you're writing a script you want to keep.
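Before writing a probe it helps to know what a tracepoint's arguments are called. A minimal sketch using bpftrace's verbose listing (-lv); the helper name tp_fields is made up for illustration and is not something bpftrace ships:

```shell
# tp_fields: hypothetical helper, not part of bpftrace.
# Prints the argument names and types of one syscall-entry tracepoint,
# i.e. the fields you can reach as args->NAME inside a probe.
tp_fields() {
  sudo bpftrace -lv "tracepoint:syscalls:sys_enter_$1"
}
```

Calling tp_fields openat should list a filename field among others, which is where the args->filename in the one-liners comes from.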
The one-liners
1. Who opened which file
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%-16s %-6d %s\n", comm, pid, str(args->filename)); }'
This is the one I use most. Replaces strace -e openat and works on every process at once. Add /comm == "nginx"/ after the probe name to filter:
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == "nginx"/ { printf("%s\n", str(args->filename)); }'
When a config-reload "didn't take," this tells you within five seconds whether nginx actually re-read the file you edited. Pair this with Netdata for trend monitoring — bpftrace tells you what's happening now, Netdata tells you what happened last Tuesday at 3am.
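The process name does not have to be hard-coded. A sketch that passes it in as a bpftrace positional parameter, assuming a bpftrace version that supports str($1); the function name who_opens is hypothetical:

```shell
# who_opens: hypothetical wrapper around one-liner #1.
# The $1 inside the single quotes is bpftrace's positional parameter,
# not the shell's; the shell's "$1" at the end is what feeds it.
who_opens() {
  sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat /comm == str($1)/ {
    printf("%-6d %s\n", pid, str(args->filename));
  }' "$1"
}
```

Usage would be who_opens nginx. Keep in mind that comm is truncated to 15 characters by the kernel, so long process names need the truncated form.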
2. Slowest syscalls per process
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@ns[comm] = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
Hit Ctrl-C and bpftrace prints a power-of-two latency histogram (in nanoseconds) per process name. I use this when something "feels slow" but top doesn't show it. A process that's blocked in read() for 200ms shows up here loud and clear; on top it just looks idle.
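Once a process stands out, the same pattern can be narrowed to it and keyed by syscall number instead of process name. This is a sketch, not the author's script; the function name slow_syscalls_pid is hypothetical, and it relies on the id field of the raw_syscalls:sys_exit tracepoint:

```shell
# slow_syscalls_pid: hypothetical variant of one-liner #2.
# Takes a PID (all threads of that process are covered) and prints one
# nanosecond latency histogram per syscall number on Ctrl-C.
slow_syscalls_pid() {
  sudo bpftrace -e '
  tracepoint:raw_syscalls:sys_enter /pid == $1/ { @start[tid] = nsecs; }
  tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @ns[args->id] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }
  END { clear(@start); }' "$1"
}
```

The keys are raw syscall numbers, so you still have to look them up in your architecture's syscall table. The END block only clears the leftover start timestamps so they do not clutter the output.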
3. TCP retransmits
sudo bpftrace -e 'kprobe:tcp_retransmit_skb { printf("retransmit pid=%d comm=%s\n", pid, comm); }'
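If your app is "randomly slow" and the network team swears everything's fine, run this for thirty seconds. If retransmits are happening you'll see them. Treat the pid/comm as a hint rather than proof: retransmits are often fired from a kernel timer, so the task that happens to be on the CPU may have nothing to do with the socket. Once you have a suspect, dig into its sockets with ss -tin. Saved me from a flaky NIC on a Hetzner box that was passing mtr cleanly but dropping every 200th segment under load.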
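If your kernel has the tcp:tcp_retransmit_skb tracepoint (check with sudo bpftrace -l 'tracepoint:tcp:*'), it carries the addresses and ports directly, which makes the ss cross-reference easier. A sketch that prints IPv4 endpoints only; the function name retrans_addrs is hypothetical:

```shell
# retrans_addrs: hypothetical helper, IPv4 only.
# Prints source and destination of every retransmitted segment using the
# fields the tracepoint exposes (saddr/daddr are 4-byte arrays, ports are
# already in host byte order).
retrans_addrs() {
  sudo bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb {
    printf("%s:%d -> %s:%d\n",
           ntop(args->saddr), args->sport,
           ntop(args->daddr), args->dport);
  }'
}
```

For IPv6 traffic you would switch to the saddr_v6 and daddr_v6 fields; that part is left out to keep the sketch short.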
4. Who's calling unlink (file-deletion forensics)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_unlinkat { printf("%-16s %-6d unlink %s\n", comm, pid, str(args->pathname)); }'
"Something is deleting files in /var/lib/foo and I have no idea what." This finds it in one shot. I once caught a logrotate postrotate script killing a socket file every night at 06:25. There's also plain unlink (the older syscall). rm and most modern tools go through unlinkat, but on x86_64 glibc's unlink() still issues the old syscall, so add a second probe on tracepoint:syscalls:sys_enter_unlink (same pathname argument) if you want to be thorough. For the post-mortem side of "where did my disk go," see the disk-full forensics post.
5. Block IO latency histogram
sudo bpftrace -e '
kprobe:blk_account_io_start { @s[arg0] = nsecs; }
kprobe:blk_account_io_done /@s[arg0]/ {
@us = hist((nsecs - @s[arg0]) / 1000);
delete(@s[arg0]);
}'
Per-request block-layer latency, in microseconds. When a customer says "the disk feels slow" this shows me whether it's actually slow (a bimodal histogram with a fat tail above 50ms is bad news) or whether they're just impatient. On NVMe most requests live in the 10–100µs bucket. On a tired SATA SSD doing background GC, you'll see a tail crawl up to tens of milliseconds.
On newer kernels (5.11+ in my experience, depending on the distro) these functions may be renamed or inlined. If the kprobe fails to attach, use the tracepoint:block:block_rq_issue and tracepoint:block:block_rq_complete tracepoints instead, which are stable.
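A sketch of that tracepoint-based fallback. Tracepoints do not hand you the request pointer, so this version keys in-flight requests by device and sector, which is an assumption of the sketch rather than something from the original one-liner. The function name bio_hist_tp is hypothetical:

```shell
# bio_hist_tp: hypothetical tracepoint version of one-liner #5.
# Measures issue-to-completion time per request, in microseconds.
bio_hist_tp() {
  sudo bpftrace -e '
  tracepoint:block:block_rq_issue { @s[args->dev, args->sector] = nsecs; }
  tracepoint:block:block_rq_complete /@s[args->dev, args->sector]/ {
    @us = hist((nsecs - @s[args->dev, args->sector]) / 1000);
    delete(@s[args->dev, args->sector]);
  }
  END { clear(@s); }'
}
```

Note that issue-to-complete is time spent at the device. It leaves out time a request waited in the block-layer queue, so the numbers can come out slightly lower than with the kprobe pair.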
6. exec snoop — every new process
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%-6d %-16s %s\n", pid, comm, str(args->filename)); }'
Leave this running in a tmux pane on a misbehaving box. Every fork+exec shows up. Cron job firing at the wrong time? Visible. Mystery shell script spawning python? Visible. This is also a poor man's intrusion-detection canary; if something pops a reverse shell you'll usually see the chain in here.
7. TCP connect snoop
sudo bpftrace -e 'kprobe:tcp_connect { printf("%-6d %-16s connect\n", pid, comm); }'
Who's making outbound TCP connections, and to where if you parse the sock argument. The pre-baked version that ships with bpftrace (tcpconnect.bt, see below) does the address parsing properly — I rarely write this one from scratch. Useful when an app has a "phone home" you didn't expect, or when you're hunting which container is talking to a DB it shouldn't be.
8. DNS query snoop
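sudo bpftrace -e '
kprobe:udp_sendmsg {
$be = ((struct sock *)arg0)->__sk_common.skc_dport;
$p = ($be >> 8) | (($be << 8) & 0xff00);
if ($p == 53 || $p == 5353) { printf("%-16s %-6d dns query\n", comm, pid); }
}'
Catch every process sending DNS queries. udp_sendmsg takes (sock, msghdr, len), so arg2 is the message length, not a port. The destination port lives in the socket as skc_dport, in network byte order, hence the byte swap; the struct sock cast needs BTF or kernel headers. The filter on port 53 (and 5353 for mDNS) is still rough: it only sees connected UDP sockets (an unconnected sendto() carries its destination in the msghdr instead) and it misses DNS over TCP. Still, even the rough version answers "what is hammering my resolver." I caught a Java app doing 4000 lookups/second because someone disabled the JVM DNS cache.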
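For leaving this running, counting per process is kinder than printing per query. A sketch that reads the destination port from the socket rather than from a probe argument; it assumes BTF (or kernel headers) so that struct sock resolves, and only sees connected UDP sockets. The function name dns_loud is hypothetical:

```shell
# dns_loud: hypothetical counting variant of the DNS snoop.
# skc_dport is stored in network byte order, so the two bytes are swapped
# before comparing. Counts are printed and reset every 10 seconds.
dns_loud() {
  sudo bpftrace -e 'kprobe:udp_sendmsg {
    $sk = (struct sock *)arg0;
    $be = $sk->__sk_common.skc_dport;
    $p = ($be >> 8) | (($be << 8) & 0xff00);
    if ($p == 53 || $p == 5353) { @dns[comm] = count(); }
  }
  interval:s:10 { print(@dns); clear(@dns); }'
}
```

A process that shows up with thousands of queries per interval is the one to look at first.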
/usr/share/bpftrace/tools/: the freebies
Don't write everything from scratch. The bpftrace package ships a /usr/share/bpftrace/tools/ directory full of pre-baked, tested scripts:
- execsnoop.bt: production-quality version of #6
- tcpconnect.bt / tcpaccept.bt: connect/accept with proper address printing
- biolatency.bt: the histogram from #5, prettier
- opensnoop.bt: #1 with arg parsing
- dcsnoop.bt: directory cache lookups
- killsnoop.bt: who is sending which signal to whom
ls /usr/share/bpftrace/tools/ and read them. If your distro put them somewhere else, dpkg -L bpftrace | grep '\.bt$' will tell you where. They're also good to learn the syntax from.
Gotchas
Kernel version. Most of these need 4.9+ for the basic eBPF features and 5.x to be pleasant. Anything older than Ubuntu 20.04 LTS, expect pain. BPF could attach to kprobes from 4.1 and to tracepoints from 4.7, so below that there is nothing for bpftrace to hook. If you're on a CentOS 7 box, give up and use perf or strace with surgical filters.
CO-RE / BTF on older Ubuntu. Compile Once, Run Everywhere relies on BTF, compact type information that a suitably built kernel exposes at /sys/kernel/btf/vmlinux. Ubuntu shipped this properly from 20.10 onwards; on 18.04/20.04 you may need linux-headers-$(uname -r) or to install BTF from the kernel-debug package. If you see "ERROR: BTF: not found" install linux-image-$(uname -r)-dbgsym from the ddebs repo, or upgrade.
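A quick way to tell which situation you are in is to check whether the running kernel exposes BTF at /sys/kernel/btf/vmlinux. A minimal sketch; the function name has_btf and its optional path argument are made up for illustration:

```shell
# has_btf: hypothetical helper. Succeeds if the BTF blob is readable.
# The optional argument overrides the path, which is only useful for testing.
has_btf() {
  [ -r "${1:-/sys/kernel/btf/vmlinux}" ]
}

if has_btf; then
  echo "BTF available"
else
  echo "no BTF: expect to install headers or dbgsym packages"
fi
```

If the file is missing, struct casts such as (struct sock *) in a probe will only work with kernel headers installed.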
bpftrace itself eats CPU if printf is hot. A printf in a probe that fires a million times a second will saturate one core just formatting strings. Aggregate first (@count[comm] = count();) and let bpftrace print the summary on Ctrl-C, or add a filter. The kernel side is cheap; the userspace ring-buffer drain is not.
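The aggregate-first idea applied to the openat firehose, as a sketch: count in the kernel and print a summary every few seconds instead of one line per event. The function name open_counts and the 5-second interval are arbitrary choices for illustration:

```shell
# open_counts: hypothetical helper.
# Counts openat calls per process name in a kernel-side map, then prints
# and resets the map every 5 seconds. No per-event printf.
open_counts() {
  sudo bpftrace -e '
  tracepoint:syscalls:sys_enter_openat { @opens[comm] = count(); }
  interval:s:5 { print(@opens); clear(@opens); }'
}
```

Userspace now wakes up once per interval to read a small map, however many events fired in between.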
Honest disclaimers
- You need root. eBPF needs CAP_BPF + CAP_PERFMON (or just CAP_SYS_ADMIN on older kernels). There's no unprivileged story for tracing.
- You need a modern kernel. Already said it, saying it again.
- It does not replace logs. bpftrace is a live tool. The session is gone when you Ctrl-C. If you need historical data, write to a file or feed aggregates into Prometheus via a separate exporter.
- kprobes are not stable. Function names and signatures change between kernels. Tracepoints are the contract; kprobes are the YOLO.
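On the "write to a file" point: bpftrace's -o flag redirects its output to a file, and an interval probe calling exit() turns a live session into a fixed-length snapshot. A sketch that counts syscalls per process for 60 seconds; the function name syscall_snapshot and the file naming are made up:

```shell
# syscall_snapshot: hypothetical helper.
# Runs for 60 seconds, then exits; the final map dump lands in a
# timestamped file in the current directory (owned by root, since
# bpftrace runs under sudo).
syscall_snapshot() {
  out="syscalls-$(date +%Y%m%dT%H%M%S).txt"
  sudo bpftrace -o "$out" -e '
  tracepoint:raw_syscalls:sys_enter { @calls[comm] = count(); }
  interval:s:60 { exit(); }'
  echo "wrote $out"
}
```

Run from cron or a systemd timer, something like this gives a crude history without any exporter, at the cost of only capturing the windows you scheduled.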
What I keep in ~/scripts/
The honest list of .bt files in my home dir, mostly stolen-and-modified from /usr/share/bpftrace/tools/:
- who-opens.bt: #1 with a /comm/ filter set as a $1 arg
- slow-syscalls.bt: #2 with a printf summary on exit
- retrans.bt: #3 plus the destination address pulled from the sock struct
- unlink-watch.bt: #4 with PID-of-parent printed too, because half the time the answer is "cron"
- bio-hist.bt: #5, runs for 60s then exits, output piped to a file with timestamp
- dns-loud.bt: #8 with a count() aggregator instead of per-query printf, so I can leave it running
That's it. Half a dozen scripts, one apt package, and most of the "what is this box doing?" questions answer themselves in under a minute. If you've been reaching for strace -f -p $(pidof something) on production, please stop. There's a better way and it's been shipping in your kernel for years.