Preamble
Concurrent bugs laugh at printf. Ordering issues hide until load, timing, and deployment topology align just wrong. September is the month to stop pretending heroics scale: systematic methods and canonical books belong in the same toolkit as traces and benchmarks.
Agans, Debugging
Agans’ habits—reproduce, simplify, read backwards, divide and conquer—apply whether the symptom is a segfault or a stuck BEAM process. For concurrency, “reproduce” often means stress plus fault injection, not a single lucky curl.
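To make "stress plus fault injection" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Counter` class, the `lost_updates` helper, and the `pause` parameter are illustrative names, and the injected delay stands in for the crashes, slow networks, or scheduler hiccups a real harness would inject. It shows how widening a race window turns a rare lost update into one that reproduces on demand.

```python
import threading
import time


class Counter:
    """Deliberately racy counter: the read and the write are separate steps."""

    def __init__(self):
        self.value = 0

    def increment(self, pause=0.0):
        current = self.value          # read
        if pause:
            time.sleep(pause)         # injected delay widens the race window
        self.value = current + 1      # write, possibly clobbering another thread


def lost_updates(threads=4, iterations=50, pause=0.0005):
    """Hammer the counter from several threads; return how many increments vanished."""
    counter = Counter()

    def worker():
        for _ in range(iterations):
            counter.increment(pause)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return threads * iterations - counter.value


if __name__ == "__main__":
    print("lost updates with injected delay:", lost_updates())
```

With the delay in place, the overlap between one thread's read and another's write is long enough that the bug shows up on practically every run, which is what makes the later "simplify" and "divide" steps possible.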
Herlihy & Shavit, The Art of Multiprocessor Programming
When locks, lock-free structures, or memory ordering are in play, invariants should be stated in plain language: what must be true before and after each step? The book is dense; even selective chapters upgrade intuition about why a “tiny” race takes weeks to find.
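One way to practice stating invariants is to write them as executable checks around each step. The sketch below is an illustration, not an example from the book: the `Accounts` class and its invariant ("the total is conserved and no balance goes negative") are assumptions chosen because they are easy to state in plain language.

```python
import threading


class Accounts:
    """Toy ledger guarded by one lock, with its invariant written down as code."""

    def __init__(self, balances):
        self._balances = dict(balances)
        self._total = sum(self._balances.values())
        self._lock = threading.Lock()

    def _check_invariant(self):
        # Plain-language invariant: money is neither created nor destroyed,
        # and no account is ever negative.
        assert sum(self._balances.values()) == self._total
        assert all(v >= 0 for v in self._balances.values())

    def transfer(self, src, dst, amount):
        with self._lock:
            self._check_invariant()            # must hold before the step
            if self._balances[src] < amount:
                return False                   # refuse rather than go negative
            self._balances[src] -= amount
            self._balances[dst] += amount
            self._check_invariant()            # and again after the step
            return True

    def snapshot(self):
        with self._lock:
            return dict(self._balances)


def run_concurrent_transfers(bank, rounds=200):
    """Move money back and forth from two threads; the checks run on every step."""
    def shuttle(src, dst):
        for _ in range(rounds):
            bank.transfer(src, dst, 1)

    workers = [
        threading.Thread(target=shuttle, args=("a", "b")),
        threading.Thread(target=shuttle, args=("b", "a")),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return bank.snapshot()
```

If the lock is removed, the same assertions can fail between the two balance updates, which is the kind of "tiny" race the invariant is there to expose.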
Practices that paid off
- Record-replay and time-travel debuggers where available—expensive to set up, cheap at incident o’clock.
- Stress tests with injected crashes (the 2025 workload theme).
- Sequence numbers in logs so “event A before B” survives async reordering.
- Distributed traces (see OpenTelemetry Traces Across Python and Java) that tie HTTP/gRPC spans to internal task lifetimes.
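The sequence-number practice can be sketched in a few lines. This is a single-process illustration with hypothetical names (`SequencedLog`, `restore_order`); across processes or nodes you would need something more, such as a per-node prefix or a logical clock, which is not shown here.

```python
import itertools
import threading


class SequencedLog:
    """Stamps every record with a per-process, monotonically increasing number."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def emit(self, event, **fields):
        # The lock makes "take the next number" a single step, so two threads
        # can never receive the same sequence number.
        with self._lock:
            seq = next(self._counter)
        return {"seq": seq, "event": event, **fields}


def restore_order(records):
    """Recover emission order even if an async pipeline delivered records shuffled."""
    return sorted(records, key=lambda r: r["seq"])


if __name__ == "__main__":
    log = SequencedLog()
    a, b, c = log.emit("A"), log.emit("B"), log.emit("C")
    arrived = [c, a, b]  # simulated out-of-order delivery
    print([r["event"] for r in restore_order(arrived)])
```

Timestamps alone tend to be too coarse or too skewed for this; the counter gives an order that does not depend on clocks.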
Cross-training runtimes
BEAM observer and tracing, Rust backtraces, and OpenTelemetry spans together shrink the space of unknown unknowns. No single tool wins; correlation IDs are the glue.
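As a rough illustration of correlation IDs as glue, the sketch below uses Python's standard `contextvars` and `asyncio` rather than any of the runtimes named above. The `request_id` variable, the `log` helper, and the `rerank`/`fetch` worker names are hypothetical. The point is that an ID set once at the request boundary follows every child task without being passed by hand.

```python
import asyncio
import contextvars
import uuid

# Hypothetical correlation ID carried implicitly through async tasks.
request_id = contextvars.ContextVar("request_id", default="-")


def log(message, sink):
    # Every log line carries the ID of the request it belongs to.
    sink.append(f"request_id={request_id.get()} {message}")


async def worker(name, sink):
    await asyncio.sleep(0)  # yield so tasks from different requests interleave
    log(f"{name} done", sink)


async def handle(sink, rid=None):
    # Set the ID once at the boundary; child tasks copy the context on creation.
    request_id.set(rid or uuid.uuid4().hex)
    log("start", sink)
    await asyncio.gather(worker("rerank", sink), worker("fetch", sink))
    log("end", sink)


async def main():
    sink = []
    await asyncio.gather(handle(sink, "req-a"), handle(sink, "req-b"))
    return sink


if __name__ == "__main__":
    for line in asyncio.run(main()):
        print(line)
```

Lines from the two requests interleave in the output, yet filtering on `request_id` separates them again, which is the same join you would perform across BEAM, Rust, and OpenTelemetry output.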
Reproducible incident checklist (printable)
- Freeze the world — Git SHA, container digest, OTP/Rust toolchain versions, feature flags, traffic slice.
- One graph — Pick one metric that hurts (p99, error rate, mailbox depth); screenshot with absolute time axis.
- Minimal replay — A script that replays one failing request against staging, reusing the seeds from A Language-Agnostic Concurrent Workload for 2025 Comparisons when possible.
- Divide — Binary-search components: disable rerank, disable RAG, shrink concurrency—same tactic as Agans’ divide-and-conquer.
- Write the invariant — One sentence: “We assumed X; under load Y became false.” File it beside the fix.
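The "Divide" step can be sketched as a binary search over toggles. The code below is a simplification under stated assumptions: exactly one component is responsible, the failure reproduces deterministically, and `fails_with` is a hypothetical callback that runs the minimal replay with only the given components enabled. Real incidents with interacting causes need something closer to delta debugging.

```python
def find_culprit(components, fails_with):
    """Binary-search a list of components for the one that triggers the failure.

    fails_with(enabled) must return True when the failure reproduces with only
    the components in `enabled` switched on. Assumes a single culprit.
    Returns the culprit and the number of replays it took.
    """
    candidates = list(components)
    probes = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        probes += 1
        if fails_with(set(first_half)):
            candidates = first_half        # failure survives: culprit is in here
        else:
            candidates = candidates[mid:]  # failure gone: culprit is in the rest
    return candidates[0], probes


if __name__ == "__main__":
    toggles = ["rerank", "rag", "cache", "high_concurrency"]
    # Stand-in for a real replay script: pretend "rag" is the broken component.
    culprit, probes = find_culprit(toggles, lambda enabled: "rag" in enabled)
    print(culprit, probes)
```

Each probe halves the search space, so even a long list of flags needs only a handful of replays, provided the replay from the previous checklist item is reliable.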
Tooling matrix (what to reach for first)
| Symptom stack | First tool | Second tool |
|---|---|---|
| BEAM tail latency | observer, process_info | recon / etop |
| Tokio stalls | tokio-console, tracing spans | perf / flamegraph |
| Cross-service | OpenTelemetry trace by request_id | Log join on span id |
Conclusion
Tools and languages rotate; methodical debugging does not. Architecture for Evolvable Services After Polyglot Expansion picks up the architectural question of what happens when Gleam/Rust workers join Python/Java systems, where boundaries matter more than logos.