Preamble

Concurrent bugs laugh at printf. Ordering issues hide until load, timing, and deployment topology align just wrong. September is the month to stop pretending heroics scale: systematic methods and canonical books belong in the same toolkit as traces and benchmarks.


Agans, Debugging

Agans’ habits—reproduce, simplify, read backwards, divide and conquer—apply whether the symptom is a segfault or a stuck BEAM process. For concurrency, “reproduce” often means stress plus fault injection, not a single lucky curl.
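"Stress plus fault injection" can be as small as a script. The sketch below is a minimal, invented harness (the `FlakyCounter` class and its fault-injection probability are illustrative, not from any library): it seeds a classic read-modify-write race, then hammers it from many threads so the lost updates show up on demand instead of once a quarter.

```python
import random
import threading
import time

class FlakyCounter:
    """A counter with a deliberately wide read-modify-write window (the seeded bug)."""
    def __init__(self):
        self.value = 0

    def incr(self):
        v = self.value              # read
        if random.random() < 0.5:   # fault injection: sometimes widen the race window
            time.sleep(0.0001)
        self.value = v + 1          # write (may clobber a concurrent increment)

def stress(n_threads=8, n_ops=200):
    """Hammer the counter from many threads; return the count of lost updates."""
    counter = FlakyCounter()

    def worker():
        for _ in range(n_ops):
            counter.incr()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads * n_ops - counter.value

print("lost updates:", stress())
```

The injected delay is the whole trick: it turns a race that fires once a month in production into one that fires every run on a laptop, which is what "reproduce" means for concurrency.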


Herlihy & Shavit, The Art of Multiprocessor Programming

When locks, lock-free structures, or memory ordering are in play, invariants should be stated in plain language: what must be true before and after each step? The book is dense; even selective chapters upgrade intuition about why a “tiny” race takes weeks to find.
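One way to keep the invariant in plain language and executable at the same time is to assert it before and after every mutation. A minimal sketch (the `BoundedQueue` here is illustrative, not from the book):

```python
import threading
from collections import deque

class BoundedQueue:
    """Plain-language invariant: at every quiescent point,
    0 <= len(self.items) <= self.capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.lock = threading.Lock()
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def _check(self):
        # The invariant, executable: runs before and after every mutation.
        assert 0 <= len(self.items) <= self.capacity

    def push(self, item):
        with self.not_full:
            self._check()                              # precondition
            while len(self.items) == self.capacity:
                self.not_full.wait()
            self.items.append(item)
            self.not_empty.notify()
            self._check()                              # postcondition

    def pop(self):
        with self.not_empty:
            self._check()                              # precondition
            while not self.items:
                self.not_empty.wait()
            item = self.items.popleft()
            self.not_full.notify()
            self._check()                              # postcondition
            return item
```

When the "tiny" race does fire, the assertion names the step where the stated invariant broke, which is most of the diagnosis.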


Practices that paid off

  • Record-replay and time-travel debuggers where available—expensive to set up, cheap at incident o’clock.
  • Stress tests with injected crashes (the 2025 workload theme).
  • Sequence numbers in logs so “event A before B” survives async reordering.
  • Distributed traces (OpenTelemetry Traces Across Python and Java) tying HTTP/gRPC spans to internal task lifetimes.
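The sequence-number bullet is cheap to implement. A minimal sketch, assuming JSON-structured logs (the helper name `log_event` is invented for illustration):

```python
import itertools
import json
import threading

_seq = itertools.count()       # process-wide monotonic sequence
_seq_lock = threading.Lock()   # explicit lock, so the scheme ports beyond CPython

def log_event(event, **fields):
    """Emit a structured log line carrying a per-process sequence number."""
    with _seq_lock:
        n = next(_seq)
    return json.dumps({"seq": n, "event": event, **fields})

# Even if the transport reorders these lines downstream, sorting by "seq"
# recovers the emission order within this process.
records = [log_event("A", request_id="r1"), log_event("B", request_id="r1")]
```

Timestamps alone cannot prove "A before B" once async shipping reorders lines; a monotonic counter per process can.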

Cross-training runtimes

The BEAM’s observer and tracing, Rust backtraces, and OpenTelemetry spans together shrink the space of unknown unknowns. No single tool wins; correlation IDs are the glue.
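In Python, that glue can ride on `contextvars`, so the ID follows a request through async hops without threading it through every signature. A minimal sketch (the `start_request` and `log` helpers are invented for illustration):

```python
import contextvars
import uuid

# One ContextVar per process; each task/request context sees its own value.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

def start_request():
    """Mint a correlation ID at the edge; everything downstream inherits it."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

def log(message):
    # Every log line carries the ID, so logs join with traces and metrics on it.
    return f"[{correlation_id.get()}] {message}"

cid = start_request()
line = log("reranker timed out")
```

The same ID, stamped into BEAM logs, Rust spans, and OTel attributes, is what lets the three views be laid side by side.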


Reproducible incident checklist (printable)

  1. Freeze the world — Git SHA, container digest, OTP/Rust toolchain versions, feature flags, traffic slice.
  2. One graph — Pick one metric that hurts (p99, error rate, mailbox depth); screenshot with absolute time axis.
  3. Minimal replay — Script that replays one failing request against staging with the same seeds as A Language-Agnostic Concurrent Workload for 2025 Comparisons when possible.
  4. Divide — Binary-search components: disable rerank, disable RAG, shrink concurrency—same tactic as Agans’ divide-and-conquer.
  5. Write the invariant — One sentence: “We assumed X; under load Y became false.” File it beside the fix.
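Step 4 mechanizes cleanly. A sketch of the binary search over feature toggles, assuming a single culprit component and a replay harness you can call repeatedly (`find_culprit` and the toggle names are hypothetical):

```python
def find_culprit(components, fails_with):
    """Binary-search an ordered list of toggles for the one that reproduces
    the failure. fails_with(subset) reruns the minimal replay with only
    `subset` enabled and reports whether it still fails. Assumes a single
    culprit component suffices to trigger the bug."""
    candidates = list(components)
    assert fails_with(candidates), "full set must reproduce the failure first"
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        candidates = half if fails_with(half) else candidates[len(candidates) // 2:]
    return candidates[0]

# Hypothetical run: the replay fails whenever "rerank" is enabled.
culprit = find_culprit(
    ["rag", "rerank", "cache", "batching"],
    lambda subset: "rerank" in subset,
)
```

With n toggles this is log2(n) replay runs instead of n, which is the point of insisting on a scripted minimal replay in step 3.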

Tooling matrix (what to reach for first)

  Symptom stack       | First tool                         | Second tool
  BEAM tail latency   | observer, process_info             | recon / etop
  Tokio stalls        | tokio-console, tracing spans       | perf / flamegraph
  Cross-service       | OpenTelemetry trace by request_id  | log join on span id

Conclusion

Tools and languages rotate; methodical debugging does not. Architecture for Evolvable Services After Polyglot Expansion returns to architecture when Gleam/Rust workers join Python/Java systems—boundaries matter more than logos.