Preamble
OTP supervisors encode restart policy as a first-class language of operations: which children restart, in what order, and when to give up and escalate. Rust services more often compose tokio::spawn, JoinHandle, select!, and explicit loops that decide what “healthy” means. Neither approach removes failure; both route it.
June connects the benchmark crashes from earlier months to how real services stay alive without human hand-holding.
BEAM: let it crash, but on purpose
The BEAM philosophy isolates faults in processes and restarts them under policy. Supervision trees are not decorative—misconfigured intensity and period windows cause restart storms that are worse than a clean crash. Restart limits deserve the same documentation discipline as circuit breakers in HTTP clients.
Rust: make invalid states hard, then decide restarts
Rust prefers types that rule out bad combinations and tasks with explicit ownership. Panics can abort threads or tasks depending on runtime settings; catch_unwind is a scalpel, not a blanket. For long-running services, a top-level supervisor loop that restarts subsystems, backs off, and surfaces metrics is still warranted—architecture, not hidden magic.
Cancellation propagation via tokio-util or structured shutdown connects to the Send/Sync discussion in Send, Sync, and Fearless Concurrency in Rust: tasks must agree on lifetimes.
Comparison without caricature
BEAM restarts are runtime-native; Rust restarts are library- and application-native. Both require design. BEAM-heavy stacks fit when hot code upgrades and OTP idioms match the org; Rust when FFI, predictable native performance, or embedded constraints dominate—often both appear as polyglot edges around a core domain.
Documentation as policy
Restart semantics belong in ADRs alongside API contracts—same instinct as Architecture for Evolvable Services After Polyglot Expansion on ADRs and README-driven records. If on-call cannot answer “what happens when worker N dies,” the design is incomplete.
Code: OTP supervisor vs Rust watchdog loop
OTP (supervisor child spec) — intensity/period cap restart storms:
```erlang
-module(sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one,
                 intensity => 10,
                 period => 60},
    Child = #{id => worker,
              start => {worker, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker,
              modules => [worker]},
    {ok, {SupFlags, [Child]}}.
```
Rust (explicit supervisor task) — no VM magic; you own backoff and metrics:
```rust
use std::time::Duration;

async fn supervise_worker() {
    let mut backoff = Duration::from_millis(50);
    loop {
        let handle = tokio::spawn(async {
            // Workload from spec: recv job, compute, sleep, emit JSONL metric
            // (see Rust/Tokio harness)
        });
        match handle.await {
            // Clean exit: reset backoff before relaunching.
            Ok(()) => backoff = Duration::from_millis(50),
            // A panic in the task surfaces here as a JoinError.
            Err(join_err) => eprintln!("worker panic: {join_err}"),
        }
        tokio::time::sleep(backoff).await;
        backoff = (backoff * 2).min(Duration::from_secs(5));
    }
}
```
Trade-off table:
| | BEAM supervision | Rust watchdog |
|---|---|---|
| Pros | Mature defaults, hot code upgrade story, standardized logs | Full control in one language, no external supervisor app |
| Cons | Mis-tuned intensity → flapping | Easy to ship naive loops without metrics unless you add them |
Conclusion
Architecture is how failures move through the system. BEAM Scheduler Internals: A Practitioner’s View zooms into BEAM scheduler internals to explain tail latency; Send, Sync, and Fearless Concurrency in Rust maps Rust’s Send/Sync bounds onto shared aggregators from the benchmark.