Preamble
asyncio.gather is the convenience hammer for “run these coroutines concurrently and collect results.” The footgun is failure semantics. With default settings, the first exception propagates to whoever awaits gather, but the sibling awaitables are not cancelled; they keep running in the background unless you cancel them yourself. return_exceptions=True flips the story: child exceptions are not raised, and the result list becomes a mix of values and exception objects, in the same order as the inputs. Neither is “wrong”; unchosen semantics are wrong.
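A minimal sketch of the two behaviors, assuming Python 3.9+ and a hypothetical fetch coroutine that stands in for real I/O (the names and delays are illustrative only):

```python
import asyncio


async def fetch(name: str, delay: float, fail: bool = False) -> str:
    # Hypothetical stand-in for a network call.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} ok"


async def default_semantics() -> tuple[str, list[str]]:
    finished: list[str] = []

    async def tracked(name: str, delay: float, fail: bool = False) -> str:
        result = await fetch(name, delay, fail)
        finished.append(name)  # only reached if fetch() succeeded
        return result

    error = ""
    try:
        # Default: the first exception is raised here...
        await asyncio.gather(tracked("a", 0.01, fail=True), tracked("b", 0.05))
    except RuntimeError as exc:
        error = str(exc)
    # ...but "b" was not cancelled. Wait long enough to see it finish anyway.
    await asyncio.sleep(0.1)
    return error, finished


async def collected_semantics() -> list[object]:
    # return_exceptions=True: exceptions come back as ordinary list items.
    return await asyncio.gather(
        fetch("a", 0.01, fail=True),
        fetch("b", 0.02),
        return_exceptions=True,
    )
```

In the first function, "b" completes even though gather has already raised, which is exactly the background work that is easy to forget about.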
Choosing gather’s behavior
For independent tasks where partial success matters (fan-out to mirrors), return_exceptions=True lets me decide per result. For all-or-nothing work, I often prefer asyncio.TaskGroup (Python 3.11+), which cancels the remaining children when one fails, or explicit cancellation on the first failure. Policy belongs in code, not in accident.
Cancellation and resource hygiene
On failure paths, cancel dangling tasks that hold sockets or semaphores, then await them so their cleanup code actually runs. “Zombie” coroutines are how you leak connections and then blame asyncio in blog posts. Give tasks names and include those names in structured logs so traces map to intent.
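One way to sketch that cleanup path, assuming plain tasks rather than a TaskGroup. The worker coroutine, the task names, and the released list (a stand-in for closing sockets or releasing semaphores) are all hypothetical:

```python
import asyncio


async def worker(name: str, delay: float, released: list[str], fail: bool = False) -> str:
    try:
        await asyncio.sleep(delay)
        if fail:
            raise RuntimeError(f"{name} failed")
        return name
    finally:
        # Stand-in for closing a socket or releasing a semaphore.
        released.append(name)


async def run_with_cleanup() -> tuple[list[str], list[str], list[str]]:
    released: list[str] = []
    tasks = [
        asyncio.create_task(worker("fetch-a", 0.01, released, fail=True), name="fetch-a"),
        asyncio.create_task(worker("fetch-b", 1.0, released), name="fetch-b"),
    ]
    # Return as soon as any task raises (or all of them finish).
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)

    cancelled = []
    for task in pending:
        task.cancel()
        cancelled.append(task.get_name())
    # Await the cancelled tasks so their finally blocks actually run.
    await asyncio.gather(*pending, return_exceptions=True)

    # Retrieve exceptions from finished tasks; the names are what I would log.
    failed = [t.get_name() for t in done if t.exception() is not None]
    return sorted(cancelled), sorted(failed), sorted(released)
```

Awaiting the cancelled tasks matters as much as cancelling them: cancel() only requests cancellation, and the finally block runs when the task is next scheduled.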
Debugging
When hangs appear, dump asyncio.all_tasks() in staging, inspect who waits on whom, and verify that no blocking call slipped into the event loop thread. The parallel to 2025’s BEAM/NIF discussion is real: blocking the runner starves everyone.
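A small sketch of such a dump, assuming it is called from inside the running loop. The describe_tasks helper and the "stuck-sleeper" task are hypothetical, made up for illustration:

```python
import asyncio


def describe_tasks() -> list[str]:
    # Must run inside the event loop; all_tasks() returns unfinished tasks.
    lines = []
    for task in asyncio.all_tasks():
        coro = task.get_coro()
        where = getattr(coro, "__qualname__", repr(coro))
        lines.append(f"{task.get_name()}: {where} done={task.done()}")
    return sorted(lines)


async def demo() -> list[str]:
    # Simulate a task that never finishes on its own.
    stuck = asyncio.create_task(asyncio.sleep(3600), name="stuck-sleeper")
    await asyncio.sleep(0)  # give the new task a chance to start
    report = describe_tasks()
    # Clean up so the demo itself does not leak the task.
    stuck.cancel()
    await asyncio.gather(stuck, return_exceptions=True)
    return report
```

For more detail, task.print_stack() prints where each task is suspended, and running with asyncio.run(main(), debug=True) makes asyncio log callbacks that hold the loop for longer than loop.slow_callback_duration, which is a reasonable first check for a blocking call.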
Conclusion
Explicit failure policy is architecture. “Virtual Threads: A Mental Model for Massive I/O Concurrency” compares this mental model to Java virtual threads: different coloring, same need to classify CPU-bound versus I/O-bound work.