Preamble

Celery moves work off the web tier into worker processes—message passing across machines, spiritually similar to OTP actors but with different delivery and failure guarantees. I reach for it when tasks are long, bursty, or need horizontal scale without tying up HTTP threads.

The sections below spell out how to wire Celery in Python, which concurrency knobs actually matter, and how that model lines up against BEAM/OTP and greenthread-style runtimes.


Step-by-step: Celery in Python

1. Dependencies and layout

Install Celery and a broker client (Redis is the shortest path for learning):

pip install "celery[redis]"

A minimal package layout keeps the application object importable from both the web process and workers:

myapp/
  celery_app.py    # Celery() instance and broker config
  tasks.py         # @app.task definitions

Workers and the API must share the same broker URL and same task names (usually the same codebase or a shared package).

2. Create the application

celery_app.py holds the single Celery instance. Point the broker at Redis (or at RabbitMQ via an AMQP URL), and optionally set a result backend if you use AsyncResult or .get():

# myapp/celery_app.py
from celery import Celery

app = Celery(
    "myapp",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",  # optional; omit if you never read results
    include=["myapp.tasks"],
)

app.conf.update(
    task_serializer="json",
    accept_content=["json"],
    result_serializer="json",
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_time_limit=300,       # hard kill after 5 minutes
    task_soft_time_limit=270,  # SoftTimeLimitExceeded for cleanup first
)

include lists modules that register tasks when the worker starts. Without it, workers can boot with no tasks registered.
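A consequence of task_serializer="json" worth internalizing: every task argument must survive a JSON round trip. A standalone sketch (the dict shape below is illustrative, not Celery's exact wire format):

```python
import json

# With task_serializer="json", arguments are serialized at enqueue time;
# datetimes, sets, and ORM objects fail here, not inside the worker.
args = (2, 3)
kwargs = {"user_id": 42}

wire = json.dumps({"args": args, "kwargs": kwargs})  # what crosses the broker
decoded = json.loads(wire)

assert decoded["args"] == [2, 3]            # tuples come back as lists
assert decoded["kwargs"] == {"user_id": 42}
```

Pass primitive IDs and look the rest up in the task; that keeps messages small and sidesteps serialization surprises.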

3. Define tasks

Tasks are plain callables decorated with @app.task. They are functions, not methods, so there is no self by default; pass bind=True when you need the task instance as self for retries or request metadata (self.request).

# myapp/tasks.py
from myapp.celery_app import app

@app.task
def add(x, y):
    return x + y

@app.task(bind=True, autoretry_for=(ConnectionError,), retry_backoff=True, max_retries=5)
def fetch_external(self, url):
    ...

Naming: the default task name is the import path (myapp.tasks.add). Renaming a task or moving its module breaks messages already sitting in the queue; treat task names as part of your public API.
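To see why names matter, here is a toy registry showing the shape of the contract (not Celery internals): a worker resolves a task by the name carried in the message, never by Python object identity.

```python
# Toy name -> callable registry; Celery keeps an equivalent mapping per app.
registry = {}

def task(name):
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator

@task("myapp.tasks.add")
def add(x, y):
    return x + y

# A queued message carries only the name plus serialized arguments:
message = {"task": "myapp.tasks.add", "args": [2, 3]}
result = registry[message["task"]](*message["args"])
assert result == 5  # rename the task and this lookup raises KeyError
```

Any message enqueued under the old name after a rename is undeliverable, which is exactly why names are API.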

4. Run a worker

From the project root (so myapp is on PYTHONPATH):

celery -A myapp.celery_app worker --loglevel=info

Useful flags in production:

  • --concurrency=N: pool size (meaning depends on pool type; see below).
  • -Q queue1,queue2: consume only named queues.
  • -n worker1@%h: unique worker hostname for inspection.
  • --max-tasks-per-child (prefork): recycle workers to mitigate memory leaks in C extensions.
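Putting the flags together, a representative production invocation (queue and worker names are illustrative):

```shell
celery -A myapp.celery_app worker \
  --loglevel=info \
  --concurrency=8 \
  -Q emails,reports \
  -n email-worker@%h \
  --max-tasks-per-child=500
```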

5. Call tasks from application code

.delay(*args, **kwargs) and .apply_async(...) enqueue work; both return an AsyncResult handle immediately. Reading the result or state through that handle requires a configured result backend.

from myapp.tasks import add

add.delay(2, 3)
add.apply_async(args=(2, 3), countdown=10, queue="priority")

The web request should return quickly; long .get() calls inside HTTP handlers defeat the purpose.

6. Periodic work (optional)

Celery Beat is a separate scheduler process that enqueues tasks on a schedule. Run beat with the same app and broker:

celery -A myapp.celery_app beat --loglevel=info

Prefer idempotent periodic tasks: clocks skew, restarts, and duplicates are normal in distributed systems.
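Schedules live on the app configuration, typically in celery_app.py. A minimal sketch, assuming the app object from step 2 and hypothetical myapp.tasks.cleanup and myapp.tasks.ping tasks:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    "cleanup-hourly": {
        "task": "myapp.tasks.cleanup",  # hypothetical task name
        "schedule": crontab(minute=0),  # top of every hour
    },
    "ping-every-30s": {
        "task": "myapp.tasks.ping",     # hypothetical task name
        "schedule": 30.0,               # plain seconds also work
    },
}
```

Run exactly one beat process per schedule; multiple beats against the same broker will double-enqueue.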

7. Brokers and backends in practice

Redis is simple and fast for many workloads; you trade some durability nuance versus RabbitMQ (AMQP), which fits stricter ordering, routing, and policy needs. The broker is infrastructure: monitored, access-controlled, and sized for peak enqueue rate and depth.

Result backends store return values and task state. If you only need fire-and-forget side effects, you can omit the backend and avoid another failure domain.


Worker pools: where “concurrency” actually lives

Celery’s worker is a process that hosts a pool of execution units. The pool type decides whether you get true parallelism, threads, or greenthreads.

Pool                 Units           Parallel CPU-bound work on CPython   Typical use
prefork (default)    OS processes    Yes (separate interpreters)          CPU work, isolation, default
threads              OS threads      Limited by GIL for pure Python       Mostly I/O-bound, shared memory
solo                 single thread   No extra parallelism                 Debugging, one task at a time
gevent / eventlet    greenthreads    No extra CPU parallelism             Many blocking I/O calls patched to yield

Example:

celery -A myapp.celery_app worker --pool=prefork --concurrency=4
celery -A myapp.celery_app worker --pool=gevent --concurrency=100

prefork spins up N child processes. Each child is a full CPython interpreter: no shared GIL between children, so CPU-bound Python can scale across cores within one machine. Cost: higher memory per worker and serialization of task arguments.

gevent / eventlet monkey-patch the standard library so blocking socket calls cooperatively yield. Throughput for many concurrent I/O-bound tasks can rise dramatically in one OS process, but CPU-heavy Python still contends on the GIL, and not all libraries are patch-safe.

threads pool uses OS threads in one process: good when tasks release the GIL (I/O, many C extensions) or when work is light; poor for CPU-bound pure Python scaling.

Nothing here replaces horizontal scaling: more machines, more worker processes, same queue.


How this compares to BEAM (Erlang/Elixir on OTP)

Similarities

  • Message passing: producers send opaque messages; consumers handle them asynchronously. Conceptually close to send/receive between processes.
  • Failure isolation in spirit: prefork workers resemble separate VMs in the weak sense that a segfault in one child does not crash siblings (the main worker process may respawn children depending on configuration).
  • Distribution: you can run workers on many hosts against one broker, like nodes in a cluster.

Differences

  • Scheduling and preemption: BEAM preempts lightweight processes; a tight loop cannot starve others. Celery child processes are OS processes scheduled by the kernel; greenthread pools are cooperative—a CPU loop blocks the whole pool until it yields.
  • Supervision: OTP supervision trees restart specific failed workers with declared strategies. Celery gives you process restarts and retries at the task level, not a uniform tree of typed workers with transient/permanent semantics.
  • Memory model: Erlang processes are isolated heaps with copying messages (or shared binaries). Celery passes serialized blobs through Redis/RabbitMQ; large payloads stress the broker and deserialization cost.
  • Delivery guarantees: brokers give at-least-once unless you build exactly-once semantics yourself (often impossible end-to-end). BEAM distribution is a different stack; both require you to design idempotent handlers for production.
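The scheduling difference is easy to demonstrate in any cooperative runtime; here asyncio stands in for a greenthread pool (task names are illustrative):

```python
import asyncio
import time

async def heartbeat(events):
    # Well-behaved cooperative task: yields control at every await.
    for _ in range(3):
        events.append("tick")
        await asyncio.sleep(0.01)

async def cpu_hog(events):
    # No await inside: the event loop cannot preempt this.
    time.sleep(0.05)  # stands in for a tight pure-Python CPU loop
    events.append("hog done")

async def main():
    events = []
    hog = asyncio.create_task(cpu_hog(events))
    beat = asyncio.create_task(heartbeat(events))
    await asyncio.gather(hog, beat)
    return events

events = asyncio.run(main())
# The hog runs to completion before a single tick fires.
assert events == ["hog done", "tick", "tick", "tick"]
```

On BEAM the equivalent tight loop would be preempted after a reduction budget; in a cooperative pool the heartbeat simply waits.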

Celery is Python’s pragmatic bridge to distributed task queues; BEAM is a runtime + language + OTP designed around concurrent fault-tolerant systems from the ground up.


How this compares to greenthreads (and asyncio)

Greenthreads (gevent, eventlet, Stackless patterns) multiplex many logical tasks on few OS threads by explicit or patched yield points. asyncio is single-threaded cooperative multitasking with await boundaries instead of monkey-patching.

Compared to in-process greenthreads/asyncio

  • Celery adds persistence, cross-service boundaries, and crash isolation between worker processes. A bad deploy on one worker does not share address space with your web app.
  • Latency: enqueue + fetch + deserialize is milliseconds to tens of ms (or worse under load)—far above an in-memory await or greenlet.switch.
  • Shared state: in one process, greenthreads/asyncio can share memory (carefully). Celery tasks should assume no shared memory; pass data in messages or read from databases.

When Celery wins

  • Work must survive process restarts and spike without overloading the web tier.
  • You need multiple hosts and independent scaling of API vs workers.
  • Tasks call blocking libraries you do not want inside an async event loop.

When in-process concurrency wins

  • Sub-millisecond coordination, fine-grained fan-out, or huge fan-in on one machine.
  • You control the stack (asyncio-native clients, or gevent-safe I/O).

Many systems use both: asyncio (or threads) in the API for I/O, Celery for durable background work.


Idempotency is mandatory

At-least-once delivery means retries. Tasks that charge cards or send email twice without dedupe are incidents waiting to happen. Patterns: stable task IDs, dedupe keys, transactional outbox before enqueue, database uniqueness constraints on side effects.
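A minimal sketch of the dedupe-key pattern: an in-memory set stands in for what production code would do with a database uniqueness constraint or Redis SETNX (names and amounts are illustrative).

```python
# Dedupe keys make a side-effecting task safe under redelivery.
processed = set()
charges = []

def charge_card(dedupe_key, amount_cents):
    if dedupe_key in processed:
        return "duplicate"          # redelivery: skip the side effect
    processed.add(dedupe_key)       # in production: INSERT with a UNIQUE key
    charges.append(amount_cents)
    return "charged"

# At-least-once delivery means the same message can arrive twice:
assert charge_card("order-123", 5000) == "charged"
assert charge_card("order-123", 5000) == "duplicate"
assert charges == [5000]            # the card was charged exactly once
```

In real code the check and the insert must be one atomic operation (the unique constraint provides that); the check-then-act above is only safe because it is single-threaded.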


Observability

Structured logs with task_id and correlation to HTTP requests (OpenTelemetry Traces Across Python and Java) make “why did this job run twice?” answerable. Celery’s own metrics (queue depth, worker health) belong on dashboards.


Conclusion

Celery gives you queue-backed, horizontally scalable concurrency with explicit trade-offs: serialization cost, at-least-once semantics, and pool choice (prefork vs greenthread pools) that mirror broader lessons about OS processes, the GIL, and cooperative scheduling. BEAM/OTP remains the tighter integration of language + VM + supervision; Celery remains the boring integration of Python + broker + workers that fits most CRUD-and-jobs architectures. For event streams across Python and Java, see the companion notes, Kafka Producers and Consumers: Python and Java Notes.