Celery in production — broker choice, retry semantics, and what Flower actually tells you

Celery is the default answer to "we need background jobs in Python," and for good reason. It has been the de-facto choice for over a decade, the ecosystem is mature, and it integrates with everything. It is also one of the most consistently mis-deployed pieces of Python infrastructure we see. The patterns that worked for a 2-developer startup at 5 jobs per minute do not survive contact with 50 jobs per second, multi-region workers, and tasks that occasionally take 40 minutes.

This post is the playbook we apply to every Celery deployment we run on managed Python — broker choice, queue topology, task idempotency, retry policy, and what to actually monitor.

Broker choice: Redis vs RabbitMQ in 2026

Both are supported first-class brokers. The decision is not as one-sided as the loudest opinions on either side suggest.

Redis is simpler. It is almost always already in your stack as a cache. Latency is lower (sub-millisecond enqueue). Operating it is trivial — one process, one config file, one persistence model to think about. We use Redis as the Celery broker for roughly 70% of customer deployments.

The trade-offs:

Persistence is best-effort. With appendonly yes and appendfsync everysec, you can lose up to one second of enqueued tasks on a hard crash. For most workloads this is fine. For "this task triggers a customer-facing email and must not be lost," it is not.
No native priority queues, no native dead-letter exchanges. You implement these as separate Redis keys/queues by convention. It works, but you are building primitives RabbitMQ ships.
Visibility timeout instead of acknowledgements. Redis-backed Celery uses a visibility timeout: if a worker pulls a task and does not finish in visibility_timeout seconds (default 1 hour), the task becomes visible again and another worker picks it up. Set this lower than your longest-running task and you get duplicate execution. Set it longer and crashed workers leave their tasks dangling for an hour.

RabbitMQ is the right choice when:

You need durable persistence with acknowledgement semantics — a task is not removed from the queue until the worker explicitly acks it after completion.
You want priority queues, TTLs per message, dead-letter exchanges, fan-out to multiple consumers.
You are running more than ~10,000 tasks/second sustained, where Redis single-threaded throughput starts to bite.

The operational overhead is real. RabbitMQ has more knobs (vhosts, exchange types, queue mirroring/quorum queues), more failure modes (split-brain, queue master failover), and more memory pressure considerations. We typically only run it for customers whose workload demands it. When we do, we use quorum queues rather than the legacy mirrored classic queues — quorum queues are the modern, Raft-based, safer default.

For the rest, Redis with appendonly yes, appendfsync everysec, and AOF persistence pinned to a network-attached disk is good enough.

Queue topology: do not put everything on `celery`

The single most common pathology: every task ends up on the default queue named celery, every worker consumes from that queue, and a flood of slow tasks (say, PDF generation) starves out fast tasks (say, sending password-reset emails).

Our default topology is queue-by-task-class:

# celery_app.py
from celery import Celery
 
app = Celery("myapp")
app.conf.task_routes = {
    "myapp.tasks.email.*":     {"queue": "email"},     # fast, latency-sensitive
    "myapp.tasks.reports.*":   {"queue": "reports"},   # slow, batch
    "myapp.tasks.webhooks.*":  {"queue": "webhooks"},  # I/O bound, third-party
    "myapp.tasks.ml.*":        {"queue": "ml"},        # CPU/GPU heavy
}
 
app.conf.task_default_queue = "default"

Then run distinct worker pools per queue, sized differently:

# Fast lane — high concurrency, gevent for I/O concurrency, no prefetch
celery -A myapp worker -Q email,webhooks -P gevent -c 200 --prefetch-multiplier=1
 
# CPU lane — process-based, low concurrency, fair scheduling
celery -A myapp worker -Q ml -P prefork -c 2 --prefetch-multiplier=1
 
# Batch lane — moderate concurrency, longer time limit
celery -A myapp worker -Q reports -P prefork -c 4 --time-limit=3600

The --prefetch-multiplier=1 is non-negotiable for any queue with non-uniform task duration. The default is 4, meaning each worker reserves 4 tasks at a time. If three of those tasks are 30-second jobs and one is a 5ms job, the 5ms job waits in line behind the slow ones — on a specific worker — even though other workers are idle. Prefetch 1 means tasks are pulled one at a time, and the scheduler can spread them across workers properly.

Task design: idempotent or you have a bug

The single rule we enforce: every Celery task must be idempotent. Running it twice with the same arguments must produce the same final state as running it once. This is not a stylistic preference. Celery (with any broker) provides at-least-once delivery; "exactly once" does not exist over a network with unreliable workers.

In practice, this means:

@app.task(bind=True, max_retries=5)
def charge_customer(self, order_id: str):
    order = Order.objects.get(id=order_id)
    if order.charged_at is not None:
        # Already charged — return without re-charging
        return order.charge_id
    
    idempotency_key = f"charge-{order_id}"
    charge = stripe.Charge.create(
        amount=order.total_cents,
        currency="aud",
        source=order.customer.stripe_token,
        idempotency_key=idempotency_key,
    )
    order.charge_id = charge.id
    order.charged_at = timezone.now()
    order.save()
    return charge.id

Three idempotency layers stacked: the database charged_at check, the Stripe idempotency_key (so even if we hit Stripe twice, only one charge happens), and the explicit return if the work is already done. Any one of these alone would be insufficient; together they make duplicate execution safe.

The cost of building idempotency in is one engineering decision per task. The cost of not building it in is one production incident per task, plus a refund process.

Retry policy: exponential backoff with jitter

The Celery default retry is "retry immediately." This is almost always wrong. When a task fails because a downstream API is having an outage, retrying immediately just adds to the load on that API and amplifies the outage.

Our default retry decorator:

@app.task(
    bind=True,
    autoretry_for=(requests.RequestException, ConnectionError),
    retry_backoff=True,          # exponential: 1s, 2s, 4s, 8s, ...
    retry_backoff_max=600,       # cap individual delay at 10 minutes
    retry_jitter=True,           # ±25% jitter
    max_retries=8,               # total elapsed time ~30 minutes
    acks_late=True,
)
def call_external_api(self, payload):
    response = requests.post(THIRDPARTY_URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()

Key points:

autoretry_for is restrictive. We retry only on the exceptions we know are transient (network, 5xx). We do not retry on ValueError or business-logic exceptions — those mean the task is broken and retrying does not help.
retry_backoff + retry_jitter prevents synchronised retry storms. Without jitter, all the tasks that failed at the same moment retry at the same moment.
acks_late=True changes the ack semantics: the task is only acknowledged after the worker successfully completes. If the worker crashes mid-task, the broker redelivers. This is what makes "worker pod gets evicted" survivable.

For tasks where retry must eventually give up cleanly:

@app.task(bind=True, max_retries=8, ...)
def send_notification(self, user_id, message):
    try:
        notification_service.send(user_id, message)
    except notification_service.TransientError as exc:
        raise self.retry(exc=exc)
    except notification_service.PermanentError:
        # Don't retry — log to dead-letter, alert humans
        dead_letter.record("send_notification", user_id, message)
        return

Monitoring: Flower, but also Prometheus

Flower is the canonical Celery dashboard. We run it for every customer, but it is not enough by itself.

What Flower gives you:

Live worker count and per-worker active tasks
Task counts (succeeded, failed, retried, revoked) by name
A searchable history of recent task executions
Real-time broker queue length

What Flower does not give you well:

Long-term historical trends — Flower's storage is in-memory by default
Alerting on queue depth, task latency, or worker count drops
Per-queue saturation views

We pair Flower with celery-prometheus-exporter (or, more recently, celery-exporter) for metrics and Grafana for dashboards. The metrics that matter:

celery_queue_length{queue="..."} — alert when sustained > 500 for more than 5 minutes. This is your "workers are falling behind" signal.
celery_task_runtime_seconds_bucket{task="..."} p95 and p99 — alert on regressions, not absolutes.
celery_workers by queue — alert if it drops below the expected count for >2 minutes.
celery_task_failed_total rate — alert on spikes, not background failures.

We also instrument tasks with OpenTelemetry so a slow task can be traced end-to-end through whatever it called downstream. Distributed tracing across Celery boundaries (using apply_async headers to propagate trace context) is the single best diagnostic when "the report task got slow" needs to be answered with "because the database query inside it got slow because the index changed."

Beat: do not run two

celery beat is the periodic task scheduler. It is a single process, by design. If you run two celery beat instances "for redundancy," you get duplicate executions of every scheduled task.

The supported patterns for HA beat:

Pin beat to one specific pod with a Deployment of replicas: 1 and a topology that ensures it restarts cleanly.
Use celery-redbeat if you need genuinely active-active beat with Redis-based locking.
Use Kubernetes CronJobs instead of celery beat for anything that does not need beat-specific features. CronJobs handle the leader election problem at the Kubernetes layer, which is where it belongs.

We default to option three for new deployments. CronJobs that enqueue a Celery task via apply_async are operationally simpler, observable through Kubernetes-native tooling, and survive pod restarts cleanly.

What we ship by default

For Celery deployments we operate on GCP, AWS, or any other cloud:

Redis or RabbitMQ broker, provisioned with persistence, replication, and monitored backups.
A separate queue per task class, with a separate worker pool per queue, sized from observed load.
A standard retry policy template (autoretry_for, retry_backoff, retry_jitter, acks_late=True) applied to every new task.
Flower behind SSO, plus celery-exporter feeding Prometheus and Grafana, plus alerts on queue depth, worker count, and failure rate.
Idempotency review as part of every new-task PR — we treat non-idempotent Celery tasks as a code-review blocker.

Celery is not glamorous, and the patterns above are not novel. They are the patterns that make the difference between a Celery setup that hums along quietly for years and one that wakes someone up every fortnight. Most teams arrive at these the hard way; we arrived at them the hard way once and now apply them everywhere.

Sudhanshu K. is a Staff DevOps engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). He has personally debugged enough duplicate-charge incidents to be unreasonable about idempotency.