Laravel Horizon in production — sizing workers, surviving Redis, and the retry strategy that doesn't lose jobs

Most Laravel applications start with QUEUE_CONNECTION=sync and then graduate to database queues when something slow needs offloading. That works for a while. By the time you're moving 50,000 jobs a day, the database queue is contending with your application writes and you've discovered that queue:work started from supervisord doesn't really tell you anything useful about throughput, failure rate, or memory growth.

Horizon is the answer Laravel ships, and it's a good one — but only if you configure it for the workload you actually have. This is the configuration we apply to the managed Laravel applications we run for customers, the numbers we look at, and the retry strategy that keeps jobs from being silently dropped.

The four numbers that drive every Horizon decision

Before any config change, we want four numbers from the customer's existing setup:

Peak jobs per second dispatched (not processed — dispatched)
Median job runtime, by queue
p95 job runtime, by queue
Memory footprint of a single worker after 1,000 jobs

Without these, you're tuning blind. We get them from a week of Horizon metrics or, if Horizon isn't running yet, from the database queue: timestamps in the jobs table, plus failed_jobs for failure rates.

Concrete recent numbers from a SaaS customer:

Peak dispatch: 180 jobs/sec
Median job runtime: 240ms
p95 job runtime: 4.2s (image processing)
Worker memory after 1,000 jobs: 142MB → 168MB (slow leak in a vendor SDK)

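These four numbers tell us we need roughly 18 workers to keep p95 latency under 30 seconds. That is fewer than the roughly 43 busy workers (180 jobs/sec × 0.24 s) a sustained peak would occupy, so the figure only holds when peaks are short enough for the backlog to drain inside the latency budget. The numbers also tell us that workers should restart every 1,000 jobs (the memory growth doesn't justify keeping them around longer), and that long-running image jobs need a separate queue so they don't head-of-line-block the fast ones.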
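The sizing arithmetic can be sketched with Little's law (busy workers ≈ arrival rate × service time). This is an illustrative sketch rather than the exact method behind the figures above: the function names are hypothetical, the median runtime stands in for the mean, and the backlog model assumes a constant dispatch rate during the burst.

```php
<?php
declare(strict_types=1);

/**
 * Workers needed to keep up with a dispatch rate that never lets up
 * (Little's law: concurrency = arrival rate x service time).
 */
function workersForSustainedRate(float $jobsPerSecond, float $runtimeSeconds): int
{
    return (int) ceil($jobsPerSecond * $runtimeSeconds);
}

/**
 * How long a burst can last before the newest job waits longer than
 * $maxWaitSeconds, for a pool smaller than the sustained rate requires.
 * Returns INF when the pool keeps up and no backlog forms.
 */
function burstSecondsWithinBudget(
    float $jobsPerSecond,
    float $runtimeSeconds,
    int $workers,
    float $maxWaitSeconds
): float {
    $drainRate = $workers / $runtimeSeconds;   // jobs/sec the pool can finish
    $growth    = $jobsPerSecond - $drainRate;  // backlog growth in jobs/sec

    if ($growth <= 0) {
        return INF;
    }

    // A job arriving t seconds into the burst waits (growth * t) / drainRate.
    return $maxWaitSeconds * $drainRate / $growth;
}

echo workersForSustainedRate(180, 0.24), PHP_EOL;                     // 44
echo round(burstSecondsWithinBudget(180, 0.24, 18, 30), 1), PHP_EOL;  // 21.4
```

Under these assumptions, 18 workers absorb a 180 jobs/sec burst for about 21 seconds before waits pass 30 seconds, which is why the shape of the peak matters as much as its height.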

Queue isolation — the single biggest win

The most impactful change we make on almost every Horizon migration: split the queue. Default Laravel apps dispatch everything to default, which means a 12-second PDF render is sitting in front of 400 outbound webhooks that each take 80ms.

Our standard split for a typical Laravel SaaS:

// config/horizon.php
'environments' => [
    'production' => [
        'supervisor-default' => [
            'connection' => 'redis',
            'queue' => ['default'],
            'balance' => 'auto',
            'minProcesses' => 2,
            'maxProcesses' => 12,
            'balanceMaxShift' => 2,
            'balanceCooldown' => 3,
            'tries' => 3,
            'timeout' => 60,
            'memory' => 256,
        ],
        'supervisor-realtime' => [
            'connection' => 'redis',
            'queue' => ['realtime'],
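            // 'simple' does not auto-scale: the pool stays at maxProcesses, split evenly
            // across the supervisor's queues, and minProcesses has no effect
            'balance' => 'simple',
            'minProcesses' => 4,
            'maxProcesses' => 8,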
            'tries' => 5,
            'timeout' => 15,
            'memory' => 192,
        ],
        'supervisor-heavy' => [
            'connection' => 'redis',
            'queue' => ['pdf', 'imports', 'reports'],
            'balance' => 'auto',
            'minProcesses' => 1,
            'maxProcesses' => 4,
            'tries' => 2,
            'timeout' => 900,
            'memory' => 512,
        ],
        'supervisor-notifications' => [
            'connection' => 'redis',
            'queue' => ['mail', 'sms', 'webhooks'],
            'balance' => 'auto',
            'minProcesses' => 2,
            'maxProcesses' => 16,
            'tries' => 5,
            'timeout' => 30,
            'memory' => 192,
        ],
    ],
],

The four supervisors do four different jobs, with four different timeouts and retry counts. A failing webhook retried 5 times every 90 seconds doesn't starve a PDF render. A 12-minute import doesn't push a password reset email out by 20 minutes.

This split is the first thing we do when onboarding a new managed Laravel customer. Almost universally, it cuts p95 job latency by 60-80% without adding a single CPU.

`balance => 'auto'` and what it actually does

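Horizon's auto balancer is the headline feature and the one most teams misuse. It scales the number of worker processes on each queue up and down with that queue's load, inside the bounds set on the supervisor. It never goes past those bounds: once a supervisor is at maxProcesses, a busy queue can only gain workers that another queue in the same supervisor gives up.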

This matters because:

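With balance set to auto, minProcesses is the floor for each queue and maxProcesses is the ceiling for the supervisor as a whole: a supervisor watching three queues with minProcesses: 2 and maxProcesses: 12 runs between 6 and 12 processes, regardless of load
balanceMaxShift is the most processes Horizon will add or remove in one adjustment
balanceCooldown is the number of seconds between adjustments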

For a workload with sharp spikes, we use balanceMaxShift: 3 and balanceCooldown: 2. For a workload that's mostly steady, balanceMaxShift: 1 and balanceCooldown: 5 to prevent constant churn. The wrong values here produce a sawtooth pattern where Horizon is spending more time spawning and reaping workers than processing jobs.
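To compare settings, it helps to estimate how long a supervisor takes to ramp from its floor to its ceiling. The sketch below is a simplification: it assumes a single-queue supervisor, demand that stays high for the whole ramp, and exactly one adjustment per cooldown. The function name is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Rough time for a supervisor to grow from $minProcesses to $maxProcesses
 * when at most $maxShift processes are added every $cooldownSeconds.
 */
function secondsToFullScale(
    int $minProcesses,
    int $maxProcesses,
    int $maxShift,
    int $cooldownSeconds
): int {
    if ($maxShift < 1) {
        throw new InvalidArgumentException('maxShift must be at least 1');
    }

    $steps = (int) ceil(max(0, $maxProcesses - $minProcesses) / $maxShift);

    return $steps * $cooldownSeconds;
}

echo secondsToFullScale(2, 12, 3, 2), PHP_EOL; // spiky profile:  8
echo secondsToFullScale(2, 12, 1, 5), PHP_EOL; // steady profile: 50
```

On a 2-to-12 supervisor the spiky profile reaches full size in about 8 seconds and the steady one in about 50, which is the trade between reacting to a spike and churning processes.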

Redis as a queue — the gotchas

Redis is excellent as a queue backend. It's also the place where Laravel applications most often run into trouble at scale. Things we configure on the PHP and Redis layer:

Use a separate Redis instance for queues. Sharing cache, session, and queue on one Redis is fine at 10,000 jobs/day. At 1 million jobs/day, a queue backlog spike will evict your cached translations and your sessions, and now your support tickets explode at the same time as your job processing slows. We split.

// config/database.php
'redis' => [
    'cache'    => ['host' => 'redis-cache.internal',    'port' => 6379, 'database' => 0],
    'session'  => ['host' => 'redis-session.internal',  'port' => 6379, 'database' => 0],
    'queue'    => ['host' => 'redis-queue.internal',    'port' => 6379, 'database' => 0],
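    // Horizon reserves the connection name "horizon" for internal use, so the metadata
    // connection gets its own name (a placeholder here), referenced by 'use' in config/horizon.php
    'horizon-meta' => ['host' => 'redis-queue.internal', 'port' => 6379, 'database' => 1],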
],
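The queue connection then has to point at the dedicated instance. A minimal sketch of the matching fragment in config/queue.php follows; the retry_after value of 960 is an assumption chosen to sit above the 900-second timeout on the heavy supervisor, not a figure from our production configs.

```php
<?php
// config/queue.php (fragment)

return [
    'default' => env('QUEUE_CONNECTION', 'redis'),

    'connections' => [
        // This is the 'redis' named by 'connection' in each Horizon supervisor.
        'redis' => [
            'driver'      => 'redis',
            // Name of the Redis connection defined in config/database.php.
            'connection'  => 'queue',
            'queue'       => 'default',
            // Must be longer than the longest worker timeout; otherwise a job
            // that is still running can be released and picked up a second time.
            'retry_after' => 960,
            'block_for'   => null,
        ],
    ],
];
```

If one retry_after is too coarse for both 15-second and 900-second jobs, a second queue connection with its own value is the usual way to separate them.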

Set maxmemory-policy to noeviction on the queue Redis. Stock Redis already defaults to noeviction, but instances set up as caches are often switched to allkeys-lru, which will silently evict pending jobs under memory pressure. We've seen this exactly twice in the wild, and both times the customer's first symptom was "jobs are disappearing." If Redis is going to run out of memory, you want it to refuse new writes and page you, not pretend everything's fine while jobs vanish.

Persistence matters. AOF with appendfsync everysec is our default. RDB-only is faster, but a crash loses everything since the last snapshot, which can be up to 15 minutes of jobs on a sparse snapshot schedule such as save 900 1. For most customers, "the last second of jobs is at risk" is an acceptable trade; "the last 15 minutes" is not.

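Monitor used_memory and used_memory_peak from INFO memory. The peak is a high-water mark, so it shows how close the worst backlog came to maxmemory even if nobody was watching at the time. If it keeps creeping toward maxmemory, you have a backlog that's accumulating faster than it's being drained: workers can't keep up. Scale workers (or fix the job throughput) before peak hits the ceiling.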
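The eviction-policy and headroom checks are easy to automate. The sketch below assumes $info is an associative array keyed by INFO memory field names (used_memory, maxmemory, maxmemory_policy), as a Redis client's INFO reply might provide. The function name and status strings are hypothetical, and the 0.75 default mirrors the 75% alert threshold we use.

```php
<?php
declare(strict_types=1);

/**
 * Classify the queue Redis from INFO memory fields.
 *
 * Returns one of: 'unsafe-policy', 'no-limit', 'warn', 'ok'.
 */
function queueRedisMemoryStatus(array $info, float $warnRatio = 0.75): string
{
    $policy = (string) ($info['maxmemory_policy'] ?? 'unknown');
    if ($policy !== 'noeviction') {
        return 'unsafe-policy'; // pending jobs could be evicted
    }

    $max = (int) ($info['maxmemory'] ?? 0);
    if ($max === 0) {
        return 'no-limit'; // maxmemory unset: no ratio to compute
    }

    $ratio = (int) ($info['used_memory'] ?? 0) / $max;

    return $ratio >= $warnRatio ? 'warn' : 'ok';
}

echo queueRedisMemoryStatus([
    'used_memory'      => 800,
    'maxmemory'        => 1000,
    'maxmemory_policy' => 'noeviction',
]), PHP_EOL; // warn
```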

Retry strategy — the difference between resilience and amplification

Laravel jobs do not retry unless you ask them to. The default is tries=1 (one attempt); anything more comes from --tries on queue:work, tries on the Horizon supervisor, or $tries on the job class. Retrying is fine for transient failures and catastrophic for persistent ones.

The pattern we use:

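use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;

class SyncCustomerToHubspot implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 5;

    public function __construct(public int $customerId)
    {
    }

    // handle() omitted for brevity

    public function backoff(): array
    {
        // Exponential backoff between attempts: 60s, 180s, 540s, 1620s.
        // Attempts beyond the end of the array reuse the last value.
        return [60, 180, 540, 1620];
    }

    public function retryUntil(): \DateTime
    {
        // When retryUntil() is defined, Laravel gives it precedence over $tries.
        return now()->addHours(6);
    }

    public function failed(\Throwable $e): void
    {
        Log::channel('failed_jobs')->error('Hubspot sync exhausted retries', [
            'customer_id' => $this->customerId,
            'exception' => $e->getMessage(),
        ]);
        // FailedJobAlert is an application job that notifies the on-call channel.
        FailedJobAlert::dispatch('hubspot_sync', $this->customerId);
    }
}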

Key elements:

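Exponential backoff. A constant 60-second retry against an upstream having a bad day turns into 5 attempts in about 4 minutes against a still-failing service. Backoff gives the upstream time to recover.
retryUntil() cap. We stop trying 6 hours after dispatch, whatever else is configured. When both are defined, Laravel gives retryUntil() precedence over $tries, so this deadline (not the attempt count) is what finally fails the job, and retries past the end of the backoff array reuse its last interval. A job stuck retrying at hour 18 is almost certainly worth a human looking at.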
An explicit failed() handler. The failed_jobs table is good but invisible to most teams. We always wire a notification — Slack, PagerDuty, an email — on failed().
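To see what a backoff array means in wall-clock terms, the sketch below computes when each attempt starts and how many attempts fit before a deadline. It ignores job runtime and queue wait, and it assumes (as Laravel's worker does, to our knowledge) that attempts beyond the end of the array reuse the last delay. Both function names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Offsets in seconds, relative to the first attempt, at which each
 * attempt starts.
 *
 * @param int[] $backoff delays between consecutive attempts
 * @return int[] one offset per attempt
 */
function attemptOffsets(array $backoff, int $attempts): array
{
    if ($attempts < 1) {
        return [];
    }

    $delays  = array_values($backoff) ?: [0];
    $offsets = [0];

    for ($i = 1; $i < $attempts; $i++) {
        $delay     = $delays[$i - 1] ?? $delays[count($delays) - 1];
        $offsets[] = $offsets[$i - 1] + $delay;
    }

    return $offsets;
}

/**
 * Number of attempts that start on or before $deadlineSeconds when only
 * the deadline (not an attempt count) stops the retries.
 *
 * @param int[] $backoff positive delays between consecutive attempts
 */
function attemptsBeforeDeadline(array $backoff, int $deadlineSeconds): int
{
    if ($backoff === [] || min($backoff) <= 0) {
        throw new InvalidArgumentException('backoff must contain positive delays');
    }

    $delays   = array_values($backoff);
    $attempts = 1; // the first attempt starts at offset 0
    $offset   = 0;

    while (true) {
        // Offset at which the next attempt would start.
        $offset += $delays[$attempts - 1] ?? $delays[count($delays) - 1];
        if ($offset > $deadlineSeconds) {
            return $attempts;
        }
        $attempts++;
    }
}

print_r(attemptOffsets([60, 180, 540, 1620], 5));                       // 0, 60, 240, 780, 2400
echo attemptsBeforeDeadline([60, 180, 540, 1620], 6 * 3600), PHP_EOL;  // 16
```

The five scheduled attempts span 40 minutes, while a 6-hour deadline on its own leaves room for 16. It is worth deciding deliberately which of the two limits should govern a given job.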

For jobs that are not idempotent (taking a payment, sending an SMS), tries should be 1. Better to fail loudly than to charge a customer twice.

Worker memory and `--max-jobs`

Horizon workers are PHP processes. PHP processes leak memory — slowly with well-written code, faster with anything involving GD, Imagick, or large XML parsing. We always set:

'memory' => 256, // MB

The Horizon master process has its own, separate limit, set at the top level of the same file:

'memory_limit' => 64, // MB

A worker that crosses its memory threshold exits after its current job, and Horizon respawns it. We've also occasionally set 'maxJobs' => 1000 on a supervisor (Horizon passes it to the worker as --max-jobs) to force a respawn even before memory growth becomes visible, as a defensive measure on customer codebases we don't fully control.

The trade-off: respawning a worker takes 200-400ms. If you're processing 100ms jobs and respawning every 100 jobs, that's a real efficiency hit. We tune --max-jobs to be high enough that the respawn cost is <1% of throughput.
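The respawn cost is simple to quantify. A small sketch, assuming a fixed respawn time and a uniform job runtime (both simplifications) and using hypothetical function names:

```php
<?php
declare(strict_types=1);

/**
 * Fraction of a worker's wall-clock time spent respawning when it
 * restarts after every $maxJobs jobs.
 */
function respawnOverhead(int $maxJobs, float $jobSeconds, float $respawnSeconds): float
{
    $workSeconds = $maxJobs * $jobSeconds;

    return $respawnSeconds / ($workSeconds + $respawnSeconds);
}

/**
 * Smallest max-jobs value that keeps respawn overhead at or below
 * $targetOverhead (for example 0.01 for 1%).
 */
function minMaxJobsForOverhead(float $jobSeconds, float $respawnSeconds, float $targetOverhead): int
{
    // overhead <= target  <=>  maxJobs >= respawn * (1 - target) / (target * job)
    return (int) ceil($respawnSeconds * (1 - $targetOverhead) / ($targetOverhead * $jobSeconds));
}

printf("%.1f%%\n", 100 * respawnOverhead(100, 0.1, 0.3)); // 2.9%
echo minMaxJobsForOverhead(0.1, 0.4, 0.01), PHP_EOL;
```

With 100ms jobs and a 400ms respawn, the 1% target works out to roughly 400 jobs between restarts, so a setting of 1,000 clears it comfortably.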

Horizon on multiple hosts

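A single Horizon process is a single point of failure. For any customer doing >50,000 jobs/day, we run Horizon on at least two application hosts, both connected to the same Redis. Each job is reserved atomically in Redis, so two instances cooperate naturally and don't duplicate work. Each host's master starts its own copy of every supervisor, though, so the process counts in horizon.php are per host: two hosts with maxProcesses: 12 can run up to 24 workers on that supervisor.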

The deployment caveat: when you deploy, you need to stop Horizon gracefully (horizon:terminate), let in-flight jobs finish, and start the new Horizon. If both instances restart at exactly the same moment, you have a (brief) window with no workers running. We stagger the restart: instance A goes down, comes back up, then instance B does the same. The job queue accumulates briefly on Redis; it drains in seconds.

Things we monitor

For every Laravel customer we run, the alerts that wake us up:

Queue depth > 5,000 for >5 minutes — backlog growing, workers behind
Failed jobs > 50 in 10 minutes — something is broken, not just slow
Horizon master process not running — Supervisor restart failed
Redis used_memory > 75% of maxmemory — accumulation risk
Job wait time p95 > 2x baseline — capacity problem

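These all flow through our managed Laravel monitoring stack. If you're running it yourself, horizon:status and horizon:list cover process health, and most of the rest can be collected by a scheduled command that reads Queue::size() per queue, the failed_jobs table, and Redis INFO, then exports the values to Prometheus or CloudWatch.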
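The thresholds above can be written as one pure function that a scheduled command feeds with collected values. This is a sketch: the metric keys and alert names are hypothetical, and how each value is collected is left to your own stack.

```php
<?php
declare(strict_types=1);

/**
 * Return the names of the alerts that should fire for one metrics sample.
 *
 * @param array<string, int|float|bool> $m sample with the keys used below
 * @return string[]
 */
function firingAlerts(array $m): array
{
    $alerts = [];

    // Queue depth > 5,000 for more than 5 minutes.
    if ($m['queue_depth'] > 5000 && $m['depth_breach_minutes'] > 5) {
        $alerts[] = 'queue_depth';
    }
    // More than 50 failed jobs in the last 10 minutes.
    if ($m['failed_last_10_min'] > 50) {
        $alerts[] = 'failed_jobs';
    }
    // Horizon master process not running.
    if (!$m['horizon_running']) {
        $alerts[] = 'horizon_down';
    }
    // Redis used_memory above 75% of maxmemory (skipped when maxmemory is unset).
    if ($m['redis_maxmemory'] > 0 && $m['redis_used_memory'] / $m['redis_maxmemory'] > 0.75) {
        $alerts[] = 'redis_memory';
    }
    // p95 wait time above twice the baseline.
    if ($m['wait_p95_seconds'] > 2 * $m['wait_p95_baseline_seconds']) {
        $alerts[] = 'wait_time';
    }

    return $alerts;
}

print_r(firingAlerts([
    'queue_depth'               => 7200,
    'depth_breach_minutes'      => 8,
    'failed_last_10_min'        => 3,
    'horizon_running'           => true,
    'redis_used_memory'         => 400,
    'redis_maxmemory'           => 1000,
    'wait_p95_seconds'          => 12.0,
    'wait_p95_baseline_seconds' => 9.0,
])); // only 'queue_depth' fires
```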

What we ship by default

For Laravel customers on AWS, GCP, or DigitalOcean, every queue-using application gets:

Horizon with the four-supervisor split as a starting point, tuned over the first month
Separate Redis instance for queues, with noeviction and AOF persistence
Automated worker memory + respawn config
Failed-job alerts wired to Slack and PagerDuty
A queue-depth dashboard the customer can see
Documented retry policies per job type, reviewed quarterly

If your Laravel app's queue feels like it's running on hope, the failure mode is rarely dramatic — it's slow, ambient, the kind of degradation where emails arrive 30 minutes late on Tuesday and nobody notices until a customer complains on Friday. Get in touch and we'll spend an hour reading your horizon.php and tell you exactly what to change.

Sudhanshu K. is a Senior Site Reliability Engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). She runs Laravel Horizon for customers handling more than 40 million jobs per day across four clouds.