The Node.js memory leak playbook — heap snapshots, clinic.js, and the four patterns we keep finding

The phone call always sounds the same. "Our Node app is restarting every six hours." Or every two hours. Or every twenty minutes, if it's a particularly bad week. The container's OOM killer is doing its job, PM2 or systemd or Kubernetes is dutifully resurrecting the process, and someone has put a sticky note over the alert because "it's a known issue."

It is almost never a known issue. It is almost always one of four patterns we'll get to below. This is the diagnosis playbook we run on managed Node.js workloads when a customer rings us up about a leak.

Step zero: confirm it's actually a leak

A Node process growing in memory is not, by itself, a leak. V8's heap is generational and lazy about GC. A healthy Node app on a 2GB container will frequently sit at 1.4GB for hours, then drop to 600MB after a major GC, then climb again. That's not a leak. That's a working set plus deferred collection.

A real leak looks like this: memory climbs across multiple GC cycles, and the floor it falls back to after each major collection keeps rising instead of returning to the same baseline. The cleanest way to see it is to plot process.memoryUsage() over time:

setInterval(() => {
  const m = process.memoryUsage();
  console.log(JSON.stringify({
    ts: Date.now(),
    rss: m.rss,
    heapUsed: m.heapUsed,
    heapTotal: m.heapTotal,
    external: m.external,
    arrayBuffers: m.arrayBuffers,
  }));
}, 10_000);

Ship that to your logging stack for an hour. If heapUsed after each major GC keeps climbing — not just rss, but heapUsed post-collection — you have a leak. If only rss climbs while heapUsed is flat, you have native or Buffer growth (the third pattern below), which is also a leak but lives in a different place.
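The decision rule above can be written down as a small function over the logged samples. This is a minimal sketch, assuming the samples were taken after a major (or forced) GC; the name classifyGrowth and the 10% threshold are our own illustration, not part of any library.

```javascript
// Hypothetical helper: classify memory growth from post-GC samples.
// Each sample has the shape logged above: { ts, rss, heapUsed, ... }.
function classifyGrowth(samples, threshold = 0.10) {
  if (samples.length < 2) return 'not-enough-data';
  const first = samples[0];
  const last = samples[samples.length - 1];
  // Relative growth between the first and last sample.
  const heapGrowth = (last.heapUsed - first.heapUsed) / first.heapUsed;
  const rssGrowth = (last.rss - first.rss) / first.rss;
  if (heapGrowth > threshold) return 'heap-leak';
  if (rssGrowth > threshold) return 'native-or-buffer-growth';
  return 'no-leak';
}
```

Comparing only the endpoints is deliberately crude. It is enough to separate "heap is growing" from "only rss is growing," which is the distinction that decides where to look next.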

We force a few GCs during the observation by running the process with --expose-gc and calling global.gc() periodically. That removes the ambiguity about whether the heap is high because of garbage or because of retained objects.
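A sketch of that forced-GC sampling, assuming the process was started with --expose-gc. Without the flag, global.gc is undefined and the helper falls back to a plain sample. The name samplePostGc and the one-minute interval are arbitrary choices for illustration.

```javascript
// Hypothetical helper: take a memory sample right after a forced full GC.
// Requires `node --expose-gc app.js`; otherwise it samples without collecting.
function samplePostGc() {
  const gcForced = typeof global.gc === 'function';
  if (gcForced) global.gc();
  const { rss, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return { ts: Date.now(), gcForced, rss, heapUsed, external, arrayBuffers };
}

// Log one post-GC sample per minute. unref() keeps this timer from
// holding the process open during shutdown.
const sampler = setInterval(() => {
  console.log(JSON.stringify(samplePostGc()));
}, 60_000);
sampler.unref();
```

A forced full GC pauses the process, so this belongs in an observation window, not in steady-state production.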

The tools we actually reach for

There is a long list of Node profiling tools. In practice we use four, in this order:

process.memoryUsage() polled into logs — for "is this a leak at all."
--inspect with Chrome DevTools — for taking heap snapshots in a live process and comparing them.
clinic.js (clinic doctor, clinic heapprofiler) — for unattended profiling runs in CI or staging.
v8.writeHeapSnapshot() — for grabbing a snapshot from a running production process without exposing a debug port.

The fourth one is underused. It writes a .heapsnapshot file you can later open in Chrome DevTools, and it works with no inspector, no port exposure, no security headache. We trigger it with SIGUSR2:

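const v8 = require('node:v8');
const path = require('node:path');

// On SIGUSR2, dump a heap snapshot. writeHeapSnapshot() is synchronous and
// blocks the event loop until the file is written.
process.on('SIGUSR2', () => {
  const file = path.join('/tmp', `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  v8.writeHeapSnapshot(file);
  console.log(`heap snapshot written: ${file}`);
});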

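kill -USR2 <pid> and you have a snapshot. Take two, ten minutes apart, with the app under load. Open both in DevTools, use the "Comparison" view, sort by "Delta," and the constructors at the top are usually the leak; the Retainers pane then shows what is holding them. Writing a snapshot needs roughly twice the heap size in memory, so leave headroom under the container limit.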
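If you would rather not send the signal twice by hand, the "take two, ten minutes apart" step can be scripted. A minimal sketch, assuming a writable /tmp; snapshotPair and its options are hypothetical names, and the write parameter exists only so the timing logic can be exercised without producing real snapshot files.

```javascript
const v8 = require('node:v8');
const path = require('node:path');
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical helper: write two snapshots `gapMs` apart and return both paths.
async function snapshotPair({
  dir = '/tmp',
  gapMs = 10 * 60 * 1000,
  write = v8.writeHeapSnapshot, // returns the filename it wrote
} = {}) {
  const name = (label) =>
    path.join(dir, `heap-${process.pid}-${label}-${Date.now()}.heapsnapshot`);
  const before = write(name('before')); // synchronous: blocks the event loop
  await sleep(gapMs);
  const after = write(name('after'));
  return { before, after };
}
```

Load the "before" file first in DevTools, then the "after" file, and compare the second against the first.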

For managed customers running Node on AWS or GCP, we wire this into every container by default. The cost is zero when not used; the value when there's an incident is the difference between "we can diagnose this in an hour" and "let's just bump the memory limit again."

Pattern 1: the unbounded module-level Map

The single most common leak. Somebody, somewhere, wrote this:

const cache = new Map();

async function getUser(id) {
  if (cache.has(id)) return cache.get(id);
  const user = await db.users.findById(id);
  cache.set(id, user); // entries are added here and never removed
  return user;
}

The intent is reasonable. The cache never evicts. Every user that has ever been requested in the lifetime of the process is retained, forever, with all their nested associations. After a week of uptime, the cache contains 800,000 users and the heap snapshot's biggest retainer is a single Map at the module level.

The diagnosis: in the heap snapshot, "Containment" view, look under (GC roots) -> (Global) -> <module>. Anything at module scope that grows is a candidate. Sort by "Retained Size." If a single Map, Object, or Array is retaining tens of megabytes, you have your culprit.

The fix is a bounded cache. Either swap to an LRU library like lru-cache:

const { LRUCache } = require('lru-cache');
const cache = new LRUCache({ max: 5_000, ttl: 1000 * 60 * 5 });

Or, better, move the cache out of the Node process entirely into Redis or Memcached, where eviction is somebody else's problem.
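To make "bounded" concrete, here is a dependency-free sketch of least-recently-used eviction built on Map's insertion order. It is an illustration of the idea only, with a hypothetical BoundedCache name; a maintained library also handles TTLs, size accounting, and edge cases this does not.

```javascript
// Minimal LRU sketch: a Map iterates in insertion order, so the first key
// is always the least recently used one.
class BoundedCache {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // Evict the least recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }

  get size() {
    return this.map.size;
  }
}
```

The point is the invariant, not the implementation: size can never exceed max, so the retained size of the cache has a ceiling you chose.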

Adjacent variant of this same pattern: arrays that are pushed to but never shifted. We've seen const events = []; events.push(event) in an event handler with no consumer. After a million events you have a fifty-megabyte array.
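The array variant has the same cure: give the buffer a ceiling and a consumer. A sketch under the assumption that dropping the oldest events is acceptable; createEventBuffer is a hypothetical name.

```javascript
// Hypothetical bounded buffer: keeps only the newest `limit` events.
function createEventBuffer(limit) {
  const events = [];
  let dropped = 0;
  return {
    push(event) {
      events.push(event);
      if (events.length > limit) {
        events.shift(); // discard the oldest entry
        dropped += 1;
      }
    },
    drain() {
      // Hand everything to the consumer and empty the buffer.
      return events.splice(0, events.length);
    },
    get length() {
      return events.length;
    },
    get dropped() {
      return dropped;
    },
  };
}
```

Exposing the dropped counter matters as much as the cap: a buffer that silently discards data hides the fact that nothing is consuming it.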

Pattern 2: the closure over a request

A subtler one, and harder to spot. A handler captures the request object in a closure that outlives the request:

app.get('/report/:id', async (req, res) => {
  scheduleReportEmail(req.params.id, () => {
    // closure captures req, which in Express also references res (req.res) and the rest of the request context
    sendEmailFor(req.params.id, req.user);
  });
  res.json({ scheduled: true });
});

scheduleReportEmail puts the callback into an in-memory queue. The closure holds req and res, which hold socket buffers, parsed bodies, and the entire request context. Multiply by a few hundred requests an hour and the heap is dominated by request objects that should have been garbage decades ago.

In a heap snapshot, this looks like an outsized count of IncomingMessage and ServerResponse objects. There should be roughly one per concurrent request. If you see ten thousand, you have closures leaking request scope.

The fix is to capture only what you need:

app.get('/report/:id', async (req, res) => {
  const id = req.params.id;
  const userId = req.user.id;
  scheduleReportEmail(id, () => sendEmailFor(id, userId));
  res.json({ scheduled: true });
});

Yes, it's a two-line refactor. Yes, it has cut the heap by around 40% in the cases we've seen.
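One way to make this mistake harder to repeat is to have the queue accept plain data instead of callbacks, so there is no closure to capture anything. A sketch, assuming an in-memory queue; createEmailQueue and its methods are hypothetical stand-ins for whatever scheduleReportEmail does internally.

```javascript
// Hypothetical in-memory queue that stores plain job data, not closures.
function createEmailQueue(sendEmailFor) {
  const jobs = [];
  return {
    schedule(reportId, userId) {
      // Only two small values are retained per pending job.
      jobs.push({ reportId, userId });
    },
    async flush() {
      // Process jobs in arrival order, removing each before sending.
      while (jobs.length > 0) {
        const { reportId, userId } = jobs.shift();
        await sendEmailFor(reportId, userId);
      }
    },
    get pending() {
      return jobs.length;
    },
  };
}
```

With this shape, a handler cannot leak req even by accident, because the queue's API has no place to put a function.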

Pattern 3: external memory — Buffers, sharp, native add-ons

The trickiest leaks live outside the V8 heap, in external and arrayBuffers. process.memoryUsage().heapUsed is steady; rss keeps climbing.

The usual suspects:

sharp for image processing — if you don't call .destroy() or let the pipeline complete, the libvips buffers leak.
pdfkit, puppeteer, playwright — anything that wraps a native engine.
Raw Buffer.alloc() retained in a queue.
pg query results with very large bytea columns held in a connection-level cache.

These mostly don't show up in heap snapshots, because a snapshot covers the V8 heap and not memory allocated by native libraries. The trick is to use clinic doctor, which plots both heap and RSS over time, so the gap between the two is visible. Note that --max-old-space-size caps only the V8 heap: lowering it makes heap leaks fail early, but native growth is bounded by the container limit, not by V8.
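Before reaching for a profiler, two process.memoryUsage() samples already say roughly where the growth sits. A sketch with a hypothetical attributeGrowth helper; it relies on the documented fact that external includes arrayBuffers, and it treats rss minus heap minus external as an approximation, since rss counts only resident pages.

```javascript
// Hypothetical helper: attribute growth between two process.memoryUsage()
// samples to the V8 heap, Buffers, other external memory, or the rest.
function attributeGrowth(before, after) {
  const delta = (key) => after[key] - before[key];
  const buffers = delta('arrayBuffers');
  // `external` already includes `arrayBuffers`, so subtract it out.
  const otherExternal = delta('external') - buffers;
  const heap = delta('heapTotal');
  // Whatever remains is memory V8 does not account for: native
  // allocations, allocator fragmentation, thread stacks, and so on.
  const untracked = delta('rss') - heap - delta('external');
  return { heap, buffers, otherExternal, untracked };
}
```

Large buffers points at retained Buffer or ArrayBuffer instances. Large untracked points at a native add-on, which is where the disable-and-observe approach comes in.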

For native add-on leaks, the diagnostic is empirical: disable the suspect module, see if the leak stops. Then file an issue upstream or pin the previous version. We have a list of versions of sharp, node-canvas, and a few others we've banned from our managed Node.js stack until specific commits land upstream.

Pattern 4: event emitter accumulation

You added a .on('data', ...) listener and never removed it. Or you subscribed to a global EventEmitter from inside a per-request handler. Node helpfully prints a MaxListenersExceededWarning once more than 10 listeners are attached to the same event, but only if you haven't suppressed the warning or raised the limit with setMaxListeners(), which a surprising number of codebases have.

The heap snapshot signature: a single EventEmitter whose _events property holds, under one event name, an array of thousands of identical-looking listener functions. Each listener closes over its own scope, so the retained size is large.
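The same signature can be checked at runtime without a snapshot, using only public EventEmitter methods. A sketch; crowdedEvents is a hypothetical helper name and the default limit of 10 simply mirrors Node's default warning threshold.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: list the events on an emitter whose listener
// count exceeds `limit`.
function crowdedEvents(emitter, limit = 10) {
  return emitter
    .eventNames()
    .map((name) => ({ name, count: emitter.listenerCount(name) }))
    .filter((entry) => entry.count > limit);
}
```

Calling this on long-lived emitters (a shared bus, a database client, process itself) from a debug endpoint or a periodic log line surfaces accumulation even where the warning has been silenced.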

The fix is .off() (or .removeListener()) in the cleanup path — and crucially, the cleanup path needs to run on the error case, not just the happy case. We use AbortSignal-driven cleanup for this:

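const controller = new AbortController();
emitter.on('data', onData);
// EventEmitter's on() takes no options argument, so tie removal to the signal explicitly
controller.signal.addEventListener('abort', () => emitter.off('data', onData), { once: true });
// later, in finally or error handler:
controller.abort();

This puts every listener's removal behind a single abort() call, and it removes a class of leaks entirely. (EventTarget's addEventListener() accepts { signal } directly; EventEmitter's on() does not.)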
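A sketch of signal-driven cleanup wrapped in a reusable helper, assuming a plain EventEmitter, whose on() takes only an event name and a listener. The name onWithSignal is hypothetical.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical helper: attach a listener that is removed when `signal` aborts.
function onWithSignal(emitter, eventName, listener, signal) {
  if (signal.aborted) return; // already cancelled: never attach
  emitter.on(eventName, listener);
  signal.addEventListener(
    'abort',
    () => emitter.off(eventName, listener),
    { once: true },
  );
}

// Usage: one controller cleans up every listener registered with its signal.
const emitter = new EventEmitter();
const controller = new AbortController();
try {
  onWithSignal(emitter, 'data', (chunk) => console.log(chunk), controller.signal);
  onWithSignal(emitter, 'end', () => console.log('done'), controller.signal);
  emitter.emit('data', 'hello');
} finally {
  controller.abort(); // runs on both the success and the error path
}
```

Putting abort() in finally is what covers the error case: the listeners go away whether the work succeeded or threw.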

What we ship by default

When we onboard a Node app onto managed operations, every process gets:

process.memoryUsage() polled to logs every 30 seconds, parsed into a Grafana dashboard
SIGUSR2 handler for on-demand heap snapshots, dumped to a writable scratch volume
A weekly automated clinic doctor run against the staging environment under synthetic load
An alert on "heap_used post-GC has climbed more than 15% over a 6-hour window"
A documented memory budget per service, with --max-old-space-size set to fail loudly rather than silently

The last item is contentious — some teams prefer a high ceiling so the OOM killer is rare. We prefer a low ceiling so leaks surface in staging instead of production. Pick your poison, but pick one.
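Whichever ceiling you pick, it helps to log what V8 actually ended up with, since a flag set in one place and overridden in another is a common surprise. A sketch using v8.getHeapStatistics(); heapBudget is a hypothetical name.

```javascript
const v8 = require('node:v8');

// Report the effective V8 heap ceiling and current usage.
function heapBudget() {
  const stats = v8.getHeapStatistics();
  const toMb = (bytes) => Math.round(bytes / 1024 / 1024);
  return {
    limitMb: toMb(stats.heap_size_limit),
    usedMb: toMb(stats.used_heap_size),
    usedRatio: stats.used_heap_size / stats.heap_size_limit,
  };
}

// Log once at startup so the ceiling is visible next to the memory samples.
console.log(JSON.stringify(heapBudget()));
```

If limitMb at startup does not match the documented budget for the service, the --max-old-space-size setting is not reaching the process.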

The thing that's actually hard

The diagnosis is mechanical once you know the patterns. The hard part is having the tooling in place before you need it. Heap snapshots are dramatically less useful if you only start collecting them after the incident — you have nothing to compare against. The SIGUSR2 handler, the memory-usage logging, the alert thresholds: these need to live in your base image, not be retrofitted at 3am.

If your Node services are restarting every few hours and "it's been like that for a while," the gap is rarely complicated. Get in touch and we'll usually have the leak identified within a couple of working days, and a fix in a PR within a week.

Sudhanshu K. is a Senior Platform Engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). She has been chasing Node memory leaks since 0.10 and still gets a small dopamine hit from a clean sawtooth heap graph.