MySQL backups that actually restore: XtraBackup, binlogs, and the drill we run quarterly

I'll start with the uncomfortable observation: roughly half of the MySQL deployments we audit have a backup strategy that consists of mysqldump --all-databases | gzip > backup.sql.gz running in cron, output going to the same disk the database is on, with no retention rotation and no restore ever tested. The team has the dashboard ticked. The runbook says "we have backups." The auditor was satisfied.

It will not survive a real incident.

This post is the managed MySQL backup setup we deploy by default: Percona XtraBackup for hot physical backups, binary log archival for point-in-time recovery, and a quarterly restore drill that we actually run. Across the customers we manage on AWS, GCP, Azure, and DigitalOcean, this is the shape that has yet to fail us.

Why not mysqldump

mysqldump is a logical backup tool. It works by issuing SELECT statements against every table and writing the results as INSERT statements. Reasons it is wrong for production at scale:

It's slow on dump and slower on restore. A 200GB database can take 4 hours to dump and 12+ hours to restore (because the restore has to rebuild every index from scratch).
It holds long transactions. --single-transaction makes the dump consistent, but the underlying long-running snapshot transaction blocks purge of old InnoDB undo pages, causing the undo log to balloon.
It can't do point-in-time recovery on its own. You can replay binlogs after a dump, but only if you captured the binlog position at the start of the dump (and the binlogs are preserved).
Restore is single-threaded by default. Modern MySQL has mysqlpump and parallelism flags but the throughput ceiling is still well below what hardware can do.

mysqldump is a fine tool for small databases, dev seeding, and schema-only snapshots. It's not a production-grade backup strategy.

What we use: Percona XtraBackup + binlog archival

Percona XtraBackup (now xtrabackup shipped with Percona Server, also works against community MySQL 8.x) does hot physical backups — it copies the InnoDB tablespace files while the database is running, then captures the redo log to get a consistent snapshot at a specific LSN.

Properties that matter:

Non-blocking against InnoDB tables (no global lock, no read pause)
Backup speed limited by I/O — typically 200-500 MB/s on modern SSDs
Restore is basically cp plus a redo-log replay — fast, predictable
Captures the binary log position at backup completion, which is exactly what PITR needs

Combined with continuous binary log archival to object storage, XtraBackup gives us:

Full backup nightly (or however often)
Incremental backups every 4-6 hours (only the changed pages)
Binary logs archived continuously to S3/GCS/Azure Blob/Spaces
Point-in-time recovery to any second since the oldest retained full backup

The backup configuration

# /etc/my.cnf — relevant settings
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin
binlog_format = ROW
binlog_row_image = FULL
binlog_expire_logs_seconds = 604800       # 7 days local retention
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
gtid_mode = ON
enforce_gtid_consistency = ON

binlog_format = ROW is non-negotiable for reliable PITR. Statement-based binlogs replay non-deterministic statements (NOW(), triggers, certain UPDATEs) in ways that don't reproduce the original state.

sync_binlog = 1 makes binlog writes durable per transaction. The performance cost on modern SSD is typically 5-10%; we pay it on every customer.

The nightly backup

A reasonable nightly looks like this:

#!/bin/bash
set -euo pipefail
DATE=$(date -u +%Y-%m-%dT%H-%M-%SZ)
BACKUP_DIR=/var/backups/mysql/$DATE
S3_BUCKET=s3://customer-mysql-backups/prod
 
xtrabackup --backup \
  --user=xtrabackup_user \
  --target-dir=$BACKUP_DIR \
  --parallel=4 \
  --compress \
  --compress-threads=4 \
  --slave-info \
  --safe-slave-backup
 
xtrabackup --prepare --target-dir=$BACKUP_DIR
 
aws s3 sync $BACKUP_DIR $S3_BUCKET/$DATE/ \
  --storage-class=STANDARD_IA \
  --sse=AES256
 
# Verify the upload
aws s3 ls --recursive $S3_BUCKET/$DATE/ | wc -l
rm -rf $BACKUP_DIR

In practice we wrap this with mariabackup or xtrabackup's --stream=xbstream piped directly to aws s3 cp to avoid the intermediate local copy on big databases. We also enable server-side encryption (or client-side, depending on the customer's compliance posture).

The --prepare step applies the captured redo log to make the backup consistent. A non-prepared backup is not restorable. We've seen "backups" running for years where the prepare step was forgotten — the files are there, they look like a database, but they cannot be started.

Binlog archival

The binlogs are what makes PITR possible. Local retention of 7 days protects against short-window issues; offsite archival to object storage gives you the long-window recovery.

# Run every 5 minutes
mysqlbinlog --read-from-remote-server \
  --raw \
  --host=localhost \
  --result-file=/var/log/mysql-archive/ \
  --to-last-log \
  --stop-never \
  $(get_oldest_unarchived_binlog)
 
aws s3 sync /var/log/mysql-archive/ \
  $S3_BUCKET/binlogs/ \
  --exclude "*" \
  --include "mysql-bin.*"

A cleaner shape uses MySQL 8.x's binlog file rotation event combined with a FLUSH BINARY LOGS hook to ensure that any binlog file we ship to S3 is closed (and therefore stable) before we upload it. The pitfall to avoid: uploading the active binlog file. It's mid-write; the copy is corrupt.

The restore drill (quarterly, scripted)

This is where most setups break, because nobody has ever run it. The drill is a roughly 90-minute exercise that costs about $2 of cloud compute and answers the question "can we actually recover?"

Step 1: Pick a target time (5 min)

DRILL_TARGET="2026-05-18 14:23:00 UTC"

We pick a timestamp from yesterday during business hours, when there were active writes. The drill must exercise both base restore and binlog replay.

Step 2: Provision a clean host (10 min)

Same MySQL version, same OS, same instance type. Different VPC if possible. For our DigitalOcean customers this is a Terraform one-liner that brings up a fresh droplet, runs the drill, and destroys it.

Step 3: Find the right base backup (2 min)

aws s3 ls s3://customer-mysql-backups/prod/ | \
  awk '{print $2}' | sort -r | head -20
# Pick the most recent backup BEFORE the target time

Step 4: Restore the base (20-40 min — longest step)

aws s3 sync s3://customer-mysql-backups/prod/2026-05-18T03-00-00Z/ \
  /var/lib/mysql-restore/
 
xtrabackup --decompress --target-dir=/var/lib/mysql-restore/
xtrabackup --prepare --target-dir=/var/lib/mysql-restore/
 
sudo systemctl stop mysql
sudo rsync -a /var/lib/mysql-restore/ /var/lib/mysql/
sudo chown -R mysql:mysql /var/lib/mysql/
sudo systemctl start mysql

The MySQL data directory must be empty before restore. The single most common drill failure we see is people restoring on top of an existing data directory and getting silently corrupted state.

Step 5: Find the binlog position (2 min)

cat /var/lib/mysql/xtrabackup_binlog_info
# mysql-bin.000312    18472    abc-123-uuid:1-9876543

That UUID:N-M is the GTID set at backup completion. Everything after that GTID needs to be replayed from the archived binlogs.

Step 6: Replay binlogs to the target time (15-30 min)

mysqlbinlog \
  --start-position=18472 \
  --stop-datetime="2026-05-18 14:23:00" \
  /var/log/mysql-archive/mysql-bin.000312 \
  /var/log/mysql-archive/mysql-bin.000313 \
  /var/log/mysql-archive/mysql-bin.000314 \
  | mysql -u root

For GTID-based replication, use --exclude-gtids for any GTIDs already present in the restored base.

This step is where you discover that your binlog archive has gaps. The fix is to alert on binlog archival latency in production — we alert at "newest archived binlog older than 10 minutes" so gaps are caught immediately, not at drill time.

Step 7: Verify (10 min)

SELECT NOW();
SELECT MAX(created_at) FROM orders;
SELECT COUNT(*) FROM users WHERE created_at < '2026-05-18 14:23:00';
SHOW SLAVE STATUS\G   -- if applicable
SELECT @@gtid_executed;

The MAX(created_at) should be at or just before the drill target. The gtid_executed set should match what you expected. If either is off, the restore didn't actually work — investigate before tearing down.

Step 8: Application smoke tests (10 min)

This is the difference between "MySQL started up" and "the database is actually usable." Connect with the application's credentials, run the standard read queries the app makes at boot, check that views, stored procedures, and triggers are intact. Verify that any encrypted columns can be decrypted (the master key has to be available).

Step 9: Teardown and document (10 min)

Destroy the drill host. Write a short report covering measured restore time per phase, any errors, and any drift between drilled data and a production snapshot. Every drill is data — the trends matter as much as any single result.

What the drill finds, in practice

The findings are unfailingly the same across the dozens of managed MySQL deployments we audit:

Backups aren't prepared. XtraBackup without --prepare is not restorable. Found in roughly 1 in 5 setups we inherit.
Binlog gaps. Archival cron failed three months ago, nobody noticed, the gap is now baked into the recovery story.
Encryption keys not backed up. Restore succeeds, encrypted columns are gibberish, the application can't function.
The grants table wasn't part of the backup. XtraBackup grabs mysql.user if you include it, but custom setups sometimes exclude system schemas. Restore comes up, application can't authenticate.
Backups too old to be useful. Local retention is fine, but the offsite copy is on a 7-day rotation, and the breach happened 14 days ago. Restore is to a known-compromised state.
Restore time is much longer than assumed. Customer thought "we can restore in an hour"; the actual drill measured 4.5 hours for a 600GB database. That's the real RTO, and it's worth knowing.

Cost and retention

For a typical 200GB MySQL database with moderate write volume, our default retention is:

14 days of daily full backups
13 weekly fulls (one per week, rotating)
12 monthly fulls (one per month, rotating to Glacier/Coldline at month 2)
All binlogs for the duration of the longest full backup retention

Storage cost works out to roughly $15-25/month for a 200GB database on S3 with Intelligent Tiering. The compute cost of the drill is negligible compared to the cost of being wrong about it.

What we ship by default

For every managed MySQL customer:

Percona XtraBackup nightly full + 4-hourly incremental, prepared, encrypted, shipped to object storage
Binlog archival every 5 minutes with monitoring on archival lag
Drilled quarterly with a written report
A documented RPO (typically 5 minutes) and RTO (typically 30-60 minutes for under-100GB, scaling from there)
A second-region async replica for the customers who need geographic redundancy on top of restorability

If your MySQL backup strategy has never been tested with a real restore, you do not have a backup strategy — you have backup files. Reach out and we'll run the drill for you. The output is a one-page report and a list of fixes, usually small, that turn "we have backups" into "we have backups we have restored from."

Sudhanshu K. is a Staff DevOps engineer at EdgeServers (RemotIQ Pty Ltd, ABN 91 682 628 128). He has run XtraBackup restores in anger six times across three different customers and has the muscle memory for it.