Backup and Recovery Policy

Effective Date: 2026-03-02
Last Review: 2026-03-02
Next Review: 2026-09-02
Owner: Greg Felice, Project Lead

1. Purpose

This policy defines backup procedures, recovery objectives, and testing requirements for all tomo data stores. It ensures that data can be restored within the defined recovery time and recovery point objectives following any failure, corruption, or security incident.

2. Scope

Applies to:

PostgreSQL databases — tomo development and hosted service instances (PG18 with AGE, pgvector)
Infrastructure configuration — Ansible playbooks, nginx configs, Authentik configuration
CI/CD state — Forgejo repositories, Woodpecker pipeline configurations
Monitoring data — Grafana dashboards and alert rules, Prometheus data
Secrets — encrypted credential backups

3. Recovery Objectives

Objective	Target	Notes
Recovery Time Objective (RTO)	4 hours	Time from incident declaration to service restoration
Recovery Point Objective (RPO)	1 hour	Maximum acceptable data loss, achieved via WAL archiving

These targets apply to the hosted service. Development databases have relaxed targets (RTO: 24h, RPO: 24h).

4. Backup Architecture

4.1 PostgreSQL Backup Strategy

The backup strategy uses a three-tier approach:

Tier	Method	Frequency	Retention	Purpose
Tier 1	WAL archiving	Continuous	7 days local, 30 days offsite	Point-in-time recovery within RPO
Tier 2	`pg_basebackup`	Daily at 02:00 UTC	30 days local, 90 days offsite	Full cluster recovery
Tier 3	`pg_dump` (logical)	Weekly (Sunday 03:00 UTC)	30 days local, 90 days offsite	Cross-version restore, selective recovery

4.2 WAL Archiving Configuration

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
archive_timeout = 300   # force archive every 5 minutes (worst-case RPO)

WAL segments are:

Written to local backup directory on a separate filesystem
Compressed with zstd before offsite transfer
Uploaded to Backblaze B2 offsite storage within 15 minutes of archival
Retained for 7 days locally, 30 days offsite
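
The handling described above could be scripted roughly as follows. This is a minimal sketch, not the deployed job: the staging directory `/backup/wal-stage`, the function name `ship_wal`, the variable names, and the compression level are assumptions made for illustration. The `b2:tomo-backups/wal` path matches the one used in the offsite restore procedure.

```shell
#!/bin/bash
# Sketch only: compress archived WAL segments and copy them offsite.
# WAL_STAGE_DIR, WAL_REMOTE and ship_wal are hypothetical names, not part of the policy.

WAL_DIR="${WAL_DIR:-/backup/wal}"
WAL_STAGE_DIR="${WAL_STAGE_DIR:-/backup/wal-stage}"
WAL_REMOTE="${WAL_REMOTE:-b2:tomo-backups/wal}"

ship_wal() {
  local seg name
  mkdir -p "$WAL_STAGE_DIR" || return 1
  for seg in "$WAL_DIR"/*; do
    [ -f "$seg" ] || continue
    name=$(basename "$seg")
    # Skip segments already compressed on an earlier run.
    [ -f "$WAL_STAGE_DIR/$name.zst" ] && continue
    # Compress to a temporary name, then rename, so a partial file is never uploaded.
    zstd -q -6 "$seg" -o "$WAL_STAGE_DIR/$name.zst.tmp" || return 1
    mv "$WAL_STAGE_DIR/$name.zst.tmp" "$WAL_STAGE_DIR/$name.zst" || return 1
  done
  # rclone copy only transfers files that are missing or changed on the remote.
  rclone copy "$WAL_STAGE_DIR" "$WAL_REMOTE" --exclude '*.tmp'
}
```

Calling `ship_wal` from a timer that fires at least every 15 minutes would be one way to meet the upload window. A production version would also need to avoid picking up a segment that `archive_command` is still copying.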

4.3 Daily Full Backup (`pg_basebackup`)

#!/bin/bash
# /opt/tomo/scripts/backup-full.sh
# Runs daily via systemd timer at 02:00 UTC

set -euo pipefail

# Compute the date stamp once so every path and label agrees
STAMP="$(date -u +%Y%m%d)"
BACKUP_DIR="/backup/pgbase/$STAMP"

# Send the alert email and exit non-zero so monitoring sees the failure
fail() {
  echo "BACKUP FAILED: $1" | \
    mail -s "[ALERT] tomo backup failure" security@rizlabs.com
  exit 1
}

mkdir -p "$BACKUP_DIR"

pg_basebackup \
  --host=localhost \
  --port=5432 \
  --username=tomo_admin \
  --pgdata="$BACKUP_DIR" \
  --format=tar \
  --compress=zstd:6 \
  --checkpoint=fast \
  --wal-method=stream \
  --label="tomo-daily-$STAMP" \
  || fail "pg_basebackup exited with status $?"

# Verify the base archive exists and is non-empty
[ -s "$BACKUP_DIR/base.tar.zst" ] || fail "base.tar.zst missing or empty"

# Upload to offsite storage
rclone copy "$BACKUP_DIR" "b2:tomo-backups/pgbase/$STAMP" \
  --b2-hard-delete \
  --log-file=/var/log/tomo/backup-upload.log \
  || fail "offsite upload failed"

4.4 Weekly Logical Backup (`pg_dump`)

#!/bin/bash
# /opt/tomo/scripts/backup-logical.sh
# Runs weekly via systemd timer (Sunday 03:00 UTC)

# pipefail makes a pg_dump failure fail the whole pipeline, not only a zstd failure
set -euo pipefail

DUMP_DIR="/backup/pgdump"
DUMP_FILE="$DUMP_DIR/tomo-$(date -u +%Y%m%d).sql.zst"
mkdir -p "$DUMP_DIR"

pg_dump \
  --host=localhost \
  --port=5432 \
  --username=tomo_admin \
  --dbname=tomo \
  --format=plain \
  --no-owner \
  --no-privileges \
  | zstd -6 > "$DUMP_FILE"

# Upload to offsite storage
rclone copy "$DUMP_FILE" b2:tomo-backups/pgdump/ \
  --b2-hard-delete \
  --log-file=/var/log/tomo/backup-upload.log

4.5 Offsite Storage

Property	Value
Provider	Backblaze B2 (S3-compatible API)
Bucket	`tomo-backups`
Encryption	AES-256 server-side encryption (B2 SSE) + client-side encryption via rclone `crypt` remote
Access	Application key scoped to a single bucket, write-only; objects are deleted only by the bucket lifecycle rule
Lifecycle	Auto-delete after 90 days
Region	US West

4.6 Infrastructure Configuration Backup

Component	Method	Frequency	Storage
Ansible playbooks	Git (Forgejo)	On every change	Forgejo + offsite mirror
nginx configuration	Git (Forgejo)	On every change	Forgejo + offsite mirror
Authentik configuration	Export via API	Weekly	B2 encrypted
Grafana dashboards	Provisioned from Git	On every change	Forgejo + offsite mirror
Woodpecker configs	`.woodpecker.yml` in repo	On every change	Forgejo + offsite mirror
TLS certificates	Let's Encrypt auto-renewal	N/A (renewable)	Private keys in encrypted backup

5. Encryption Requirements

Stage	Method
In transit to offsite	TLS 1.2+ (B2 API over HTTPS)
At rest (offsite)	AES-256 (B2 SSE + rclone crypt overlay)
At rest (local)	LUKS volume encryption on backup filesystem
Encryption key storage	Stored separately from backups; documented in sealed envelope (offline) and in password manager

6. Restore Procedures

6.1 Full Cluster Restore from `pg_basebackup`

# 1. Stop PostgreSQL
sudo systemctl stop postgresql@18-main

# 2. Move corrupted data directory
sudo mv /var/lib/postgresql/18/main /var/lib/postgresql/18/main.corrupted

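# 3. Restore base backup and the WAL streamed with it
#    (PostgreSQL requires mode 0700 or 0750 on the data directory)
sudo mkdir -m 700 /var/lib/postgresql/18/main
sudo tar -xf /backup/pgbase/YYYYMMDD/base.tar.zst \
  -C /var/lib/postgresql/18/main --zstd
sudo tar -xf /backup/pgbase/YYYYMMDD/pg_wal.tar \
  -C /var/lib/postgresql/18/main/pg_wal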

# 4. Configure recovery for PITR
#    (append as root so settings restored from the backup are kept)
sudo tee -a /var/lib/postgresql/18/main/postgresql.auto.conf > /dev/null <<EOF
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = 'YYYY-MM-DD HH:MM:SS UTC'
recovery_target_action = 'promote'
EOF

# 5. Create recovery signal
sudo touch /var/lib/postgresql/18/main/recovery.signal

# 6. Set ownership and start
sudo chown -R postgres:postgres /var/lib/postgresql/18/main
sudo systemctl start postgresql@18-main

# 7. Monitor recovery in logs
sudo journalctl -u postgresql@18-main -f

# 8. Verify recovery (same validation query as the weekly restore test)
psql -p 5432 -d tomo -c "SELECT count(*) FROM ag_catalog.ag_graph;"

6.2 Selective Restore from `pg_dump`

# Decompress the logical backup
zstd -d tomo-YYYYMMDD.sql.zst -o tomo-restore.sql

# Restore the entire database into a fresh scratch database
createdb -p 5432 tomo_restore
psql -p 5432 -d tomo_restore -v ON_ERROR_STOP=1 -f tomo-restore.sql

# Restore specific objects: extract the relevant statements from the dump and apply them manually

6.3 WAL-Only Recovery (within RPO window)

# For recovery of recent data without full restore:
# 1. Ensure base backup is in place
# 2. Configure restore_command to point to WAL archive
# 3. Set recovery_target_time to desired point
# 4. Start PostgreSQL — it will replay WAL from the base backup forward
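
Steps 2 and 3 above could be wrapped in a small helper along these lines. This is a sketch under stated assumptions: `configure_pitr` is a hypothetical function name, and the default archive path `/backup/wal` is taken from the archiving configuration in Section 4.2.

```shell
#!/bin/bash
# Sketch only: write PITR settings into a restored data directory.
# configure_pitr is a hypothetical helper name, not an existing tomo script.

configure_pitr() {
  local datadir="$1" target_time="$2" wal_archive="${3:-/backup/wal}"
  [ -d "$datadir" ] || { echo "no such data directory: $datadir" >&2; return 1; }
  # Append so settings already present in postgresql.auto.conf are preserved.
  cat >> "$datadir/postgresql.auto.conf" <<EOF
restore_command = 'cp $wal_archive/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  # recovery.signal tells PostgreSQL to enter targeted recovery on startup.
  touch "$datadir/recovery.signal"
}
```

It would be run as a user that can write to the data directory, after the base backup is in place and before PostgreSQL is started, for example `configure_pitr /var/lib/postgresql/18/main '2026-03-02 10:15:00 UTC'`.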

6.4 Offsite Restore (when local backups are unavailable)

# 1. Download from B2
rclone copy b2:tomo-backups/pgbase/YYYYMMDD /backup/pgbase/YYYYMMDD

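# 2. Download WAL segments for PITR and decompress them
#    (segments are zstd-compressed before offsite transfer; a .zst suffix is assumed here)
rclone copy b2:tomo-backups/wal/ /backup/wal/
zstd -d --rm /backup/wal/*.zst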

# 3. Follow full cluster restore procedure (6.1)

7. Restore Testing

7.1 Weekly Automated Restore Test

A weekly automated test validates that backups can be restored successfully.

#!/bin/bash
# /opt/tomo/scripts/backup-test.sh
# Runs weekly via systemd timer (Wednesday 04:00 UTC)

TEST_DIR="/backup/restore-test"
TEST_PORT=5434
LATEST_BACKUP=$(ls -td /backup/pgbase/*/ | head -1)

# 1. Restore to test directory, including the WAL streamed with the backup
rm -rf "$TEST_DIR"
mkdir -p "$TEST_DIR"
tar -xf "$LATEST_BACKUP/base.tar.zst" -C "$TEST_DIR" --zstd
tar -xf "$LATEST_BACKUP/pg_wal.tar" -C "$TEST_DIR/pg_wal"
chmod 700 "$TEST_DIR"   # PostgreSQL requires mode 0700 or 0750 on the data directory

# 2. Start temporary PostgreSQL instance (archiving off so it cannot write to /backup/wal)
pg_ctl -D "$TEST_DIR" -o "-p $TEST_PORT -c archive_mode=off" -l "$TEST_DIR/test.log" start

# 3. Run validation queries
if RESULT=$(psql -p "$TEST_PORT" -d tomo -t -c "SELECT count(*) FROM ag_catalog.ag_graph;" 2>&1); then
  PASSED=true
else
  PASSED=false
  echo "RESTORE TEST FAILED: $RESULT" | \
    mail -s "[ALERT] tomo restore test failure" security@rizlabs.com
fi

# 4. Stop and clean up
pg_ctl -D "$TEST_DIR" stop
rm -rf "$TEST_DIR"

# 5. Log result
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | restore-test | backup=$LATEST_BACKUP | passed=$PASSED" \
  >> /var/log/tomo/restore-tests.log

# 6. Exit non-zero on failure so the weekly restore-test alert can fire
[ "$PASSED" = true ] || exit 1

7.2 Test Documentation

Each restore test (automated or manual) is logged with:

Field	Description
Date	When the test was performed
Backup source	Which backup was restored (date, tier)
Restore type	Full cluster, logical, PITR
Duration	Time from start to verified restore
Result	Pass or fail, with details on failure
Tester	Person or `automated`

Test results are retained for 12 months in /var/log/tomo/restore-tests.log and summarized in quarterly compliance reviews.
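
A log line that carries every field in the table above might look like the sketch below. The function name `log_restore_test`, the `RESTORE_LOG` variable, and the key names (`type`, `duration_s`, `result`, `tester`) are assumptions for illustration; the log path and the leading `restore-test` tag follow the weekly test script.

```shell
#!/bin/bash
# Sketch only: one log line per restore test, covering all documented fields.
# log_restore_test and the key names are hypothetical.

RESTORE_LOG="${RESTORE_LOG:-/var/log/tomo/restore-tests.log}"

log_restore_test() {
  local backup="$1" kind="$2" duration_s="$3" result="$4" tester="${5:-automated}"
  printf '%s | restore-test | backup=%s | type=%s | duration_s=%s | result=%s | tester=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$backup" "$kind" "$duration_s" "$result" "$tester" \
    >> "$RESTORE_LOG"
}

# Example: time a restore with bash's SECONDS counter.
# SECONDS=0
# ...restore and validate...
# log_restore_test /backup/pgbase/20260302 full-cluster "$SECONDS" pass
```

Keeping the fields as `key=value` pairs on a single line leaves the file easy to grep when preparing the quarterly summary.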

7.3 Annual Full Recovery Drill

Once per year, perform a complete disaster recovery simulation:

Treat local backups as unavailable
Restore entirely from offsite (B2)
Validate data integrity against known checksums
Measure actual RTO and RPO achieved
Document findings and update this policy if targets are not met
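
The checksum validation step could be supported by a file-level manifest written at backup time and verified after the offsite download. The sketch below assumes GNU coreutils; `write_manifest`, `verify_manifest`, and the `SHA256SUMS` file name are hypothetical, and the policy does not say whether its "known checksums" are file-level or data-level.

```shell
#!/bin/bash
# Sketch only: record and verify SHA-256 checksums for the files in a backup directory.
# write_manifest, verify_manifest and SHA256SUMS are hypothetical names.

write_manifest() {
  # Record a checksum for every regular file except the manifest itself.
  ( cd "$1" && find . -type f ! -name SHA256SUMS -print0 \
      | sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

verify_manifest() {
  # Exit status is non-zero if any listed file is missing or its checksum differs.
  ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

In a drill, `write_manifest` would run before upload and `verify_manifest` on the directory restored from B2, so a mismatch points at corruption in storage or transfer.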

8. Monitoring and Alerting

Check	Frequency	Alert Condition
Backup job completion	Daily	Backup script exits non-zero
Backup file size	Daily	Base backup < 80% of previous size (possible truncation)
WAL archiving lag	Every 5 minutes	`last_archived_time` in `pg_stat_archiver` older than 10 minutes
Offsite upload	Daily	rclone transfer failure
Restore test	Weekly	Test script exits non-zero
Backup disk space	Hourly	Backup volume > 80% full

Alerts are routed through Grafana to email and webhook.
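
Two of the checks in the table can be expressed as small shell functions, sketched below. The function names and the `postgres` database used for the connection are assumptions; the 80% and 10-minute thresholds come from the table, and `pg_stat_archiver` is the standard PostgreSQL statistics view for the archiver.

```shell
#!/bin/bash
# Sketch only: threshold checks for backup size and WAL archiving lag.
# Function names are hypothetical; thresholds come from the monitoring table.

# Succeeds when the current backup is at least 80% of the previous size (both in bytes).
backup_size_ok() {
  local prev="$1" cur="$2"
  [ $(( cur * 100 )) -ge $(( prev * 80 )) ]
}

# Prints seconds since the last successful WAL archive, or -1 if nothing has been archived.
wal_archive_lag_seconds() {
  psql -X -At -d postgres -c \
    "SELECT COALESCE(floor(extract(epoch FROM now() - last_archived_time)), -1)::bigint FROM pg_stat_archiver;"
}

# Succeeds when the lag is known and no more than 10 minutes.
wal_lag_ok() {
  local lag="$1"
  [ "$lag" -ge 0 ] && [ "$lag" -le 600 ]
}
```

A check job could call `wal_lag_ok "$(wal_archive_lag_seconds)"` and exit non-zero when it fails, which keeps the alert condition in one place.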

9. Retention Schedule

Backup Type	Local Retention	Offsite Retention
WAL segments	7 days	30 days
`pg_basebackup` (daily)	30 days	90 days
`pg_dump` (weekly)	30 days	90 days
Infrastructure config	Indefinite (Git)	Indefinite (Git mirror)
Restore test logs	12 months	12 months
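
Local retention could be enforced with a scheduled `find`-based cleanup such as the sketch below. `prune_older_than` is a hypothetical helper, and the directory paths are those used by the backup scripts; offsite retention is handled separately by the B2 lifecycle rule.

```shell
#!/bin/bash
# Sketch only: delete local backup entries older than a retention period.
# prune_older_than is a hypothetical helper name.

prune_older_than() {
  local dir="$1" days="$2"
  [ -d "$dir" ] || return 0
  # -mtime +N matches entries whose age, in whole days, is greater than N.
  find "$dir" -mindepth 1 -maxdepth 1 -mtime +"$days" -exec rm -rf -- {} +
}

# Example wiring for the local retention column:
# prune_older_than /backup/wal    7
# prune_older_than /backup/pgbase 30
# prune_older_than /backup/pgdump 30
```

Because `-mtime` truncates to whole days, an entry is removed slightly after the stated period rather than before it, which errs on the side of keeping data.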

10. Compliance Mapping

SOC 2 Criteria	Control
A1.2	Recovery procedures and backup processes
A1.3	Recovery testing and validation
CC7.5	Recovery from identified security incidents
CC6.1	Protection of backup data (encryption, access control)