Backup and Recovery Policy

Effective Date: 2026-03-02
Last Review: 2026-03-02
Next Review: 2026-09-02
Owner: Greg Felice, Project Lead

1. Purpose

This policy defines backup procedures, recovery objectives, and testing requirements for all tomo data stores. It ensures that data can be restored within the defined recovery time and recovery point objectives following any failure, corruption, or security incident.

2. Scope

Applies to:

  • PostgreSQL databases — tomo development and hosted service instances (PG18 with AGE, pgvector)
  • Infrastructure configuration — Ansible playbooks, nginx configs, Authentik configuration
  • CI/CD state — Forgejo repositories, Woodpecker pipeline configurations
  • Monitoring data — Grafana dashboards and alert rules, Prometheus data
  • Secrets — encrypted credential backups

3. Recovery Objectives

| Objective | Target | Notes |
|---|---|---|
| Recovery Time Objective (RTO) | 4 hours | Time from incident declaration to service restoration |
| Recovery Point Objective (RPO) | 1 hour | Maximum acceptable data loss, achieved via WAL archiving |

These targets apply to the hosted service. Development databases have relaxed targets (RTO: 24h, RPO: 24h).

4. Backup Architecture

4.1 PostgreSQL Backup Strategy

The backup strategy uses a three-tier approach:

| Tier | Method | Frequency | Retention | Purpose |
|---|---|---|---|---|
| Tier 1 | WAL archiving | Continuous | 7 days local, 30 days offsite | Point-in-time recovery within RPO |
| Tier 2 | pg_basebackup | Daily at 02:00 UTC | 30 days local, 90 days offsite | Full cluster recovery |
| Tier 3 | pg_dump (logical) | Weekly (Sunday 03:00 UTC) | 30 days local, 90 days offsite | Cross-version restore, selective recovery |

4.2 WAL Archiving Configuration

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
archive_timeout = 300   # force archive every 5 minutes (worst-case RPO)

WAL segments are:

  1. Written to a local backup directory on a separate filesystem
  2. Compressed with zstd before offsite transfer
  3. Uploaded to Backblaze B2 storage within 15 minutes of archival
  4. Retained for 7 days locally, 30 days offsite
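The timing figures above bound how stale the offsite WAL copy can become. As a rough sanity check, the sketch below adds the archive_timeout from 4.2 to the 15-minute upload window and compares the sum with the 1 hour RPO. It assumes the two delays simply add; the function names are hypothetical and not part of any tomo script.

```shell
#!/bin/bash
# Worst-case age of the newest offsite WAL segment, assuming the
# archive_timeout and the upload delay simply add up.
offsite_wal_lag_seconds() {
  local archive_timeout_s=$1   # archive_timeout from postgresql.conf, in seconds
  local upload_delay_min=$2    # upload window after archival, in minutes
  echo $(( archive_timeout_s + upload_delay_min * 60 ))
}

# Succeeds when the worst-case lag fits inside the RPO (given in minutes).
within_rpo() {
  local lag_s=$1 rpo_min=$2
  [ "$lag_s" -le $(( rpo_min * 60 )) ]
}

lag=$(offsite_wal_lag_seconds 300 15)
echo "worst-case offsite lag: ${lag}s"
within_rpo "$lag" 60 && echo "within the 1 hour RPO"
```

With the values in this policy the worst case is 1200 seconds (20 minutes), which leaves headroom under the 60-minute RPO for a delayed or retried upload.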

4.3 Daily Full Backup (pg_basebackup)

#!/bin/bash
# /opt/tomo/scripts/backup-full.sh
# Runs daily via systemd timer at 02:00 UTC

set -euo pipefail

# Compute the date stamp once so every path and label agrees
STAMP="$(date +%Y%m%d)"
BACKUP_DIR="/backup/pgbase/$STAMP"
mkdir -p "$BACKUP_DIR"

# Send an alert and exit non-zero so monitoring sees the failure
fail() {
  echo "BACKUP FAILED: $1" | \
    mail -s "[ALERT] tomo backup failure" security@rizlabs.com
  exit 1
}

pg_basebackup \
  --host=localhost \
  --port=5432 \
  --username=tomo_admin \
  --pgdata="$BACKUP_DIR" \
  --format=tar \
  --compress=zstd:6 \
  --checkpoint=fast \
  --wal-method=stream \
  --label="tomo-daily-$STAMP" \
  || fail "pg_basebackup exited non-zero"

# Verify the base archive exists and is non-empty
if [ ! -s "$BACKUP_DIR/base.tar.zst" ]; then
  fail "base.tar.zst missing or empty"
fi

# Upload to offsite storage (copy only adds files; the write-only key in 4.5 cannot delete)
rclone copy "$BACKUP_DIR" "b2:tomo-backups/pgbase/$STAMP" \
  --log-file=/var/log/tomo/backup-upload.log \
  || fail "offsite upload failed"

4.4 Weekly Logical Backup (pg_dump)

#!/bin/bash
# /opt/tomo/scripts/backup-logical.sh
# Runs weekly via systemd timer (Sunday 03:00 UTC)

# pipefail makes a pg_dump failure fail the whole pipeline, not only a zstd failure
set -euo pipefail

DUMP_FILE="/backup/pgdump/tomo-$(date +%Y%m%d).sql.zst"

pg_dump \
  --host=localhost \
  --port=5432 \
  --username=tomo_admin \
  --dbname=tomo \
  --format=plain \
  --no-owner \
  --no-privileges \
  | zstd -6 > "$DUMP_FILE"

# Upload to offsite storage (copy only adds files; the write-only key in 4.5 cannot delete)
rclone copy "$DUMP_FILE" b2:tomo-backups/pgdump/ \
  --log-file=/var/log/tomo/backup-upload.log
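A pipeline such as pg_dump piped into zstd reports only the exit status of its last command unless the shell's pipefail option is set, so a failed dump can still leave a small but valid-looking .zst file. The sketch below illustrates the difference with stand-in commands: `false` plays the failing pg_dump and `cat` plays zstd. The function name is hypothetical.

```shell
#!/bin/bash
# Run a two-stage pipeline whose first stage fails, with or without pipefail.
# `false` stands in for a failing pg_dump; `cat` stands in for zstd.
pipeline_status() {
  (
    if [ "$1" = "pipefail" ]; then set -o pipefail; else set +o pipefail; fi
    if false | cat > /dev/null; then echo "success"; else echo "failure"; fi
  )
}

echo "default:  $(pipeline_status default)"
echo "pipefail: $(pipeline_status pipefail)"
```

Without pipefail the pipeline is reported as successful even though its first stage failed, which is why a backup script that pipes a dump through a compressor usually enables it.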

4.5 Offsite Storage

| Property | Value |
|---|---|
| Provider | Backblaze B2 (S3-compatible API) |
| Bucket | tomo-backups |
| Encryption | AES-256 server-side encryption (B2 SSE) + client-side encryption via rclone crypt remote |
| Access | Application key scoped to single bucket, write-only (no delete without lifecycle rule) |
| Lifecycle | Auto-delete after 90 days |
| Region | US West |

4.6 Infrastructure Configuration Backup

| Component | Method | Frequency | Storage |
|---|---|---|---|
| Ansible playbooks | Git (Forgejo) | On every change | Forgejo + offsite mirror |
| nginx configuration | Git (Forgejo) | On every change | Forgejo + offsite mirror |
| Authentik configuration | Export via API | Weekly | B2 encrypted |
| Grafana dashboards | Provisioned from Git | On every change | Forgejo + offsite mirror |
| Woodpecker configs | .woodpecker.yml in repo | On every change | Forgejo + offsite mirror |
| TLS certificates | Let's Encrypt auto-renewal | N/A (renewable) | Private keys in encrypted backup |

5. Encryption Requirements

| Stage | Method |
|---|---|
| In transit to offsite | TLS 1.2+ (B2 API over HTTPS) |
| At rest (offsite) | AES-256 (B2 SSE + rclone crypt overlay) |
| At rest (local) | LUKS volume encryption on backup filesystem |
| Encryption key storage | Stored separately from backups; documented in sealed envelope (offline) and in password manager |

6. Restore Procedures

6.1 Full Cluster Restore from pg_basebackup

# 1. Stop PostgreSQL
sudo systemctl stop postgresql@18-main

# 2. Move corrupted data directory
sudo mv /var/lib/postgresql/18/main /var/lib/postgresql/18/main.corrupted

# 3. Restore base backup (PostgreSQL refuses to start unless the data directory is mode 0700 or 0750)
sudo mkdir -m 700 /var/lib/postgresql/18/main
sudo tar -xf /backup/pgbase/YYYYMMDD/base.tar.zst \
  -C /var/lib/postgresql/18/main --zstd
# WAL streamed during the backup is stored separately in pg_wal.tar
sudo tar -xf /backup/pgbase/YYYYMMDD/pg_wal.tar \
  -C /var/lib/postgresql/18/main/pg_wal

# 4. Configure recovery for PITR (append, so settings restored from the backup are kept)
sudo tee -a /var/lib/postgresql/18/main/postgresql.auto.conf > /dev/null <<EOF
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = 'YYYY-MM-DD HH:MM:SS UTC'
recovery_target_action = 'promote'
EOF

# 5. Create recovery signal
sudo touch /var/lib/postgresql/18/main/recovery.signal

# 6. Set ownership and start
sudo chown -R postgres:postgres /var/lib/postgresql/18/main
sudo systemctl start postgresql@18-main

# 7. Monitor recovery in logs
sudo journalctl -u postgresql@18-main -f

# 8. Verify recovery (same validation query as the weekly restore test in 7.1)
psql -p 5432 -d tomo -c "SELECT count(*) FROM ag_catalog.ag_graph;"

6.2 Selective Restore from pg_dump

# Decompress the logical backup
zstd -d tomo-YYYYMMDD.sql.zst -o tomo-restore.sql

# Restore the entire database into a scratch database
createdb -p 5432 tomo_restore
psql -p 5432 -d tomo_restore < tomo-restore.sql

# To restore specific objects, extract the relevant statements from
# tomo-restore.sql and apply them manually

6.3 WAL-Only Recovery (within RPO window)

# For recovery of recent data without full restore:
# 1. Ensure base backup is in place
# 2. Configure restore_command to point to WAL archive
# 3. Set recovery_target_time to desired point
# 4. Start PostgreSQL — it will replay WAL from the base backup forward
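Steps 2 and 3 above can be sketched as a small helper that prints the recovery settings for a chosen target time. This is a minimal illustration: the function name and the example timestamp are hypothetical, and the default WAL directory is the archive path used in 6.1.

```shell
#!/bin/bash
# Print the recovery settings for a point-in-time restore.
# $1: recovery target time; $2: WAL archive directory (defaults to /backup/wal)
pitr_settings() {
  local target_time=$1
  local wal_dir=${2:-/backup/wal}
  cat <<EOF
restore_command = 'cp ${wal_dir}/%f %p'
recovery_target_time = '${target_time}'
recovery_target_action = 'promote'
EOF
}

# Example with an illustrative timestamp
pitr_settings "2026-03-02 10:30:00 UTC"
```

The output would be appended to postgresql.auto.conf in the restored data directory, followed by creating recovery.signal as in 6.1.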

6.4 Offsite Restore (when local backups are unavailable)

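# 1. Download from B2
rclone copy b2:tomo-backups/pgbase/YYYYMMDD /backup/pgbase/YYYYMMDD

# 2. Download WAL segments for PITR
rclone copy b2:tomo-backups/wal/ /backup/wal/

# 3. Decompress WAL segments, which are zstd-compressed before upload (4.2);
#    this assumes they carry the default .zst suffix
zstd -d --rm /backup/wal/*.zst

# 4. Follow full cluster restore procedure (6.1)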

7. Restore Testing

7.1 Weekly Automated Restore Test

A weekly automated test validates that backups can be restored successfully.

#!/bin/bash
# /opt/tomo/scripts/backup-test.sh
# Runs weekly via systemd timer (Wednesday 04:00 UTC)

TEST_DIR="/backup/restore-test"
TEST_PORT=5434
LATEST_BACKUP=$(ls -td /backup/pgbase/*/ | head -1)
LATEST_BACKUP="${LATEST_BACKUP%/}"   # drop the trailing slash

# 1. Restore to test directory (PostgreSQL requires mode 0700 or 0750 on the data directory)
rm -rf "$TEST_DIR"
mkdir -p -m 700 "$TEST_DIR"
tar -xf "$LATEST_BACKUP/base.tar.zst" -C "$TEST_DIR" --zstd
# WAL streamed during the backup is needed to reach a consistent state
tar -xf "$LATEST_BACKUP/pg_wal.tar" -C "$TEST_DIR/pg_wal"

# 2. Start temporary PostgreSQL instance; archiving is disabled so the test
#    instance cannot write into the production WAL archive
pg_ctl -D "$TEST_DIR" -o "-p $TEST_PORT -c archive_mode=off" -l "$TEST_DIR/test.log" start

# 3. Run validation queries
RESULT=$(psql -p $TEST_PORT -d tomo -t -c "SELECT count(*) FROM ag_catalog.ag_graph;" 2>&1)
if [ $? -ne 0 ]; then
  echo "RESTORE TEST FAILED: $RESULT" | \
    mail -s "[ALERT] tomo restore test failure" security@rizlabs.com
  PASSED=false
else
  PASSED=true
fi

# 4. Stop and clean up
pg_ctl -D "$TEST_DIR" stop
rm -rf "$TEST_DIR"

# 5. Log result
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | restore-test | backup=$LATEST_BACKUP | passed=$PASSED" \
  >> /var/log/tomo/restore-tests.log

# 6. Exit non-zero on failure so the monitoring check in section 8 can alert
[ "$PASSED" = true ]

7.2 Test Documentation

Each restore test (automated or manual) is logged with:

| Field | Description |
|---|---|
| Date | When the test was performed |
| Backup source | Which backup was restored (date, tier) |
| Restore type | Full cluster, logical, PITR |
| Duration | Time from start to verified restore |
| Result | Pass or fail, with details on failure |
| Tester | Person or automated |

Test results are retained for 12 months in /var/log/tomo/restore-tests.log and summarized in quarterly compliance reviews.
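A log line that carries all of the fields above could be produced along these lines. This is a sketch only: the function name, the field labels, and the example values are hypothetical, and the pipe-separated layout simply mirrors the log line written by the weekly test script.

```shell
#!/bin/bash
# Format one restore-test record with the fields listed in 7.2.
# Arguments: backup source, restore type, start epoch, end epoch, result, tester
format_restore_record() {
  local src=$1 type=$2 start=$3 end=$4 result=$5 tester=$6
  local duration=$(( end - start ))   # seconds from start to verified restore
  printf '%s | source=%s | type=%s | duration=%ss | result=%s | tester=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$src" "$type" "$duration" "$result" "$tester"
}

start=$(date +%s)
# ... restore and validation steps would run here ...
end=$(date +%s)
format_restore_record "pgbase/20260302" "full cluster" "$start" "$end" pass automated
```

Recording the duration on every run gives the quarterly review a measured restore time to compare against the 4 hour RTO.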

7.3 Annual Full Recovery Drill

Once per year, perform a complete disaster recovery simulation:

  1. Treat local backups as unavailable
  2. Restore entirely from offsite (B2)
  3. Validate data integrity against known checksums
  4. Measure actual RTO and RPO achieved
  5. Document findings and update this policy if targets are not met
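Step 3 depends on checksums recorded when the backup was taken. One possible approach, assuming GNU coreutils and a hypothetical manifest file named SHA256SUMS stored next to the backup files, is sketched below; the function names are illustrative.

```shell
#!/bin/bash
# Record checksums for every file under a backup directory.
# The manifest name SHA256SUMS is a hypothetical convention.
write_manifest() {
  local dir=$1
  ( cd "$dir" && find . -type f ! -name SHA256SUMS -print0 \
      | sort -z | xargs -0 -r sha256sum > SHA256SUMS )
}

# Verify a restored copy against its manifest; non-zero on any mismatch.
verify_manifest() {
  local dir=$1
  ( cd "$dir" && sha256sum --quiet -c SHA256SUMS )
}

# Demonstration on a throwaway directory
demo=$(mktemp -d)
echo "sample" > "$demo/base.tar.zst"
write_manifest "$demo"
verify_manifest "$demo" && echo "checksums match"
rm -rf "$demo"
```

In a drill, the manifest would be written at backup time, uploaded with the backup, and verified after the download from B2 and before the restore starts.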

8. Monitoring and Alerting

| Check | Frequency | Alert Condition |
|---|---|---|
| Backup job completion | Daily | Backup script exits non-zero |
| Backup file size | Daily | Base backup < 80% of previous size (possible truncation) |
| WAL archiving lag | Every 5 minutes | last_archived_time in pg_stat_archiver is more than 10 minutes old |
| Offsite upload | Daily | rclone transfer failure |
| Restore test | Weekly | Test script exits non-zero |
| Backup disk space | Hourly | Backup volume > 80% full |

Alerts are routed through Grafana to email and webhook.
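Two of the alert conditions above reduce to simple threshold comparisons. The sketch below shows them as shell predicates; the function names are hypothetical, and the inputs (file sizes in bytes, timestamps as Unix epoch seconds) are assumed to be collected elsewhere.

```shell
#!/bin/bash
# Succeeds (exit 0) when the current backup is smaller than 80% of the
# previous one, which the table above treats as possible truncation.
backup_shrunk() {
  local prev_bytes=$1 cur_bytes=$2
  [ $(( cur_bytes * 100 )) -lt $(( prev_bytes * 80 )) ]
}

# Succeeds when the last archived WAL segment is more than 10 minutes old.
# Both arguments are Unix epoch seconds.
archive_lagging() {
  local last_archived=$1 now=$2
  [ $(( now - last_archived )) -gt 600 ]
}

backup_shrunk 1000000 750000 && echo "ALERT: base backup is below 80% of the previous size"
archive_lagging 1000 1500 || echo "WAL archiving is current"
```

Integer arithmetic avoids floating point in the shell: comparing `cur * 100` with `prev * 80` is equivalent to testing whether the ratio is below 0.8.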

9. Retention Schedule

| Backup Type | Local Retention | Offsite Retention |
|---|---|---|
| WAL segments | 7 days | 30 days |
| pg_basebackup (daily) | 30 days | 90 days |
| pg_dump (weekly) | 30 days | 90 days |
| Infrastructure config | Indefinite (Git) | Indefinite (Git mirror) |
| Restore test logs | 12 months | 12 months |
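Local retention for the daily base backups could be enforced by comparing each directory's date stamp with a cutoff. The sketch below is a dry run under stated assumptions: it relies on GNU date, the function name is hypothetical, and it only prints what it would remove.

```shell
#!/bin/bash
# Succeeds when a backup directory named YYYYMMDD is older than the
# retention window. Relies on GNU date for the -d option.
# $1: date stamp; $2: retention in days; $3: reference date (defaults to today)
is_expired() {
  local stamp=$1 retention_days=$2 today=${3:-$(date -u +%Y%m%d)}
  local cutoff
  cutoff=$(date -u -d "$today $retention_days days ago" +%Y%m%d)
  [ "$stamp" -lt "$cutoff" ]
}

# Dry run over the local base backups (30-day local retention)
for dir in /backup/pgbase/*/; do
  [ -d "$dir" ] || continue
  name=$(basename "$dir")
  [[ $name =~ ^[0-9]{8}$ ]] || continue   # skip anything not named YYYYMMDD
  if is_expired "$name" 30; then
    echo "would remove $dir"
  fi
done
```

Keeping the removal as a printed dry run makes the sketch safe to try; an actual cleanup would replace the `echo` with a deletion only after the offsite copy has been confirmed.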

10. Compliance Mapping

| SOC 2 Criteria | Control |
|---|---|
| A1.2 | Recovery procedures and backup processes |
| A1.3 | Recovery testing and validation |
| CC7.5 | Recovery from identified security incidents |
| CC6.1 | Protection of backup data (encryption, access control) |