Backup and Recovery Policy
Effective Date: 2026-03-02 | Last Review: 2026-03-02 | Next Review: 2026-09-02 | Owner: Greg Felice, Project Lead
1. Purpose
This policy defines backup procedures, recovery objectives, and testing requirements for all tomo data stores. It ensures that data can be restored within the defined recovery time and recovery point objectives after any failure, corruption, or security incident.
2. Scope
Applies to:
- PostgreSQL databases — tomo development and hosted service instances (PG18 with AGE, pgvector)
- Infrastructure configuration — Ansible playbooks, nginx configs, Authentik configuration
- CI/CD state — Forgejo repositories, Woodpecker pipeline configurations
- Monitoring data — Grafana dashboards and alert rules, Prometheus data
- Secrets — encrypted credential backups
3. Recovery Objectives
| Objective | Target | Notes |
|---|---|---|
| Recovery Time Objective (RTO) | 4 hours | Time from incident declaration to service restoration |
| Recovery Point Objective (RPO) | 1 hour | Maximum acceptable data loss, achieved via WAL archiving |
These targets apply to the hosted service. Development databases have relaxed targets (RTO: 24h, RPO: 24h).
4. Backup Architecture
4.1 PostgreSQL Backup Strategy
The backup strategy uses a three-tier approach:
| Tier | Method | Frequency | Retention | Purpose |
|---|---|---|---|---|
| Tier 1 | WAL archiving | Continuous | 7 days local, 30 days offsite | Point-in-time recovery within RPO |
| Tier 2 | pg_basebackup | Daily at 02:00 UTC | 30 days local, 90 days offsite | Full cluster recovery |
| Tier 3 | pg_dump (logical) | Weekly (Sunday 03:00 UTC) | 30 days local, 90 days offsite | Cross-version restore, selective recovery |
4.2 WAL Archiving Configuration
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
archive_timeout = 300 # force archive every 5 minutes (worst-case RPO)
WAL segments are:
- Written to local backup directory on a separate filesystem
- Compressed with zstd before offsite transfer
- Uploaded to B2-compatible storage within 15 minutes of archival
- Retained for 7 days locally, 30 days offsite
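The settings above imply a worst-case data-loss window that can be checked against the 1 hour RPO. The sketch below works through that arithmetic under two simplifying assumptions: a segment is archived as soon as `archive_timeout` forces a switch, and the offsite upload finishes within the stated 15 minutes. The function names are hypothetical and not part of the tomo scripts.

```shell
#!/bin/bash
# Worst-case data-loss window in seconds (simplified model, see assumptions above).
worst_case_loss_seconds() {
    local archive_timeout=$1   # seconds, archive_timeout from postgresql.conf
    local upload_window=$2     # seconds until a segment is offsite (0 if the local archive survives)
    echo $(( archive_timeout + upload_window ))
}

# Succeeds (exit 0) when the window fits inside the RPO target.
within_rpo() {
    local loss_seconds=$1 rpo_seconds=$2
    [ "$loss_seconds" -le "$rpo_seconds" ]
}

local_loss=$(worst_case_loss_seconds 300 0)      # local WAL archive is still readable
offsite_loss=$(worst_case_loss_seconds 300 900)  # only offsite copies are left
echo "local=${local_loss}s offsite=${offsite_loss}s"
within_rpo "$offsite_loss" 3600 && echo "offsite window fits the 1 hour RPO"
```

Under these assumptions the window is 5 minutes when the local archive survives and 20 minutes when only the offsite copies do, which leaves headroom below the 1 hour target.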
4.3 Daily Full Backup (pg_basebackup)
#!/bin/bash
# /opt/tomo/scripts/backup-full.sh
# Runs daily via systemd timer at 02:00 UTC
set -euo pipefail
# Compute the date once so the directory, label, and upload path always agree
STAMP="$(date +%Y%m%d)"
BACKUP_DIR="/backup/pgbase/$STAMP"
mkdir -p "$BACKUP_DIR"
# Send a failure alert by email
alert() {
    echo "$1" | mail -s "[ALERT] tomo backup failure" security@rizlabs.com
}
if ! pg_basebackup \
    --host=localhost \
    --port=5432 \
    --username=tomo_admin \
    --pgdata="$BACKUP_DIR" \
    --format=tar \
    --compress=zstd:6 \
    --checkpoint=fast \
    --wal-method=stream \
    --label="tomo-daily-$STAMP"; then
    alert "BACKUP FAILED: pg_basebackup exited non-zero"
    exit 1
fi
# Verify the base archive exists and is non-empty
if [ ! -s "$BACKUP_DIR/base.tar.zst" ]; then
    alert "BACKUP FAILED: base.tar.zst missing or empty"
    exit 1
fi
# Upload to offsite storage; exit non-zero on failure so monitoring can alert
if ! rclone copy "$BACKUP_DIR" "b2:tomo-backups/pgbase/$STAMP" \
    --b2-hard-delete \
    --log-file=/var/log/tomo/backup-upload.log; then
    alert "BACKUP UPLOAD FAILED: see /var/log/tomo/backup-upload.log"
    exit 1
fi
4.4 Weekly Logical Backup (pg_dump)
#!/bin/bash
# /opt/tomo/scripts/backup-logical.sh
# Runs weekly via systemd timer (Sunday 03:00 UTC)
# pipefail makes a pg_dump failure fail the whole pipeline instead of being masked by zstd;
# errexit then stops the script with a non-zero status before anything is uploaded
set -euo pipefail
DUMP_FILE="/backup/pgdump/tomo-$(date +%Y%m%d).sql.zst"
pg_dump \
    --host=localhost \
    --port=5432 \
    --username=tomo_admin \
    --dbname=tomo \
    --format=plain \
    --no-owner \
    --no-privileges \
    | zstd -6 > "$DUMP_FILE"
# Upload to offsite storage
rclone copy "$DUMP_FILE" b2:tomo-backups/pgdump/ \
    --b2-hard-delete \
    --log-file=/var/log/tomo/backup-upload.log
4.5 Offsite Storage
| Property | Value |
|---|---|
| Provider | Backblaze B2 (S3-compatible API) |
| Bucket | tomo-backups |
| Encryption | AES-256 server-side encryption (B2 SSE) + client-side encryption via rclone crypt remote |
| Access | Application key scoped to single bucket, write-only (no delete without lifecycle rule) |
| Lifecycle | Auto-delete after 90 days |
| Region | US West |
4.6 Infrastructure Configuration Backup
| Component | Method | Frequency | Storage |
|---|---|---|---|
| Ansible playbooks | Git (Forgejo) | On every change | Forgejo + offsite mirror |
| nginx configuration | Git (Forgejo) | On every change | Forgejo + offsite mirror |
| Authentik configuration | Export via API | Weekly | B2 encrypted |
| Grafana dashboards | Provisioned from Git | On every change | Forgejo + offsite mirror |
| Woodpecker configs | .woodpecker.yml in repo | On every change | Forgejo + offsite mirror |
| TLS certificates | Let's Encrypt auto-renewal | N/A (renewable) | Private keys in encrypted backup |
5. Encryption Requirements
| Stage | Method |
|---|---|
| In transit to offsite | TLS 1.2+ (B2 API over HTTPS) |
| At rest (offsite) | AES-256 (B2 SSE + rclone crypt overlay) |
| At rest (local) | LUKS volume encryption on backup filesystem |
| Encryption key storage | Stored separately from backups; documented in sealed envelope (offline) and in password manager |
6. Restore Procedures
6.1 Full Cluster Restore from pg_basebackup
# 1. Stop PostgreSQL
sudo systemctl stop postgresql@18-main
# 2. Move corrupted data directory
sudo mv /var/lib/postgresql/18/main /var/lib/postgresql/18/main.corrupted
# 3. Restore base backup
sudo mkdir /var/lib/postgresql/18/main
sudo tar -xf /backup/pgbase/YYYYMMDD/base.tar.zst \
    -C /var/lib/postgresql/18/main --zstd
# 4. Configure recovery for PITR (append as root; a plain shell redirect would run unprivileged)
sudo tee -a /var/lib/postgresql/18/main/postgresql.auto.conf > /dev/null <<EOF
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = 'YYYY-MM-DD HH:MM:SS UTC'
recovery_target_action = 'promote'
EOF
# 5. Create recovery signal
sudo touch /var/lib/postgresql/18/main/recovery.signal
# 6. Set ownership and permissions, then start
#    (PostgreSQL refuses to start unless the data directory mode is 0700 or 0750)
sudo chown -R postgres:postgres /var/lib/postgresql/18/main
sudo chmod 700 /var/lib/postgresql/18/main
sudo systemctl start postgresql@18-main
# 7. Monitor recovery in logs
sudo journalctl -u postgresql@18-main -f
# 8. Verify recovery (same catalog check as the weekly restore test in 7.1)
psql -p 5432 -d tomo -c "SELECT count(*) FROM ag_catalog.ag_graph;"
6.2 Selective Restore from pg_dump
# Decompress the logical backup
zstd -d tomo-YYYYMMDD.sql.zst -o tomo-restore.sql
# Restore the entire database into a scratch database (create it first)
createdb -p 5432 tomo_restore
psql -p 5432 -d tomo_restore -v ON_ERROR_STOP=1 < tomo-restore.sql
# Restore specific objects (extract from dump, apply manually)
6.3 WAL-Only Recovery (within RPO window)
# For recovery of recent data without full restore:
# 1. Ensure base backup is in place
# 2. Configure restore_command to point to WAL archive
# 3. Set recovery_target_time to desired point
# 4. Start PostgreSQL — it will replay WAL from the base backup forward
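Steps 2 and 3 above come down to writing three recovery settings. As a minimal sketch, a helper can generate them for a chosen target time so the values are not typed by hand during an incident. `recovery_settings` is a hypothetical name and not part of the tomo scripts; the default WAL directory is the one used in section 6.1.

```shell
#!/bin/bash
# Print PITR recovery settings for a given target time.
recovery_settings() {
    local target_time=$1              # e.g. '2026-03-02 13:45:00 UTC'
    local wal_dir=${2:-/backup/wal}   # WAL archive directory read by restore_command
    cat <<EOF
restore_command = 'cp ${wal_dir}/%f %p'
recovery_target_time = '${target_time}'
recovery_target_action = 'promote'
EOF
}
```

One way to use it is `recovery_settings '2026-03-02 13:45:00 UTC' | sudo tee -a /var/lib/postgresql/18/main/postgresql.auto.conf`, followed by creating `recovery.signal` and starting PostgreSQL as in section 6.1.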
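6.4 Offsite Restore (when local backups are unavailable)
# 1. Download from B2
rclone copy b2:tomo-backups/pgbase/YYYYMMDD /backup/pgbase/YYYYMMDD
# 2. Download WAL segments for PITR
rclone copy b2:tomo-backups/wal/ /backup/wal/
# 3. Decompress the segments: offsite WAL is zstd-compressed (section 4.2), but the
#    restore_command in 6.1 copies plain files (assumes uploaded segments carry a .zst suffix)
zstd -d --rm /backup/wal/*.zst
# 4. Follow full cluster restore procedure (6.1)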
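7. Restore Testing
7.1 Weekly Automated Restore Test
A weekly automated test validates that backups can be restored successfully.
#!/bin/bash
# /opt/tomo/scripts/backup-test.sh
# Runs weekly via systemd timer (Wednesday 04:00 UTC)
TEST_DIR="/backup/restore-test"
TEST_PORT=5434
LATEST_BACKUP=$(ls -td /backup/pgbase/*/ | head -1)
# 1. Restore to test directory
rm -rf "$TEST_DIR"
mkdir -p "$TEST_DIR"
tar -xf "$LATEST_BACKUP/base.tar.zst" -C "$TEST_DIR" --zstd
# With --format=tar and --wal-method=stream, the WAL needed to reach a
# consistent state is stored separately in pg_wal.tar
mkdir -p "$TEST_DIR/pg_wal"
tar -xf "$LATEST_BACKUP/pg_wal.tar" -C "$TEST_DIR/pg_wal"
# PostgreSQL refuses to start unless the data directory mode is 0700 or 0750
chmod 700 "$TEST_DIR"
# 2. Start temporary PostgreSQL instance (archiving off so it cannot write to /backup/wal)
PASSED=false
if pg_ctl -D "$TEST_DIR" -o "-p $TEST_PORT -c archive_mode=off" -l "$TEST_DIR/test.log" start; then
    # 3. Run validation queries
    if RESULT=$(psql -p "$TEST_PORT" -d tomo -t -c "SELECT count(*) FROM ag_catalog.ag_graph;" 2>&1); then
        PASSED=true
    fi
    pg_ctl -D "$TEST_DIR" stop
else
    RESULT="temporary instance failed to start"
fi
if [ "$PASSED" != true ]; then
    echo "RESTORE TEST FAILED: $RESULT" | \
        mail -s "[ALERT] tomo restore test failure" security@rizlabs.com
fi
# 4. Clean up
rm -rf "$TEST_DIR"
# 5. Log result
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | restore-test | backup=$LATEST_BACKUP | passed=$PASSED" \
    >> /var/log/tomo/restore-tests.log
# Exit non-zero on failure so the restore-test check in section 8 can alert
[ "$PASSED" = true ]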
7.2 Test Documentation
Each restore test (automated or manual) is logged with:
| Field | Description |
|---|---|
| Date | When the test was performed |
| Backup source | Which backup was restored (date, tier) |
| Restore type | Full cluster, logical, PITR |
| Duration | Time from start to verified restore |
| Result | Pass or fail, with details on failure |
| Tester | Person or automated |
Test results are retained for 12 months in /var/log/tomo/restore-tests.log and summarized in quarterly compliance reviews.
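As an illustration, a log entry carrying every field in the table above could be produced by a small helper like the one below. The function name, argument order, and field labels are assumptions made for this sketch; the pipe-separated layout follows the existing line format in /var/log/tomo/restore-tests.log.

```shell
#!/bin/bash
# Format one restore-test log entry. start and end are epoch seconds
# (for example from `date +%s`), so duration is their difference.
log_restore_test() {
    local backup=$1 tier=$2 restore_type=$3 start=$4 end=$5 result=$6 tester=$7
    local duration=$(( end - start ))
    printf '%s | restore-test | backup=%s | tier=%s | type=%s | duration=%ss | result=%s | tester=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$backup" "$tier" "$restore_type" "$duration" "$result" "$tester"
}
```

A manual test might record `start=$(date +%s)` before the restore and then append the output of `log_restore_test /backup/pgbase/YYYYMMDD tier2 full-cluster "$start" "$(date +%s)" pass "$USER"` to the log file.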
7.3 Annual Full Recovery Drill
Once per year, perform a complete disaster recovery simulation:
- Treat local backups as unavailable
- Restore entirely from offsite (B2)
- Validate data integrity against known checksums
- Measure actual RTO and RPO achieved
- Document findings and update this policy if targets are not met
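The checksum step in the drill presupposes that checksums were recorded when the backup was taken. A minimal sketch of one way to do this with GNU coreutils follows: write a SHA-256 manifest next to the backup files, then verify the restored copy against it. The helper names and the SHA256SUMS file name are assumptions, not part of the existing scripts.

```shell
#!/bin/bash
# Write a SHA-256 manifest covering every file under a backup directory.
write_manifest() {
    local dir=$1
    ( cd "$dir" &&
      find . -type f ! -name SHA256SUMS -print0 | sort -z |
      xargs -0 -r sha256sum > SHA256SUMS )
}

# Verify a directory against its manifest; exits non-zero on any mismatch.
verify_manifest() {
    local dir=$1
    ( cd "$dir" && sha256sum --quiet -c SHA256SUMS )
}
```

If `write_manifest` runs before the offsite upload, the manifest travels with the backup, and `verify_manifest /backup/pgbase/YYYYMMDD` after the download in section 6.4 would flag any file that changed in storage or transit.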
8. Monitoring and Alerting
| Check | Frequency | Alert Condition |
|---|---|---|
| Backup job completion | Daily | Backup script exits non-zero |
| Backup file size | Daily | Base backup < 80% of previous size (possible truncation) |
| WAL archiving lag | Every 5 minutes | Time since the last archived WAL segment (pg_stat_archiver.last_archived_time) > 10 minutes |
| Offsite upload | Daily | rclone transfer failure |
| Restore test | Weekly | Test script exits non-zero |
| Backup disk space | Hourly | Backup volume > 80% full |
Alerts are routed through Grafana to email and webhook.
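The threshold conditions in the table can be expressed as small predicates. The sketch below covers the size, WAL lag, and disk space checks. The function names are hypothetical, and how the resulting status reaches Grafana is left open because the policy does not specify it.

```shell
#!/bin/bash
# Succeeds (exit 0) when the current base backup is below 80% of the previous size.
backup_shrunk() {
    local current_bytes=$1 previous_bytes=$2
    [ $(( current_bytes * 100 )) -lt $(( previous_bytes * 80 )) ]
}

# Succeeds when the last archived WAL segment is older than the threshold
# (default 600 seconds, the 10 minute limit in the table).
wal_lag_exceeded() {
    local age_seconds=$1 threshold=${2:-600}
    [ "$age_seconds" -gt "$threshold" ]
}

# Succeeds when the backup volume is fuller than the limit (default 80 percent).
disk_over_limit() {
    local used_percent=$1 limit=${2:-80}
    [ "$used_percent" -gt "$limit" ]
}
```

The inputs could be gathered with `stat -c %s` on the two base.tar.zst files, `psql -At -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - last_archived_time)::int, 0) FROM pg_stat_archiver"`, and `df --output=pcent /backup | tail -1 | tr -dc 0-9`. Keeping the comparisons in separate functions lets the thresholds be tested without a running database.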
9. Retention Schedule
| Backup Type | Local Retention | Offsite Retention |
|---|---|---|
| WAL segments | 7 days | 30 days |
| pg_basebackup (daily) | 30 days | 90 days |
| pg_dump (weekly) | 30 days | 90 days |
| Infrastructure config | Indefinite (Git) | Indefinite (Git mirror) |
| Restore test logs | 12 months | 12 months |
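Offsite retention is handled by the B2 lifecycle rule, but the backup scripts shown earlier do not prune local copies. The following is a hedged sketch of how the 30 day local retention for base backups might be enforced with GNU `find`. `prune_old_backups` is a hypothetical helper that only lists candidates unless `--delete` is passed.

```shell
#!/bin/bash
# List (or, with --delete, remove) dated backup directories directly under
# $1 whose modification time is more than $2 days old.
# find rounds ages down to whole days, so "+30" matches directories that
# are at least 31 days old.
prune_old_backups() {
    local root=$1 days=$2 mode=${3:-}
    if [ "$mode" = "--delete" ]; then
        find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -exec rm -rf {} +
    else
        find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -print
    fi
}
```

Running `prune_old_backups /backup/pgbase 30` first shows what would be removed; adding `--delete` performs the removal. Defaulting to a dry run is a deliberate choice here, since a wrong path argument combined with `rm -rf` is costly.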
10. Compliance Mapping
| SOC 2 Criteria | Control |
|---|---|
| A1.2 | Recovery procedures and backup processes |
| A1.3 | Recovery testing and validation |
| CC7.4 | Recovery from identified security incidents |
| CC6.1 | Protection of backup data (encryption, access control) |