Incident Response Plan

Effective Date: 2026-03-02
Last Review: 2026-03-02
Next Review: 2026-09-02
Owner: Greg Felice, Project Lead

1. Purpose

This plan defines how security incidents affecting the tomo ecosystem are detected, triaged, contained, resolved, and reviewed. It ensures a consistent, timely response that minimizes damage and preserves evidence for analysis.

2. Scope

This plan covers incidents affecting:

  • tomo SDK: compromised PyPI package, supply chain attacks, malicious contributions
  • tomo Docker image: compromised container, unauthorized image pushes
  • tomo hosted service: unauthorized data access, service disruption, data breach
  • Infrastructure: dweezil server compromise, CI/CD pipeline abuse, credential theft
  • Third-party services: breaches at Docker Hub, PyPI, GitHub, Backblaze B2

3. Severity Levels

| Severity | Description | Response Time | Update Frequency | Examples |
|----------|-------------|---------------|------------------|----------|
| P1 (Critical) | Active exploitation, data breach, complete service outage | 30 minutes | Every 2 hours | Compromised PyPI package, database exfiltration, server root compromise |
| P2 (High) | Confirmed vulnerability under active threat, partial outage | 4 hours | Every 8 hours | Exploitable CVE in production dependency, unauthorized CI credential use |
| P3 (Medium) | Vulnerability identified but not actively exploited | 24 hours | Daily | Dependency CVE with no known exploit, misconfigured firewall rule |
| P4 (Low) | Minor security issue, informational | 72 hours | As needed | Failed brute-force attempts, stale user account, policy deviation |
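
The response times and update frequencies above can be encoded so that deadlines are computed rather than remembered under pressure. The following is a minimal sketch, not part of the plan's tooling: `SeverityPolicy`, `response_deadline`, and `next_update_due` are hypothetical names, and "Daily" is assumed to mean every 24 hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_time: timedelta
    update_interval: Optional[timedelta]  # None stands for "as needed"


# Values copied from the severity table; the structure itself is illustrative.
SEVERITY_POLICIES = {
    "P1": SeverityPolicy("Critical", timedelta(minutes=30), timedelta(hours=2)),
    "P2": SeverityPolicy("High", timedelta(hours=4), timedelta(hours=8)),
    "P3": SeverityPolicy("Medium", timedelta(hours=24), timedelta(hours=24)),
    "P4": SeverityPolicy("Low", timedelta(hours=72), None),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Latest time by which the response should have started."""
    return detected_at + SEVERITY_POLICIES[severity].response_time


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Time of the next status update, or None when updates are as needed."""
    interval = SEVERITY_POLICIES[severity].update_interval
    return None if interval is None else last_update + interval


if __name__ == "__main__":
    detected = datetime(2026, 3, 2, 12, 0, tzinfo=timezone.utc)
    print(response_deadline("P1", detected).isoformat())
```

Keeping the table in one mapping means a change to a response time is made in a single place.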

4. Roles and Responsibilities

| Role | Responsibility | Current Holder |
|------|----------------|----------------|
| Incident Commander (IC) | Owns the incident lifecycle, makes containment decisions, coordinates communication | Greg Felice |
| Technical Lead | Performs investigation, implements containment and remediation | Greg Felice |
| Communications Lead | Drafts external notifications, updates stakeholders | Greg Felice |

As the team grows, these roles will be distributed. Until then, the Project Lead fills all roles.

5. Detection Sources

| Source | What It Detects | Alert Method |
|--------|-----------------|--------------|
| Grafana / Prometheus | Service health, anomalous database connections, disk/CPU spikes | Grafana alerting (email, webhook) |
| PostgreSQL audit logs | Unauthorized queries, failed auth, DDL changes | Log review, Grafana dashboard |
| Woodpecker CI | SAST findings (bandit), dependency CVEs (pip-audit), container vulnerabilities (trivy), leaked secrets (trufflehog) | Pipeline failure notification |
| Authentik | Failed login attempts, MFA bypass attempts, unusual session activity | Authentik event logs |
| Systemd journal | SSH access, sudo usage, service crashes | Log review |
| External reports | Vulnerability reports via SECURITY.md process | Email (security@rizlabs.com) |
| Uptime monitoring | Service availability | Alerting (webhook) |

6. Incident Response Process

Phase 1: Detection and Reporting

  1. Incident detected via monitoring alert, CI failure, log review, or external report
  2. Document initial observations: what, when, how detected, affected systems
  3. Open incident record (Forgejo issue with security-incident label, or offline log if Forgejo is compromised)
  4. Assign severity level (P1-P4)

Phase 2: Triage

  1. Confirm the incident is real (not a false positive)
  2. Identify affected systems, data, and users
  3. Determine if the incident is ongoing or historical
  4. Reassess severity if initial assessment was incorrect
  5. Decide on containment strategy (see Phase 3)

Triage decision tree:

  • Is data being actively exfiltrated? --> P1, immediate containment
  • Is a published artifact (PyPI, Docker Hub) compromised? --> P1, immediate yank/retag
  • Is infrastructure access compromised? --> P1/P2, rotate credentials immediately
  • Is a vulnerability confirmed but not exploited? --> P2/P3, plan remediation
  • Is this informational only? --> P4, document and schedule fix
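
The decision tree above is an ordered series of questions, and the first "yes" decides the outcome. A minimal sketch of that ordering follows. `TriageFacts` and `triage` are hypothetical names, and where the tree gives a range such as P1/P2 the function returns the range unchanged, leaving the final choice to the Incident Commander.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TriageFacts:
    """Answers to the triage questions; every field defaults to 'no'."""
    active_exfiltration: bool = False
    published_artifact_compromised: bool = False
    infrastructure_access_compromised: bool = False
    vulnerability_confirmed: bool = False


def triage(facts: TriageFacts) -> Tuple[str, str]:
    """Return (severity, action), asking the questions in the listed order."""
    if facts.active_exfiltration:
        return ("P1", "immediate containment")
    if facts.published_artifact_compromised:
        return ("P1", "immediate yank/retag")
    if facts.infrastructure_access_compromised:
        return ("P1/P2", "rotate credentials immediately")
    if facts.vulnerability_confirmed:
        return ("P2/P3", "plan remediation")
    # Nothing above applies: treat as informational.
    return ("P4", "document and schedule fix")


if __name__ == "__main__":
    print(triage(TriageFacts(published_artifact_compromised=True)))
```

Because the checks run top to bottom, an incident that matches several questions is assigned the most severe matching outcome.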

Phase 3: Containment

Short-term containment (stop the bleeding):

| Scenario | Action |
|----------|--------|
| Compromised server | Isolate network (firewall deny-all), preserve disk state |
| Compromised PyPI package | Yank the compromised version on PyPI; advise users to pin a known-good release (pip install tomo-sdk==<safe-version>) |
| Compromised Docker image | Remove tag from Docker Hub, push known-good image |
| Credential theft | Rotate all affected credentials immediately |
| Database breach | Revoke compromised roles, add reject rules to pg_hba.conf and reload PostgreSQL |
| CI pipeline abuse | Disable Woodpecker pipelines, rotate CI secrets |

Long-term containment (prevent recurrence while preserving evidence):

  1. Rebuild affected systems from known-good state if necessary
  2. Apply patches or configuration changes
  3. Enhance monitoring for the attack vector
  4. Do not destroy forensic evidence (preserve logs, disk snapshots)

Phase 4: Eradication

  1. Identify root cause
  2. Remove attacker access (accounts, backdoors, malware)
  3. Patch the vulnerability that was exploited
  4. Verify no persistence mechanisms remain
  5. Scan for indicators of compromise (IoCs) across all systems
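
As an illustration of step 5, the sketch below searches exported log lines for known indicator strings. It is a simplified assumption of how such a scan might look, not the project's tooling: `find_ioc_hits` is a hypothetical helper, the sample address comes from a documentation-only IP range, and plain substring matching can over-match (for example, `203.0.113.7` also matches `203.0.113.70`).

```python
from typing import Iterable, List, Tuple


def find_ioc_hits(lines: Iterable[str], indicators: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line_number, indicator, line) for each case-insensitive substring match."""
    needles = [indicator.lower() for indicator in indicators if indicator]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for needle in needles:
            if needle in lowered:
                hits.append((number, needle, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample_log = [
        "Accepted publickey for root from 203.0.113.7 port 22",
        "session opened for user deploy",
    ]
    for hit in find_ioc_hits(sample_log, ["203.0.113.7"]):
        print(hit)
```

Run it against copies of the preserved logs rather than the live files, so the evidence itself stays untouched.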

Phase 5: Recovery

  1. Restore services from known-good backups if needed (see Backup and Recovery Policy)
  2. Verify system integrity before returning to production
  3. Monitor closely for 72 hours after recovery
  4. Confirm all rotated credentials are propagated to dependent systems

Phase 6: Post-Incident Review

  1. Conduct review within 5 business days of incident closure
  2. Use the post-incident review template (Section 9)
  3. Document lessons learned and action items
  4. Update policies, runbooks, and monitoring as needed
  5. Store review in docs/security/incident-reviews/YYYY-MM-DD-title.md
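
The review deadline and storage path can be derived mechanically. This sketch makes several assumptions: business days are Monday through Friday with public holidays ignored, the date in the filename is supplied by the caller (the plan does not say whether it is the incident date or the review date), and the title is reduced to lowercase words joined by hyphens. `review_due_date` and `review_path` are hypothetical names.

```python
import re
from datetime import date, timedelta


def review_due_date(closed_on: date, business_days: int = 5) -> date:
    """Count forward the given number of Monday-Friday days after closure."""
    current = closed_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def review_path(file_date: date, title: str) -> str:
    """Build docs/security/incident-reviews/YYYY-MM-DD-title.md from a free-text title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/security/incident-reviews/{file_date.isoformat()}-{slug}.md"


if __name__ == "__main__":
    closed = date(2026, 3, 2)
    print(review_due_date(closed))
    print(review_path(closed, "Compromised PyPI Package"))
```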

7. Escalation Contacts

| Priority | Contact | Method | Timeframe |
|----------|---------|--------|-----------|
| P1 | Greg Felice | Phone, Signal | Immediate, 24/7 |
| P2 | Greg Felice | Email, Signal | Within 4 hours |
| P3-P4 | Greg Felice | Email, Forgejo issue | Next business day |

External escalation (if required):

| Entity | When | Contact |
|--------|------|---------|
| PyPI Security | Compromised package | security@pypi.org |
| Docker Hub Security | Compromised image | security@docker.com |
| GitHub Security | Repository compromise | Via GitHub support |
| Upstream Apache AGE | Vulnerability in AGE | security@apache.org |
| Law enforcement | Criminal activity, data breach with legal reporting obligation | Local authorities |

8. Communication Templates

8.1 Internal Incident Declaration

Subject: [P{severity}] Security Incident — {brief description}

Incident ID: INC-YYYY-NNN
Severity: P{1-4}
Detected: {timestamp}
Affected Systems: {list}
Current Status: {Investigating | Containing | Remediating | Resolved}

Summary:
{What happened, what we know so far}

Immediate Actions Taken:
{What has been done}

Next Steps:
{What will be done next}

Incident Commander: Greg Felice
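
The header fields of this declaration can be filled in programmatically so that the ID format and status values stay consistent. The sketch below is illustrative only: `incident_id` and `declaration_header` are hypothetical helpers, zero-padding `NNN` to three digits is an assumption about the `INC-YYYY-NNN` format, and ISO 8601 is an assumed timestamp format.

```python
from datetime import datetime, timezone
from typing import List

# Status values taken from the template's "Current Status" field.
STATUSES = ("Investigating", "Containing", "Remediating", "Resolved")


def incident_id(year: int, sequence: int) -> str:
    """Format an ID as INC-YYYY-NNN (three-digit padding is an assumption)."""
    return f"INC-{year}-{sequence:03d}"


def declaration_header(year: int, sequence: int, severity: int,
                       detected: datetime, systems: List[str], status: str) -> str:
    """Render the fixed header fields of the internal declaration."""
    if severity not in (1, 2, 3, 4):
        raise ValueError(f"severity must be 1-4, got {severity}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    return "\n".join([
        f"Incident ID: {incident_id(year, sequence)}",
        f"Severity: P{severity}",
        f"Detected: {detected.isoformat()}",
        f"Affected Systems: {', '.join(systems)}",
        f"Current Status: {status}",
    ])


if __name__ == "__main__":
    detected = datetime(2026, 3, 2, 12, 0, tzinfo=timezone.utc)
    print(declaration_header(2026, 7, 1, detected, ["tomo hosted service"], "Investigating"))
```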

8.2 External User Notification (P1/P2 — Data Breach or Compromised Artifact)

Subject: Security Notice — tomo {SDK | Docker Image | Hosted Service}

We are writing to inform you of a security incident affecting {component}.

What happened:
{Clear, factual description}

What data/systems were affected:
{Specific scope}

What we have done:
{Containment and remediation actions}

What you should do:
{User actions: update version, rotate credentials, etc.}

Timeline:
- {timestamp}: Incident detected
- {timestamp}: Containment completed
- {timestamp}: Remediation deployed

We will provide updates as our investigation continues. If you have questions, contact security@rizlabs.com.

Greg Felice
tomo Project Lead

8.3 Public Advisory (for SDK/Docker supply chain incidents)

Subject: [SECURITY] tomo {version} — {CVE ID if applicable}

Affected versions: {version range}
Fixed version: {version}
Severity: {Critical | High | Medium | Low}

Description:
{Technical description of the vulnerability}

Impact:
{What an attacker could do}

Mitigation:
{Steps to fix: upgrade command, workaround}

Credit:
{Reporter, if they wish to be credited}

References:
- {CVE link}
- {Related advisory links}

9. Post-Incident Review Template

# Post-Incident Review: INC-YYYY-NNN

**Date of Review:** YYYY-MM-DD
**Incident Commander:** {name}
**Participants:** {names}

## Incident Summary

- **Severity:** P{1-4}
- **Duration:** {detection to resolution}
- **Affected Systems:** {list}
- **User Impact:** {description}

## Timeline

| Time (UTC) | Event |
|------------|-------|
| {timestamp} | {event} |

## Root Cause

{What caused the incident}

## What Went Well

- {item}

## What Could Be Improved

- {item}

## Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| {action} | {name} | {date} | {Open/Done} |

## Metrics

- **Time to detect:** {duration}
- **Time to contain:** {duration}
- **Time to resolve:** {duration}
- **Data exposed:** {scope or "None"}
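
The three duration metrics in the template can be computed from the timeline timestamps. The template does not define the intervals precisely, so this sketch assumes: time to detect runs from the estimated start of the incident to detection, and time to contain and time to resolve are both measured from detection (matching the template's "detection to resolution" duration). `incident_metrics` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def incident_metrics(started: datetime, detected: datetime,
                     contained: datetime, resolved: datetime) -> Dict[str, timedelta]:
    """Compute review metrics; all timestamps should share one timezone (UTC)."""
    if not (started <= detected <= contained <= resolved):
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_detect": detected - started,
        "time_to_contain": contained - detected,
        "time_to_resolve": resolved - detected,
    }


if __name__ == "__main__":
    utc = timezone.utc
    metrics = incident_metrics(
        started=datetime(2026, 3, 2, 9, 0, tzinfo=utc),
        detected=datetime(2026, 3, 2, 10, 30, tzinfo=utc),
        contained=datetime(2026, 3, 2, 11, 0, tzinfo=utc),
        resolved=datetime(2026, 3, 2, 16, 30, tzinfo=utc),
    )
    for name, value in metrics.items():
        print(name, value)
```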

10. Evidence Preservation

During any P1 or P2 incident:

  1. Do not reboot affected systems before preserving volatile evidence
  2. Capture full disk snapshot (LVM snapshot or filesystem-level copy)
  3. Export all relevant logs to a separate, secure location
  4. Record network connections and firewall rules (ss -tunapl, iptables -L -n)
  5. Capture running processes (ps auxf, /proc state)
  6. Timestamp and hash all evidence files (SHA-256)
  7. Maintain chain of custody documentation
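
Step 6 (timestamp and hash all evidence files) can be scripted so that every collected file gets the same treatment. The following is a minimal sketch under stated assumptions: `sha256_file` and `build_manifest` are hypothetical helpers, the manifest field names are illustrative, and the timestamp records when the hash was taken rather than when the file was created.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir: Path) -> List[Dict[str, object]]:
    """Return one entry per file under evidence_dir with its hash and a UTC timestamp."""
    recorded_at = datetime.now(timezone.utc).isoformat()
    entries = []
    for path in sorted(evidence_dir.rglob("*")):
        if path.is_file():
            entries.append({
                "file": path.relative_to(evidence_dir).as_posix(),
                "sha256": sha256_file(path),
                "size_bytes": path.stat().st_size,
                "recorded_at_utc": recorded_at,
            })
    return entries


if __name__ == "__main__":
    import json
    import sys

    # Usage: python manifest.py <evidence directory>
    print(json.dumps(build_manifest(Path(sys.argv[1])), indent=2))
```

Store the resulting manifest outside the evidence directory, so that writing it does not alter the set of files it describes.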

11. Testing

| Test | Frequency | Method |
|------|-----------|--------|
| Tabletop exercise | Annually | Walk through a P1 scenario with all role holders |
| Detection validation | Quarterly | Trigger test alerts and verify notification delivery |
| Runbook review | Semi-annually | Review and update all response procedures |
| Communication test | Annually | Send test notification through all escalation channels |

12. Compliance Mapping

| SOC 2 Criteria | Control |
|----------------|---------|
| CC7.2 | Monitoring of system components for anomalies |
| CC7.3 | Evaluation of identified security events |
| CC7.4 | Incident response and containment |
| CC7.5 | Communication of incidents to affected parties |
| CC2.3 | Internal communication of security matters |