Incident Response Plan
Effective Date: 2026-03-02
Last Review: 2026-03-02
Next Review: 2026-09-02
Owner: Greg Felice, Project Lead
1. Purpose
This plan defines how security incidents affecting the tomo ecosystem are detected, triaged, contained, resolved, and reviewed. It ensures consistent, timely response that minimizes damage and preserves evidence for analysis.
2. Scope
This plan covers incidents affecting:
- tomo SDK — compromised PyPI package, supply chain attacks, malicious contributions
- tomo Docker image — compromised container, unauthorized image pushes
- tomo hosted service — unauthorized data access, service disruption, data breach
- Infrastructure — dweezil server compromise, CI/CD pipeline abuse, credential theft
- Third-party services — breaches at Docker Hub, PyPI, GitHub, Backblaze B2
3. Severity Levels
| Severity | Description | Response Time | Update Frequency | Examples |
|---|---|---|---|---|
| P1 — Critical | Active exploitation, data breach, complete service outage | 30 minutes | Every 2 hours | Compromised PyPI package, database exfiltration, server root compromise |
| P2 — High | Confirmed vulnerability under active threat, partial outage | 4 hours | Every 8 hours | Exploitable CVE in production dependency, unauthorized CI credential use |
| P3 — Medium | Vulnerability identified but not actively exploited | 24 hours | Daily | Dependency CVE with no known exploit, misconfigured firewall rule |
| P4 — Low | Minor security issue, informational | 72 hours | As needed | Failed brute-force attempts, stale user account, policy deviation |
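The response and update targets above can be encoded so that tooling can compute deadlines from a detection time. The following is a minimal sketch, not part of the plan's existing tooling: the names `Targets`, `SEVERITY_TARGETS`, and `response_deadline` are hypothetical, and the P4 update interval is `None` because the table only says "As needed".

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional


class Targets(NamedTuple):
    response: timedelta          # maximum time to first response
    update: Optional[timedelta]  # interval between status updates; None = as needed


# Values copied from the severity table; the structure itself is an assumption.
SEVERITY_TARGETS = {
    "P1": Targets(timedelta(minutes=30), timedelta(hours=2)),
    "P2": Targets(timedelta(hours=4), timedelta(hours=8)),
    "P3": Targets(timedelta(hours=24), timedelta(days=1)),
    "P4": Targets(timedelta(hours=72), None),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the latest acceptable time for the first response."""
    return detected_at + SEVERITY_TARGETS[severity].response


detected = datetime(2026, 3, 2, 12, 0, tzinfo=timezone.utc)
print(response_deadline("P1", detected).isoformat())
```

Keeping timestamps timezone-aware (UTC here) avoids ambiguity when deadlines are compared across systems.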
4. Roles and Responsibilities
| Role | Responsibility | Current Holder |
|---|---|---|
| Incident Commander (IC) | Owns the incident lifecycle, makes containment decisions, coordinates communication | Greg Felice |
| Technical Lead | Performs investigation, implements containment and remediation | Greg Felice |
| Communications Lead | Drafts external notifications, updates stakeholders | Greg Felice |
As the team grows, these roles will be distributed. Until then, the Project Lead fills all roles.
5. Detection Sources
| Source | What It Detects | Alert Method |
|---|---|---|
| Grafana / Prometheus | Service health, anomalous database connections, disk/CPU spikes | Grafana alerting (email, webhook) |
| PostgreSQL audit logs | Unauthorized queries, failed auth, DDL changes | Log review, Grafana dashboard |
| Woodpecker CI | SAST findings (bandit), dependency CVEs (pip-audit), container vulnerabilities (trivy), leaked secrets (trufflehog) | Pipeline failure notification |
| Authentik | Failed login attempts, MFA bypass attempts, unusual session activity | Authentik event logs |
| Systemd journal | SSH access, sudo usage, service crashes | Log review |
| External reports | Vulnerability reports via SECURITY.md process | Email (security@rizlabs.com) |
| Uptime monitoring | Service availability | Alerting (webhook) |
6. Incident Response Process
Phase 1: Detection and Reporting
- Incident detected via monitoring alert, CI failure, log review, or external report
- Document initial observations: what, when, how detected, affected systems
- Open incident record (Forgejo issue with `security-incident` label, or offline log if Forgejo is compromised)
- Assign severity level (P1-P4)
Phase 2: Triage
- Confirm the incident is real (not a false positive)
- Identify affected systems, data, and users
- Determine if the incident is ongoing or historical
- Reassess severity if initial assessment was incorrect
- Decide on containment strategy (see Phase 3)
Triage decision tree:
- Is data being actively exfiltrated? --> P1, immediate containment
- Is a published artifact (PyPI, Docker Hub) compromised? --> P1, immediate yank/retag
- Is infrastructure access compromised? --> P1/P2, rotate credentials immediately
- Is a vulnerability confirmed but not exploited? --> P2/P3, plan remediation
- Is this informational only? --> P4, document and schedule fix
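The decision tree can be read as an ordered series of checks where the first "yes" wins. The sketch below expresses that reading in code; the `Observations` class and `triage` function are hypothetical names, and treating P4 as the fallback when no earlier question applies is an assumption about how the last branch is meant to work.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Yes/no answers to the triage questions, in the order they are asked."""
    active_exfiltration: bool = False
    published_artifact_compromised: bool = False
    infrastructure_access_compromised: bool = False
    vulnerability_confirmed: bool = False


def triage(obs: Observations) -> tuple[str, str]:
    """Return (severity, first action) for the first question answered yes."""
    if obs.active_exfiltration:
        return "P1", "immediate containment"
    if obs.published_artifact_compromised:
        return "P1", "immediate yank/retag"
    if obs.infrastructure_access_compromised:
        return "P1/P2", "rotate credentials immediately"
    if obs.vulnerability_confirmed:
        return "P2/P3", "plan remediation"
    # Assumed fallback: nothing above applies, so treat as informational.
    return "P4", "document and schedule fix"


print(triage(Observations(published_artifact_compromised=True)))
```

Where the tree gives a range such as P1/P2, the function returns the range unchanged; choosing within it remains a judgment call for the Incident Commander.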
Phase 3: Containment
Short-term containment (stop the bleeding):
| Scenario | Action |
|---|---|
| Compromised server | Isolate network (firewall deny-all), preserve disk state |
| Compromised PyPI package | Yank the compromised version from PyPI; advise users to pin a known-good release (`pip install tomo-sdk==<safe-version>`) |
| Compromised Docker image | Remove tag from Docker Hub, push known-good image |
| Credential theft | Rotate all affected credentials immediately |
| Database breach | Revoke compromised roles, add `reject` rules to `pg_hba.conf` |
| CI pipeline abuse | Disable Woodpecker pipelines, rotate CI secrets |
Long-term containment (prevent recurrence while preserving evidence):
- Rebuild affected systems from known-good state if necessary
- Apply patches or configuration changes
- Enhance monitoring for the attack vector
- Do not destroy forensic evidence (preserve logs, disk snapshots)
Phase 4: Eradication
- Identify root cause
- Remove attacker access (accounts, backdoors, malware)
- Patch the vulnerability that was exploited
- Verify no persistence mechanisms remain
- Scan for indicators of compromise (IoCs) across all systems
Phase 5: Recovery
- Restore services from known-good backups if needed (see Backup and Recovery Policy)
- Verify system integrity before returning to production
- Monitor closely for 72 hours after recovery
- Confirm all rotated credentials are propagated to dependent systems
Phase 6: Post-Incident Review
- Conduct review within 5 business days of incident closure
- Use the post-incident review template (Section 9)
- Document lessons learned and action items
- Update policies, runbooks, and monitoring as needed
- Store review in `docs/security/incident-reviews/YYYY-MM-DD-title.md`
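A small helper can keep review filenames consistent with the `YYYY-MM-DD-title.md` pattern. This is a sketch under assumptions: the function name `review_path` is hypothetical, the plan does not say which date goes in the filename (the review date is assumed here), and the lowercase hyphenated slug is one possible convention for the title part.

```python
import re
from datetime import date
from pathlib import PurePosixPath

REVIEW_DIR = PurePosixPath("docs/security/incident-reviews")


def review_path(review_date: date, title: str) -> PurePosixPath:
    """Build docs/security/incident-reviews/YYYY-MM-DD-title.md."""
    # Lowercase the title and collapse each non-alphanumeric run into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return REVIEW_DIR / f"{review_date.isoformat()}-{slug}.md"


print(review_path(date(2026, 3, 2), "Compromised PyPI token"))
```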
7. Escalation Contacts
| Priority | Contact | Method | Timeframe |
|---|---|---|---|
| P1 | Greg Felice | Phone, Signal | Immediate, 24/7 |
| P2 | Greg Felice | Email, Signal | Within 4 hours |
| P3-P4 | Greg Felice | Email, Forgejo issue | Next business day |
External escalation (if required):
| Entity | When | Contact |
|---|---|---|
| PyPI Security | Compromised package | security@pypi.org |
| Docker Hub Security | Compromised image | security@docker.com |
| GitHub Security | Repository compromise | Via GitHub support |
| Upstream Apache AGE | Vulnerability in AGE | security@apache.org |
| Law enforcement | Criminal activity, data breach with legal reporting obligation | Local authorities |
8. Communication Templates
8.1 Internal Incident Declaration
Subject: [P{severity}] Security Incident — {brief description}
Incident ID: INC-YYYY-NNN
Severity: P{1-4}
Detected: {timestamp}
Affected Systems: {list}
Current Status: {Investigating | Containing | Remediating | Resolved}
Summary:
{What happened, what we know so far}
Immediate Actions Taken:
{What has been done}
Next Steps:
{What will be done next}
Incident Commander: Greg Felice
8.2 External User Notification (P1/P2 — Data Breach or Compromised Artifact)
Subject: Security Notice — tomo {SDK | Docker Image | Hosted Service}
We are writing to inform you of a security incident affecting {component}.
What happened:
{Clear, factual description}
What data/systems were affected:
{Specific scope}
What we have done:
{Containment and remediation actions}
What you should do:
{User actions: update version, rotate credentials, etc.}
Timeline:
- {timestamp}: Incident detected
- {timestamp}: Containment completed
- {timestamp}: Remediation deployed
We will provide updates as our investigation continues. If you have questions, contact security@rizlabs.com.
Greg Felice
tomo Project Lead
8.3 Public Advisory (for SDK/Docker supply chain incidents)
Subject: [SECURITY] tomo {version} — {CVE ID if applicable}
Affected versions: {version range}
Fixed version: {version}
Severity: {Critical | High | Medium | Low}
Description:
{Technical description of the vulnerability}
Impact:
{What an attacker could do}
Mitigation:
{Steps to fix: upgrade command, workaround}
Credit:
{Reporter, if they wish to be credited}
References:
- {CVE link}
- {Related advisory links}
9. Post-Incident Review Template
# Post-Incident Review: INC-YYYY-NNN
**Date of Review:** YYYY-MM-DD
**Incident Commander:** {name}
**Participants:** {names}
## Incident Summary
- **Severity:** P{1-4}
- **Duration:** {detection to resolution}
- **Affected Systems:** {list}
- **User Impact:** {description}
## Timeline
| Time (UTC) | Event |
|------------|-------|
| {timestamp} | {event} |
## Root Cause
{What caused the incident}
## What Went Well
- {item}
## What Could Be Improved
- {item}
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| {action} | {name} | {date} | {Open/Done} |
## Metrics
- **Time to detect:** {duration}
- **Time to contain:** {duration}
- **Time to resolve:** {duration}
- **Data exposed:** {scope or "None"}
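The duration fields under Metrics in the template above can be derived from four timestamps on the incident timeline. The sketch below is illustrative only: `incident_metrics` is a hypothetical name, and the definitions are assumptions (time to detect runs from the estimated start of the incident to detection; time to contain and time to resolve both run from detection, matching the template's "detection to resolution" duration).

```python
from datetime import datetime, timedelta, timezone


def incident_metrics(
    started: datetime,
    detected: datetime,
    contained: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Return the three duration metrics from the review template."""
    return {
        "time_to_detect": detected - started,     # assumed: start to detection
        "time_to_contain": contained - detected,  # assumed: detection to containment
        "time_to_resolve": resolved - detected,   # detection to resolution
    }


utc = timezone.utc
metrics = incident_metrics(
    started=datetime(2026, 3, 2, 9, 0, tzinfo=utc),
    detected=datetime(2026, 3, 2, 9, 45, tzinfo=utc),
    contained=datetime(2026, 3, 2, 11, 0, tzinfo=utc),
    resolved=datetime(2026, 3, 3, 9, 45, tzinfo=utc),
)
for name, value in metrics.items():
    print(f"{name}: {value}")
```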
10. Evidence Preservation
During any P1 or P2 incident:
- Do not reboot affected systems before preserving volatile evidence
- Capture full disk snapshot (LVM snapshot or filesystem-level copy)
- Export all relevant logs to a separate, secure location
- Record network connections (`ss -tunapl`, `iptables -L -n`)
- Capture running processes (`ps auxf`, `/proc` state)
- Timestamp and hash all evidence files (SHA-256)
- Maintain chain of custody documentation
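The timestamp-and-hash step can be scripted so that every collected file gets a recorded SHA-256 digest. The following is a minimal sketch using only the Python standard library; the function names and the manifest fields (`file`, `bytes`, `sha256`, `hashed_at`) are assumptions for illustration, not a format this plan prescribes.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir: Path) -> list[dict]:
    """Record relative path, size, SHA-256, and hashing time (UTC) per file."""
    entries = []
    for path in sorted(p for p in evidence_dir.rglob("*") if p.is_file()):
        entries.append({
            "file": str(path.relative_to(evidence_dir)),
            "bytes": path.stat().st_size,
            "sha256": sha256_file(path),
            "hashed_at": datetime.now(timezone.utc).isoformat(),
        })
    return entries
```

The manifest is best written outside the evidence directory, for example with `json.dumps(build_manifest(Path("/path/to/evidence")), indent=2)`, so that saving it does not alter the set of files being hashed. The path shown is a placeholder.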
11. Testing
| Test | Frequency | Method |
|---|---|---|
| Tabletop exercise | Annually | Walk through a P1 scenario with all role holders |
| Detection validation | Quarterly | Trigger test alerts and verify notification delivery |
| Runbook review | Semi-annually | Review and update all response procedures |
| Communication test | Annually | Send test notification through all escalation channels |
12. Compliance Mapping
| SOC 2 Criteria | Control |
|---|---|
| CC7.2 | Monitoring of system components for anomalies |
| CC7.3 | Evaluation of identified security events |
| CC7.4 | Incident response and containment |
| CC7.5 | Communication of incidents to affected parties |
| CC2.3 | Internal communication of security matters |