
feat: Production readiness — monitoring, alerting, SLOs, runbooks, data retention, pentest, UAT #3

Merged
munisp merged 1 commit into base from devin/1781614437-production-readiness on Jun 16, 2026


@devin-ai-integration

Contributor

Summary

Adds the remaining 8 production readiness items: monitoring stack, alerting rules, SLO definitions, incident response runbooks, data retention policy, authenticated penetration testing, and UAT scenarios. 2,506 lines across 21 files.

New structure:

ops/
├── monitoring/
│   ├── docker-compose.monitoring.yml     # Prometheus + Grafana + Alertmanager stack
│   ├── prometheus/
│   │   ├── prometheus.yml                # Scrape config (API + Go + Rust + Python services)
│   │   └── alerts.yml                    # 20 alert rules in 5 groups
│   ├── alertmanager/alertmanager.yml     # PagerDuty/Opsgenie/Slack routing
│   └── grafana/
│       ├── dashboards/                   # Transfer Ops (14 panels) + Infra (11 panels)
│       └── provisioning/                 # Auto-load datasources + dashboards
├── runbooks/
│   ├── incident-response.md             # SEV1-4 classification + lifecycle
│   ├── ledger-imbalance.md              # Zero-tolerance TigerBeetle balance check
│   ├── stuck-transfers.md               # Decision tree + retry/compensate/failover
│   ├── rail-provider-down.md            # Backup rail matrix + failover commands
│   ├── low-success-rate.md              # Error code → action mapping
│   ├── slow-delivery.md                 # p95 latency investigation
│   └── sanctions-screening-down.md      # Regulatory hold procedure (mandatory)
├── slo/service-level-objectives.yml     # 12 SLOs with error budget policy
└── data-retention/data-retention-policy.yml  # GDPR/NDPR/POPIA/PDPA compliance
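The "zero-tolerance" balance check named for `ledger-imbalance.md` can be illustrated with a minimal Python sketch. It assumes a double-entry ledger in which each account exposes posted debit and credit totals in integer minor units; the function name and the input shape are hypothetical and are not taken from the runbook or from the TigerBeetle client.

```python
from typing import Iterable, Tuple


def ledger_imbalance(accounts: Iterable[Tuple[int, int]]) -> int:
    """Return total debits minus total credits across all accounts.

    Each item is a hypothetical (debits_posted, credits_posted) pair in
    integer minor units. In a double-entry ledger every transfer debits
    one account and credits another by the same amount, so the result
    should always be exactly 0.
    """
    total_debits = 0
    total_credits = 0
    for debits_posted, credits_posted in accounts:
        total_debits += debits_posted
        total_credits += credits_posted
    return total_debits - total_credits


def is_balanced(accounts: Iterable[Tuple[int, int]]) -> bool:
    # Zero tolerance: any non-zero difference counts as an imbalance.
    return ledger_imbalance(accounts) == 0
```

Integer minor units are used so that the comparison with zero is exact; a floating-point sum would need a tolerance, which contradicts the zero-tolerance rule.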

Key design decisions:

  • Alerting routes critical financial alerts (ledger imbalance) to a separate PagerDuty escalation with 0s group wait and 15min repeat — money integrity is treated differently from latency/errors
  • SLOs include an error budget policy: as the remaining budget falls from 75% to 25%, a growing share of work is allocated to reliability; an exhausted budget triggers a production freeze
  • Data retention distinguishes 8 categories with different periods: transactions 7yr (CBN), SARs 10yr (FATF), sessions 24hr, analytics 3yr (anonymized)
  • Sanctions screening runbook mandates transfer hold (not bypass) — regulatory requirement
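The routing decision described above can be sketched in Python as a plain function. This is an illustration of the logic only, not the contents of `alertmanager.yml`: the receiver names and the `severity`/`group` labels are assumptions, and timings are filled in only where the summary states them (0s group wait and 15min repeat for critical financial alerts).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    receiver: str                             # hypothetical receiver name
    group_wait_s: Optional[int] = None        # None = not stated in the PR summary
    repeat_interval_s: Optional[int] = None   # None = not stated in the PR summary


def route_alert(severity: str, group: str) -> Route:
    """Pick a route from an alert's (assumed) severity and group labels."""
    if severity == "critical" and group == "financial":
        # Money-integrity alerts: separate escalation, no batching delay,
        # re-notify every 15 minutes.
        return Route("pagerduty-financial", group_wait_s=0, repeat_interval_s=15 * 60)
    if severity == "critical":
        return Route("pagerduty")
    if severity == "warning":
        return Route("opsgenie")
    # Everything else is treated as informational.
    return Route("slack")
```

For example, `route_alert("critical", "financial")` selects the dedicated escalation, while `route_alert("warning", "infra")` goes to Opsgenie.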

QA additions:

  • qa/security/pentest-authenticated.sh — BOLA, privilege escalation, rate limiting, SSRF, XSS, financial limit bypass tests
  • qa/uat/uat-scenarios.sh — 5 stakeholder journeys (diaspora worker, merchant, employer, DeFi user, agent)
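As an illustration of what a BOLA (broken object level authorization) test asserts, here is a self-contained Python sketch against an in-memory stand-in for the API. Everything in it is hypothetical: the `Transfer` type, the `get_transfer` handler, and the choice of returning 404 for objects owned by someone else. The actual `pentest-authenticated.sh` script is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Transfer:
    transfer_id: str
    owner_id: str
    amount_minor: int


# Hypothetical fixture data: one transfer per user.
TRANSFERS: Dict[str, Transfer] = {
    "t-alice": Transfer("t-alice", "alice", 10_000),
    "t-bob": Transfer("t-bob", "bob", 2_500),
}


def get_transfer(requesting_user: str, transfer_id: str) -> Tuple[int, Optional[Transfer]]:
    """Stand-in handler that checks ownership before returning the object."""
    transfer = TRANSFERS.get(transfer_id)
    if transfer is None or transfer.owner_id != requesting_user:
        # Same response for "missing" and "not yours" to avoid leaking existence.
        return 404, None
    return 200, transfer


def bola_probe_blocked(attacker: str, victim_transfer_id: str) -> bool:
    """True when the attacker is denied access to another user's transfer."""
    status, body = get_transfer(attacker, victim_transfer_id)
    return status in (403, 404) and body is None
```

The probe passes only if an authenticated user who guesses another user's object identifier receives neither the object nor a success status.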

Link to Devin session: https://app.devin.ai/sessions/64d054ae77da41e9a2b74d8593fa635c
Requested by: @munisp

…ta retention, pentest, UAT

Monitoring & Alerting:
- Grafana dashboards: Transfer Operations (14 panels) + Infrastructure (11 panels)
- Prometheus alerting: 20 rules across 5 groups (financial, SLA, infra, compliance, settlement)
- Alertmanager config: PagerDuty (critical), Opsgenie (warning), Slack (info)
- Docker Compose monitoring stack (Prometheus + Grafana + Alertmanager)

SLO Definitions:
- 12 SLOs: fund delivery 99.9%, API availability 99.95%, ledger integrity 100%
- Settlement latency targets per rail (M-Pesa 10s, NIBSS 30s, SEPA 4h, SWIFT 48h)
- Error budget policy with escalation levels (25%/50%/75%/100% consumed)
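The error budget arithmetic behind these targets can be sketched in Python. The thresholds (25%, 50%, 75%, 100% consumed) come from the list above; the function names, the event-count formulation, and the numbering of levels from 0 to 4 are assumptions made for illustration and may differ from `service-level-objectives.yml`.

```python
from fractions import Fraction

# Fractions of the error budget consumed at which the policy escalates.
CONSUMED_THRESHOLDS = (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1))


def error_budget(target: Fraction, total_events: int) -> Fraction:
    """Number of failed events the SLO tolerates over the window."""
    return (1 - target) * total_events


def escalation_level(target: Fraction, total_events: int, failed_events: int) -> int:
    """Return 0..4: how many consumption thresholds have been reached."""
    budget = error_budget(target, total_events)
    if budget == 0:
        # A 100% target (e.g. ledger integrity) has no budget at all:
        # the first failure exhausts it.
        return len(CONSUMED_THRESHOLDS) if failed_events > 0 else 0
    consumed = Fraction(failed_events) / budget
    return sum(1 for threshold in CONSUMED_THRESHOLDS if consumed >= threshold)
```

With the 99.9% fund delivery target and 1,000,000 transfers in the window, the budget is 1,000 failed transfers; 600 failures consume 60% of it, which reaches the 25% and 50% thresholds. Exact fractions are used instead of floats so that a value sitting exactly on a threshold is classified deterministically.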

Incident Response:
- 6 runbooks: ledger imbalance, stuck transfers, rail provider down, slow delivery,
  low success rate, sanctions screening down
- Incident response procedure with severity classification (SEV1-4)
- On-call schedule template and communication templates

Data Retention:
- GDPR/NDPR/POPIA/PDPA compliant retention policy
- 8 data categories with specific retention periods and deletion procedures
- DSAR implementation (right to access, erasure, portability)
- Automated retention jobs (weekly anonymization, monthly archival)
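A retention job ultimately has to decide whether a record has outlived its period. The Python sketch below shows one way to do that for the four periods named in the PR summary (transactions 7yr, SARs 10yr, sessions 24hr, analytics 3yr). The category keys, function names, and the Feb 29 handling are assumptions; the remaining four categories are not listed in the summary and are deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# (unit, amount) per category; only the periods stated in the PR summary.
RETENTION = {
    "transactions": ("years", 7),
    "sars": ("years", 10),
    "sessions": ("hours", 24),
    "analytics": ("years", 3),
}


def add_years(moment: datetime, years: int) -> datetime:
    """Shift by whole calendar years; Feb 29 falls back to Feb 28."""
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:
        return moment.replace(year=moment.year + years, day=28)


def expires_at(category: str, created_at: datetime) -> datetime:
    unit, amount = RETENTION[category]
    if unit == "years":
        return add_years(created_at, amount)
    return created_at + timedelta(hours=amount)


def is_expired(category: str, created_at: datetime, now: datetime) -> bool:
    """True once the retention period for the category has fully elapsed."""
    return now >= expires_at(category, created_at)
```

Calendar-year arithmetic is used rather than multiples of 365 days so that a seven-year period ends on the same calendar date it started, regardless of leap years.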

QA Additions:
- Authenticated penetration test runner (BOLA, privilege escalation, rate limiting)
- UAT scenarios for 5 stakeholder journeys (diaspora worker, merchant, employer, DeFi, agent)

Co-Authored-By: Patrick Munis <pmunis@gmail.com>
@devin-ai-integration

Contributor Author
Original prompt from Patrick

https://drive.google.com/file/d/14K-94cZoOVgiYCUA-VympU-4_8IBqv2d/view?usp=sharing
extract the contents of the archive. List all the features of the platform

@devin-ai-integration

Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@munisp munisp merged commit c7e8a7e into base Jun 16, 2026