systems-architect

✓Verified·Scanned 2/18/2026

Design infrastructure, networks, and cloud systems with integration, reliability, and security patterns.

from clawhub.ai·v09bf12c·4.3 KB·0 installs

Scanned from 1.0.0 at 09bf12c · Transparency log ↗

$ vett add clawhub.ai/ivangdavila/systems-architect

Systems Architecture Rules

Infrastructure Design

Design for failure at every layer — hardware fails, networks partition, regions go down
Redundancy costs money, downtime costs more — calculate acceptable risk
Prefer managed services for undifferentiated work — run less, build more
Infrastructure as code from day one — manual changes drift and break
Immutable infrastructure beats patching — replace, don't repair

Cloud Architecture

Multi-AZ minimum, multi-region for critical systems — availability zones fail together sometimes
Right-size first, auto-scale second — baseline must be correct
Reserved capacity for steady load, spot/preemptible for bursts — cost optimization requires planning
Egress costs add up — keep traffic within regions when possible
Cloud vendor lock-in is real — abstract where escape matters, accept where it doesn't

Networking

Private subnets for workloads, public only for load balancers — minimize attack surface
VPC peering and transit gateways for multi-account — plan topology before scaling
DNS for service discovery — hardcoded IPs break migrations
Zero trust: authenticate and encrypt internal traffic — perimeter security isn't enough
Network segmentation limits blast radius — flat networks let attackers roam

Integration Patterns

APIs for synchronous, queues for asynchronous — match pattern to requirements
Event-driven for loose coupling — producers don't know consumers
Service mesh for complex microservices — observability and security at network layer
Rate limiting and backpressure protect systems — don't let slow consumers crash fast producers
Dead letter queues for failed messages — don't lose data, process later

Reliability

Define SLOs before building — what does "up" mean for this system?
Error budgets allow controlled risk — 99.9% means 8 hours downtime per year is acceptable
Blast radius reduction: cell-based architecture — limit how many users one failure affects
Chaos engineering in staging first — break things intentionally before production breaks accidentally
Runbooks for every alert — 3 AM isn't debugging time

Disaster Recovery

RTO (recovery time) and RPO (data loss) are business decisions — architect for the requirement
Backups aren't recovery until tested — restore regularly
Hot/warm/cold standby each have trade-offs — cost vs speed of recovery
Cross-region replication for critical data — single region is single point of failure
DR drills reveal real problems — plan meets reality

Security

Defense in depth: multiple barriers — one layer will fail
Least privilege for services too — not just users
Secrets management centralized — no secrets in code, config files, or environment variables in images
Audit logging for compliance and forensics — you'll need it after a breach
Patch aggressively — known vulnerabilities are actively exploited

Monitoring and Observability

Metrics, logs, and traces together — each tells part of the story
Alerting on symptoms, not causes — users down matters, CPU high might not
Dashboards for each service with golden signals — latency, traffic, errors, saturation
Distributed tracing across services — follow requests end to end
Log aggregation with retention policy — balance cost and forensic needs

Capacity Planning

Measure current baseline before projecting — can't scale what you don't measure
Load test to find breaking points — theory differs from reality
Capacity leads demand — scaling takes time, be ahead
Cost modeling for growth scenarios — 10x users is rarely 10x cost
Review quarterly at minimum — patterns change

Migration and Evolution

Strangler fig pattern for legacy replacement — route traffic gradually
Blue-green or canary for infrastructure changes — test in production safely
Database migrations are hardest — plan data migration separately
Rollback plans before rollout — assume failure, prepare for it
Communicate maintenance windows — surprises damage trust