aws-ecs-monitor
AWS ECS production health monitoring with CloudWatch log analysis — monitors ECS service health, ALB targets, SSL certificates, and provides deep CloudWatch log analysis for error categorization, restart detection, and production alerts.
AWS ECS Monitor
Production health monitoring and log analysis for AWS ECS services.
What It Does
- Health Checks: HTTP probes against your domain, ECS service status (desired vs running), ALB target group health, SSL certificate expiry
- Log Analysis: Pulls CloudWatch logs, categorizes errors (panics, fatals, OOM, timeouts, 5xx), detects container restarts, filters health check noise
- Auto-Diagnosis: Reads health status and automatically investigates failing services via log analysis
Prerequisites
awsCLI configured with appropriate IAM permissions:ecs:ListServices,ecs:DescribeServiceselasticloadbalancing:DescribeTargetGroups,elasticloadbalancing:DescribeTargetHealthlogs:FilterLogEvents,logs:DescribeLogGroups
curlfor HTTP health checkspython3for JSON processing and log analysisopensslfor SSL certificate checks (optional)
Configuration
All configuration is via environment variables:
| Variable | Required | Default | Description |
|---|---|---|---|
ECS_CLUSTER | Yes | — | ECS cluster name |
ECS_REGION | No | us-east-1 | AWS region |
ECS_DOMAIN | No | — | Domain for HTTP/SSL checks (skip if unset) |
ECS_SERVICES | No | auto-detect | Comma-separated service names to monitor |
ECS_HEALTH_STATE | No | ./data/ecs-health.json | Path to write health state JSON |
ECS_HEALTH_OUTDIR | No | ./data/ | Output directory for logs and alerts |
ECS_LOG_PATTERN | No | /ecs/{service} | CloudWatch log group pattern ({service} is replaced) |
ECS_HTTP_ENDPOINTS | No | — | Comma-separated name=url pairs for HTTP probes |
Scripts
scripts/ecs-health.sh — Health Monitor
# Full check
ECS_CLUSTER=my-cluster ECS_DOMAIN=example.com ./scripts/ecs-health.sh
# JSON output only
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --json
# Quiet mode (no alerts, just status file)
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --quiet
Exit codes: 0 = healthy, 1 = unhealthy/degraded, 2 = script error
scripts/cloudwatch-logs.sh — Log Analyzer
# Pull raw logs from a service
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh pull my-api --minutes 30
# Show errors across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh errors all --minutes 120
# Deep analysis with error categorization
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose --minutes 60
# Detect container restarts
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh restarts my-api
# Auto-diagnose from health state file
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh auto-diagnose
# Summary across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh summary --minutes 120
Options: --minutes N (default: 60), --json, --limit N (default: 200), --verbose
Auto-Detection
When ECS_SERVICES is not set, both scripts auto-detect services from the cluster:
aws ecs list-services --cluster $ECS_CLUSTER
Log groups are resolved by pattern (default /ecs/{service}). Override with ECS_LOG_PATTERN:
# If your log groups are /ecs/prod/my-api, /ecs/prod/my-frontend, etc.
ECS_LOG_PATTERN="/ecs/prod/{service}" ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose
Integration
The health monitor can trigger the log analyzer for auto-diagnosis when issues are detected. Set ECS_HEALTH_OUTDIR to a shared directory and run both scripts together:
export ECS_CLUSTER=my-cluster
export ECS_DOMAIN=example.com
export ECS_HEALTH_OUTDIR=./data
# Run health check (auto-triggers log analysis on failure)
./scripts/ecs-health.sh
# Or run log analysis independently
./scripts/cloudwatch-logs.sh auto-diagnose --minutes 30
Error Categories
The log analyzer classifies errors into:
panic— Go panicsfatal— Fatal errorsoom— Out of memorytimeout— Connection/request timeoutsconnection_error— Connection refused/resethttp_5xx— HTTP 500-level responsespython_traceback— Python tracebacksexception— Generic exceptionsauth_error— Permission/authorization failuresstructured_error— JSON-structured error logserror— Generic ERROR-level messages
Health check noise (GET/HEAD /health from ALB) is automatically filtered from error counts and HTTP status distribution.