---
name: runbook-creation
description: Create operational runbooks, playbooks, standard operating procedures (SOPs), and incident response guides. Use when documenting operational procedures, on-call guides, or incident response processes.
---
# Runbook Creation

## Overview

Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.

## When to Use
- Incident response procedures
- Standard operating procedures (SOPs)
- On-call playbooks
- System maintenance guides
- Disaster recovery procedures
- Deployment runbooks
- Escalation procedures
- Service restoration guides
## Incident Response Runbook Template
# Incident Response Runbook

## Quick Reference

**Severity Levels:**

- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact

**Response Times:**

- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day

**Escalation Contacts:**

- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102

## Table of Contents

1. [Service Down](#service-down)
2. [Database Issues](#database-issues)
3. [High CPU/Memory Usage](#high-cpu-memory-usage)
4. [API Performance Degradation](#api-performance-degradation)
5. [Security Incidents](#security-incidents)
6. [Data Loss Recovery](#data-loss-recovery)
7. [Rollback Procedures](#rollback-procedures)

---

## Service Down

### Symptoms

- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`

### Severity: P0 (Critical)

### Initial Response (5 minutes)

1. **Acknowledge the incident**

   ```bash
   # Acknowledge in PagerDuty
   # Post in #incidents Slack channel
   ```

2. **Create incident channel** (a scripted sketch follows this list)

   - Create Slack channel: #incident-YYYY-MM-DD-service-down
   - Post incident details and status updates

3. **Assess impact**

   ```bash
   # Check service status
   kubectl get pods -n production

   # Check recent deployments
   kubectl rollout history deployment/api -n production

   # Check logs
   kubectl logs -f deployment/api -n production --tail=100
   ```
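If the team scripts channel creation, step 2 can be automated with the Slack Web API's `conversations.create` method. A minimal sketch, assuming a bot token exported as `$SLACK_TOKEN`; the token variable and naming scheme are assumptions, not part of the original runbook:

```bash
# Create the dated incident channel programmatically
# (assumes $SLACK_TOKEN holds a bot token with the channels:manage scope)
curl -s -X POST https://slack.com/api/conversations.create \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  --data "name=incident-$(date +%Y-%m-%d)-service-down"
```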
### Investigation Steps

#### Check Application Health
```bash
# 1. Check pod status
kubectl get pods -n production -l app=api

# Expected output: All pods Running
# NAME                   READY   STATUS    RESTARTS   AGE
# api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
# api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h

# 2. Check pod logs for errors
kubectl logs -f deployment/api -n production --tail=100 | grep -i error

# 3. Check application endpoints
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status

# 4. Check database connectivity (run psql from inside a pod)
kubectl exec -it deployment/api -n production -- sh
psql $DATABASE_URL -c "SELECT 1"
```
#### Check Infrastructure
```bash
# 1. Check load balancer target health (target groups use the elbv2 API)
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# 2. Check DNS resolution
dig api.example.com
nslookup api.example.com

# 3. Check SSL certificates
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 4. Check network connectivity to the database from a pod
# (telnet:// makes curl do a plain TCP connect test)
kubectl exec -it deployment/api -n production -- \
  curl -v telnet://database.example.com:5432
```
#### Check Database
```bash
# 1. Check database connections
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"

# 2. Check for locks
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"

# 3. Check database size
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"

# 4. Check long-running queries
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```
### Resolution Steps

#### Option 1: Restart Pods (Quick Fix)
```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```
#### Option 2: Scale Up (If Overloaded)
```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```
#### Option 3: Rollback (If Bad Deploy)
```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```
#### Option 4: Database Connection Reset
```bash
# If the database connection pool is exhausted, reload the app process
kubectl exec -it deployment/api -n production -- sh
kill -HUP 1  # Run inside the pod: reload process, reset connections

# Or terminate the app's idle connections directly in Postgres
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'api' AND state = 'idle'"
```
### Verification
```bash
# 1. Check health endpoint
curl https://api.example.com/health
# Expected: {"status": "healthy"}

# 2. Check API endpoints
curl https://api.example.com/api/v1/users
# Expected: Valid JSON response

# 3. Check metrics
# Visit https://grafana.example.com
# Verify:
# - Error rate < 1%
# - Response time < 500ms
# - All pods healthy

# 4. Check logs for errors
kubectl logs deployment/api -n production --tail=100 | grep -i error
# Expected: No new errors
```
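Rather than re-running the health check by hand, a short polling loop can confirm recovery. A minimal sketch, assuming the `/health` endpoint returns `{"status": "healthy"}` as shown above:

```bash
# Poll the health endpoint every 10s until it reports healthy (give up after ~5 minutes)
for i in $(seq 1 30); do
  if curl -s https://api.example.com/health | grep -q '"status": *"healthy"'; then
    echo "Service healthy after ~$((i * 10))s"
    break
  fi
  echo "Attempt $i: not healthy yet, retrying in 10s..."
  sleep 10
done
```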
### Communication

**Initial Update (within 5 minutes):**

```
🚨 INCIDENT: Service Down
Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.

Next update in 15 minutes.
```
**Progress Update (every 15 minutes):**

```
🚧 UPDATE: Service Down
Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
```
**Resolution Update:**

```
✅ RESOLVED: Service Down
Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely
and will conduct a post-mortem to prevent future occurrences.
```
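Where updates are posted automatically, the same templates can be sent with the Slack Web API's `chat.postMessage` method. A hedged sketch reusing the `$SLACK_TOKEN` assumption from earlier; the channel name is illustrative:

```bash
# Post a templated status update into the incident channel
curl -s -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel": "#incident-2025-01-15-service-down",
    "text": "✅ RESOLVED: Service Down\nDuration: 23 minutes\nResolution: Restarted application pods, reset database connections"
  }'
```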
### Post-Incident

1. **Create post-mortem document** (a scaffold sketch follows this list)
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence

2. **Update monitoring**
   - Add alerts for this scenario
   - Improve detection time

3. **Update runbook**
   - Document any new findings
   - Add shortcuts for faster resolution
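To make step 1 frictionless, a tiny helper can stamp out the post-mortem skeleton. A sketch with a hypothetical file name and layout; adjust the sections to your team's post-mortem format:

```bash
# Hypothetical helper: scaffold a dated post-mortem document
cat > "postmortem-$(date +%F)-service-down.md" <<'EOF'
# Post-Mortem: Service Down

## Timeline of Events
- 14:30 UTC — ...

## Root Cause Analysis
...

## Action Items
- [ ] ...
EOF
```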
## Database Issues

### High Connection Count

**Symptoms:**

- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`

**Quick Fix:**
```bash
# 1. Check connection count
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"

# 2. Kill idle connections
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND query_start < now() - interval '10 minutes'
"

# 3. Restart connection pools
kubectl rollout restart deployment/api -n production
```
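It also helps to see how close the database is to its configured ceiling. A small check against standard Postgres settings:

```bash
# Compare current connection usage against the configured limit
psql $DATABASE_URL -c "
  SELECT count(*) AS in_use,
         (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_allowed
  FROM pg_stat_activity
"
```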
### Slow Queries

**Symptoms:**

- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`

**Investigation:**
```sql
-- Find slow queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for missing indexes
SELECT schemaname, tablename, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill long-running query (if needed)
SELECT pg_terminate_backend(12345); -- Replace with actual PID
```
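Once a suspect query is identified, its plan shows whether an index is missing. A sketch with hypothetical table and column names (`orders`, `customer_id`):

```bash
# Hypothetical example: inspect the plan for a suspect query
psql $DATABASE_URL -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42"

# If a sequential scan appears where an index lookup is expected,
# add the index without blocking writes
psql $DATABASE_URL -c "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id)"
```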
## High CPU/Memory Usage

### Symptoms

- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`

### Investigation
```bash
# 1. Check pod resources
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Run profiler: node --inspect, py-spy, etc.
```
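To confirm that restarts are memory-related rather than crashes, the pods' last termination reason can be read directly. A small sketch reusing the `app=api` label from earlier sections:

```bash
# List each pod with its last container termination reason (look for OOMKilled)
kubectl get pods -n production -l app=api \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```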
### Resolution
```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```
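For recurring load spikes, an autoscaler beats manual scaling. A one-liner sketch, assuming metrics-server is installed in the cluster:

```bash
# Let Kubernetes scale the deployment on CPU usage instead of scaling by hand
kubectl autoscale deployment/api -n production --min=3 --max=10 --cpu-percent=70
```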
## Rollback Procedures

### Application Rollback
```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Rollback to previous
kubectl rollout undo deployment/api -n production

# 4. Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```
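As a final check, it is worth confirming exactly which image the rollback left running:

```bash
# Print the image the deployment is now running
kubectl get deployment api -n production \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```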
### Database Rollback
```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Rollback last migration
npm run db:migrate:undo

# 3. Rollback to specific migration (note the -- so npm forwards the flag)
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql $DATABASE_URL -c "\dt"
```
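The `npm run db:migrate:*` scripts are project-specific. With sequelize-cli, for example, applied migrations can be cross-checked against its bookkeeping table; the table name below is sequelize-cli's default and is an assumption here:

```bash
# Hypothetical: list the most recently applied migrations (sequelize-cli's default table)
psql $DATABASE_URL -c 'SELECT name FROM "SequelizeMeta" ORDER BY name DESC LIMIT 5'
```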
## Escalation Path

1. **Level 1 - On-call Engineer (You)**
   - Initial response and investigation
   - Attempt standard fixes from the runbook

2. **Level 2 - Senior Engineers**
   - Escalate if not resolved in 30 minutes
   - Escalate if the issue is complex or unclear
   - Contact via PagerDuty or Slack

3. **Level 3 - Engineering Manager**
   - Escalate if not resolved in 1 hour
   - Escalate if cross-team coordination is needed

4. **Level 4 - VP Engineering / CTO**
   - Escalate for P0 incidents lasting > 2 hours
   - Escalate for security breaches
   - Escalate for data loss
## Useful Commands
```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...

# Monitoring URLs
# Grafana: https://grafana.example.com
# Datadog: https://app.datadoghq.com
# PagerDuty: https://example.pagerduty.com
# Status Page: https://status.example.com
```
## Best Practices
### ✅ DO
- Include quick reference section at top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up-to-date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas
### ❌ DON'T
- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents
## Resources
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Google SRE Book](https://sre.google/books/)
- [Atlassian Incident Handbook](https://www.atlassian.com/incident-management/handbook)