Lesson 4: Runbook

Operational Practices

Monitoring tools are only valuable when the team knows which metrics to watch, which alerts matter, who owns each one, and what to do when an incident occurs.

Define the metrics that matter

Start from user impact, not resource utilization alone: latency, error rate, availability, queue age, failed jobs, database connection saturation, and cost anomalies. Then map those signals to resource metrics such as CPU, memory, disk, IOPS, and network.
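As a rough illustration of user-facing metrics, the sketch below derives a 5xx error rate and a latency percentile from a handful of request records. The `Request` type, the field names, and the sample values are hypothetical and exist only for this example; a real system would normally read these numbers from its monitoring backend rather than compute them by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def error_rate(requests):
    """Fraction of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile; pct is expected in the range (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: three fast successes and one slow failure.
sample = [
    Request(200, 40.0),
    Request(200, 55.0),
    Request(500, 900.0),
    Request(200, 60.0),
]

print("error rate:", error_rate(sample))
print("p95 latency (ms):", percentile([r.latency_ms for r in sample], 95))
```

Note that average CPU could look perfectly healthy for this sample while a quarter of the requests are failing, which is why the user-facing numbers come first.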

Actionable alerts

A good alert describes a problem that someone must act on, and it carries an owner, a severity, a dashboard link, and a runbook link. If an alert fires often but requires no action, the team will start to ignore it.

Alert name: API 5xx rate high
Severity: SEV-2
Owner: platform-oncall
Signal: 5xx rate > 2% for 10 minutes
Action: check deployment, ALB target health, app logs, database connections
Runbook: /runbooks/api-5xx.md
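The alert definition above can be sketched as data plus a small evaluation function. This is a minimal sketch, assuming one error-rate sample per minute; `AlertRule` and `should_fire` are hypothetical names, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str
    owner: str
    threshold: float   # fraction, e.g. 0.02 for 2%
    for_minutes: int   # how long the signal must stay above the threshold
    runbook: str


def should_fire(rule, per_minute_rates):
    """True if the most recent `for_minutes` samples all exceed the threshold."""
    if len(per_minute_rates) < rule.for_minutes:
        return False  # not enough data to cover the full window yet
    window = per_minute_rates[-rule.for_minutes:]
    return all(rate > rule.threshold for rate in window)


API_5XX = AlertRule(
    name="API 5xx rate high",
    severity="SEV-2",
    owner="platform-oncall",
    threshold=0.02,
    for_minutes=10,
    runbook="/runbooks/api-5xx.md",
)

# A short spike does not fire; a sustained breach does.
print(should_fire(API_5XX, [0.01] * 8 + [0.05, 0.05]))
print(should_fire(API_5XX, [0.03] * 10))
```

Requiring the whole window to breach the threshold is one way to avoid paging on brief spikes, at the cost of a slower first notification.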

Runbook

A runbook is a repeatable procedure for investigating and resolving an incident. It should include symptoms, dashboards, commands, a rollback path, an escalation contact, and cleanup/verification steps.
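One way to keep runbooks complete is to check each one for the sections listed above before it is linked from an alert. The sketch below assumes a runbook has already been parsed into a dictionary; the section keys and the draft content are hypothetical.

```python
# Section names taken from the list above; the exact keys are an assumption.
REQUIRED_SECTIONS = (
    "symptoms",
    "dashboards",
    "commands",
    "rollback",
    "escalation",
    "verification",
)


def missing_sections(runbook):
    """Return the required sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not runbook.get(name)]


# Hypothetical draft with placeholder content.
draft = {
    "symptoms": ["5xx rate above 2% for 10 minutes"],
    "dashboards": ["API overview dashboard"],
    "commands": ["check the latest deployment", "check target health"],
    "rollback": [],
}

print(missing_sections(draft))
```

A check like this could run in CI so that an incomplete runbook is caught at review time rather than during an incident.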

Log retention and sensitive data

Set retention according to value and compliance requirements, for example 7 days for debug logs, 90 days for application operations logs, and 1 year for audit logs where required.
Do not log passwords, access tokens, secrets, private keys, full credit card numbers, or unnecessary PII.
Use structured logs and correlation/request IDs to trace requests across services.
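The last two points can be illustrated together: emit each log event as JSON, attach a request ID, and redact sensitive fields before they reach the log. This is a minimal sketch using only the Python standard library; the key list, the `format_event` helper, and the sample field values are assumptions for illustration, and the redaction only covers top-level keys.

```python
import json
import logging
import uuid

# Keys treated as sensitive in this sketch; extend to match your own data.
SENSITIVE_KEYS = {"password", "access_token", "secret", "private_key", "card_number"}


def redact(fields):
    """Replace the values of sensitive top-level keys with a placeholder."""
    return {
        key: ("[REDACTED]" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in fields.items()
    }


def format_event(message, request_id, **fields):
    """Build one JSON log line that carries the request ID."""
    record = {"message": message, "request_id": request_id}
    record.update(redact(fields))
    return json.dumps(record, sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("app")

# Generate the ID once per incoming request and pass it to downstream calls.
request_id = str(uuid.uuid4())
log.info(format_event("login attempt", request_id, user_id="u-123", password="hunter2"))
```

Searching the logs of every service for the same `request_id` then reconstructs the path of a single request.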

Common mistakes

Alerting on every metric until alert fatigue sets in
Having no runbook, so on-call has to guess every time
Never testing the alarm path, so nobody knows whether SNS, email, or the incident tool actually delivers
Keeping logs forever without separating debug logs from audit logs

Review questions

Why do latency and error rate usually matter more than CPU alone?
What should a good runbook contain?
What causes alert fatigue?