Question 1

What does "Perform operations as code" mean?

Accepted Answer

All infrastructure and operations are defined as code (Infrastructure as Code) with tools such as CloudFormation, CDK, or Terraform. Operational tasks such as patching, scaling, and incident response are also codified through SSM Automation, Lambda, and EventBridge rules. Benefits: changes are reproducible, version-controlled, testable, auditable, and peer-reviewable. This removes manual operations that are prone to human error.
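To make the idea concrete, here is a minimal sketch in plain Python that builds a CloudFormation template as data and serializes it deterministically, so it can be committed, diffed, and reviewed like any other code. The function names and the logical ID `LogsBucket` are assumptions made for illustration; nothing here calls AWS, and a real project would more likely use CDK or a hand-written template.

```python
import json


def build_bucket_template(bucket_logical_id: str = "LogsBucket") -> dict:
    """Return a minimal CloudFormation template describing one versioned S3 bucket.

    The logical ID is a hypothetical name chosen for this example.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative template: one versioned S3 bucket",
        "Resources": {
            bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize with sorted keys so the output is stable across runs."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_bucket_template()))
```

Because the rendered output is stable, two runs with the same inputs produce identical files, which is what makes the change reproducible and reviewable in a pull request.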

Question 2

What is the difference between a Runbook and a Playbook?

Accepted Answer

Runbook: a step-by-step guide for one specific routine operational task (restarting a service, disk cleanup, backup verification). Playbook: a guide for incident response; when an incident occurs (production down, security breach), the playbook guides investigation and remediation. Runbooks are for proactive, planned work; playbooks are for reactive, unplanned events. Both should be automated where possible (SSM Automation).
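The contrast can be sketched in a few lines of Python: a runbook is an ordered list of steps that always run the same way, while a playbook branches on what the responder observes. All step names, observation keys, and recommended actions below are hypothetical placeholders, not an AWS API; in practice each step would call real tooling such as an SSM Automation document.

```python
from typing import Callable, Dict, List


def run_runbook(steps: List[Callable[[], str]]) -> List[str]:
    """Execute every step in order and return a log of the results."""
    return [step() for step in steps]


# Hypothetical steps for a "disk cleanup" runbook.
def check_disk_usage() -> str:
    return "checked disk usage"


def delete_old_logs() -> str:
    return "deleted old logs"


def verify_free_space() -> str:
    return "verified free space"


DISK_CLEANUP_RUNBOOK = [check_disk_usage, delete_old_logs, verify_free_space]


def production_down_playbook(observations: Dict[str, bool]) -> str:
    """Pick the next action from what has been observed (a small decision tree)."""
    if observations.get("recent_deployment"):
        return "roll back the latest deployment"
    if observations.get("database_unreachable"):
        return "fail over to the standby database"
    # Nothing matched: this is where human judgment takes over.
    return "escalate to the on-call engineer"
```

The final branch of the playbook is deliberate: unlike a runbook, a playbook usually ends in a point where a person has to decide.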

Question 3

What are the advantages of Blue/Green deployment?

Accepted Answer

(1) Zero-downtime deployment: traffic is swapped from Blue (current) to Green (new) in a single cutover, (2) Instant rollback: if Green has problems, traffic switches back to Blue within seconds, (3) Green can be tested with a small percentage of real traffic before the full cutover, (4) Production validation without affecting users. AWS support: ALB listener rules, Route 53 weighted routing, CodeDeploy Blue/Green, Elastic Beanstalk. Cost: double the resources are needed for the duration of the deployment.
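The traffic-shifting logic behind points (1) to (3) can be modeled as a pair of weights that always sum to 100. The sketch below is a pure-Python model, not a call into ALB or Route 53; the class and method names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrafficSplit:
    """Percentage of requests sent to each environment; the two always sum to 100."""
    blue: int = 100
    green: int = 0

    def shift_to_green(self, percent: int) -> None:
        """Send `percent` of traffic to Green and the remainder to Blue."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.green = percent
        self.blue = 100 - percent

    def cut_over(self) -> None:
        """Full cutover: all traffic goes to Green."""
        self.shift_to_green(100)

    def roll_back(self) -> None:
        """Rollback: all traffic returns to Blue, which is still running."""
        self.shift_to_green(0)


if __name__ == "__main__":
    split = TrafficSplit()
    split.shift_to_green(10)   # validate Green with a small share of real traffic
    split.cut_over()           # promote Green
    split.roll_back()          # Green misbehaves: return to Blue
    print(split)
```

Rollback is cheap here only because Blue is kept running during the deployment, which is also where the doubled resource cost comes from.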

Question 4

Which AWS service is used for distributed tracing?

Accepted Answer

AWS X-Ray: instrument applications to trace requests across microservices, Lambda, API Gateway, and DynamoDB. The X-Ray SDK generates trace IDs, segments, and subsegments. The Service Map visualizes dependencies and bottlenecks, and X-Ray surfaces cold starts, latency percentiles, and error rates per service. Combine it with CloudWatch Container Insights for ECS/EKS observability. CloudWatch Application Insights can also detect anomalies automatically.
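To show what the SDK produces, the following sketch builds a trace ID in the X-Ray layout (version `1`, eight hex digits of epoch seconds, 24 random hex digits) and a simplified segment with one subsegment. It is hand-rolled for illustration and does not use the X-Ray SDK; the helper names are assumptions, and real segment documents carry more fields than shown here.

```python
import os
import time
from typing import Optional


def new_trace_id(now: Optional[float] = None) -> str:
    """Return an ID shaped like 1-<8 hex of epoch seconds>-<24 random hex digits>."""
    epoch = int(time.time() if now is None else now)
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_subsegment(name: str, start: float, end: float) -> dict:
    """A unit of work inside a segment, e.g. one downstream call."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "start_time": start,
        "end_time": end,
    }


def new_segment(name: str, trace_id: str, start: float, end: float) -> dict:
    """The work one service did for one request, tied to the request by trace_id."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": trace_id,
        "start_time": start,
        "end_time": end,
        "subsegments": [],
    }


if __name__ == "__main__":
    trace_id = new_trace_id()
    segment = new_segment("checkout-service", trace_id, 0.0, 0.25)
    segment["subsegments"].append(new_subsegment("DynamoDB", 0.05, 0.20))
    print(segment)
```

Every service that handles the request reuses the same trace ID, which is what lets the Service Map stitch the individual segments into one end-to-end view.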

Question 5

What are Game Days used for?

Accepted Answer

Chaos engineering and disaster simulations in a controlled environment. The team simulates failures (AZ outage, database failure, network partition, high CPU) to test: (1) whether recovery procedures and runbooks work correctly, (2) whether monitoring and alerts fire correctly, (3) team response time and coordination, (4) whether the infrastructure actually recovers as designed. Outcome: gaps are identified before a real production incident occurs. AWS Fault Injection Service (FIS) is the AWS tool for running Game Days.
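The shape of a Game Day experiment (inject a fault, then check that alerts fire and the system keeps serving) can be sketched against a toy model. Everything below is a simulation in plain Python under that assumption; the class, its fields, and the alert text are hypothetical and do not represent the FIS API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimulatedService:
    """Toy stand-in for a service with a primary node and a standby node."""
    primary_healthy: bool = True
    standby_healthy: bool = True
    alerts: List[str] = field(default_factory=list)

    def inject_primary_failure(self) -> None:
        """The fault under test: the primary node goes down."""
        self.primary_healthy = False

    def monitor(self) -> None:
        """One monitoring pass; records an alert if the primary is down."""
        if not self.primary_healthy:
            self.alerts.append("primary unhealthy")

    def serving(self) -> bool:
        """Requests succeed as long as at least one node is healthy."""
        return self.primary_healthy or self.standby_healthy


def run_game_day(service: SimulatedService) -> dict:
    """Inject the fault, then report whether alerting and recovery behaved as designed."""
    service.inject_primary_failure()
    service.monitor()
    return {
        "alert_fired": "primary unhealthy" in service.alerts,
        "still_serving": service.serving(),
    }


if __name__ == "__main__":
    print(run_game_day(SimulatedService()))
    # A gap the exercise is meant to expose: no healthy standby to fail over to.
    print(run_game_day(SimulatedService(standby_healthy=False)))
```

The second run is the interesting one: the alert fires but the service stops serving, which is exactly the kind of gap a Game Day should surface before a real incident does.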

Strategy	Risk	Rollback Time	Use Case
All-at-once	High	Minutes	Dev/Test
Rolling	Medium	Minutes	Production
Blue/Green	Low	Seconds	Mission Critical
Canary	Very Low	Seconds	High-risk changes

Runbook	Playbook
Step-by-step procedures	Decision trees
"How to do X"	"When Y happens, do Z"
Routine tasks	Incident response
Can be automated	May need judgment

Area	Services
Organization	Organizations, Control Tower, Service Catalog
Prepare	CloudFormation, CDK, CodePipeline, CodeDeploy
Operate	CloudWatch, X-Ray, Config, CloudTrail, SSM
Evolve	Well-Architected Tool, Trusted Advisor

Week 2 - Day 2: Operational Excellence Pillar

Learning Objectives

1. Definition of Operational Excellence

5 Focus Areas

2. Design Principles

1. Perform operations as code

2. Make frequent, small, reversible changes

3. Refine operations procedures frequently

4. Anticipate failure

5. Learn from all operational failures

3. Organization

Team Structure

AWS Services

4. Prepare

Infrastructure as Code

Deployment Strategies

AWS Services

5. Operate

Monitoring Hierarchy

Runbook vs Playbook

AWS Services for Operations

6. Evolve

Continuous Improvement Cycle

Metrics to Track

Game Days

7. Key AWS Services Summary

8. Common Exam Questions

Question Type 1: Tool Selection

Question Type 2: Best Practice

Question Type 3: Monitoring

9. Review Questions

10. Hands-on Exercises

Official References