Week 2 - Day 2: Operational Excellence Pillar
Learning Objectives
- Understand the design principles of Operational Excellence
- Learn the best practices for operations
- Know which AWS services support this pillar
1. Definition of Operational Excellence
The ability to run and monitor systems to deliver business value, and to continually improve supporting processes and procedures.
4 Focus Areas
- Organization
- Prepare
- Operate
- Evolve
2. Design Principles
1. Perform operations as code
| Before (Manual) | After (Infrastructure as Code) |
|---|---|
| SSH into the server | CloudFormation templates |
| Configure manually | Terraform |
| Document the steps | AWS CDK |
| Prone to human error | Version controlled |
| | Repeatable |
2. Make frequent, small, reversible changes
3. Refine operations procedures frequently
- Regularly review procedures
- Update based on lessons learned
- Conduct game days
4. Anticipate failure
- Pre-mortem exercises
- Test failure scenarios
- Chaos engineering
5. Learn from all operational failures
- Post-incident reviews (PIR)
- Share lessons across teams
- Implement improvements
3. Organization
Team Structure
AWS Services
- AWS Organizations: Manage multiple accounts
- AWS Control Tower: Set up a landing zone
- Service Catalog: Standardized products
4. Prepare
Infrastructure as Code
# CloudFormation Example
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro
      ImageId: ami-0123456789abcdef0
      Tags:
        - Key: Environment
          Value: Production
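The same template can also be generated from code, which makes it easy to produce one consistent definition per environment. The sketch below is only an illustration: `build_template` and its parameters are hypothetical helpers, the AMI ID is the placeholder from the example above, and the output is CloudFormation's JSON form rather than YAML.

```python
import json


def build_template(environment: str, instance_type: str = "t3.micro") -> dict:
    """Return a CloudFormation template (as a dict) for a single web server."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": instance_type,
                    "ImageId": "ami-0123456789abcdef0",  # placeholder AMI ID
                    "Tags": [{"Key": "Environment", "Value": environment}],
                },
            }
        },
    }


if __name__ == "__main__":
    # One definition, rendered once per environment.
    for env in ("Staging", "Production"):
        print(json.dumps(build_template(env), indent=2))
```

Because the template is plain data, it can be reviewed, diffed, and version controlled like any other code.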
Deployment Strategies
| Strategy | Risk | Rollback Time | Use Case |
|---|---|---|---|
| All-at-once | High | Minutes | Dev/Test |
| Rolling | Medium | Minutes | Production |
| Blue/Green | Low | Seconds | Mission Critical |
| Canary | Very Low | Seconds | High-risk changes |
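To make the canary row concrete, here is a minimal simulation of the idea, not of how CodeDeploy implements it. The function name, the traffic steps, and the 1% error threshold are all assumptions chosen for illustration; `observe_error_rate` stands in for whatever monitoring query a real pipeline would run.

```python
from typing import Callable, Sequence


def canary_rollout(
    observe_error_rate: Callable[[int], float],
    steps: Sequence[int] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> dict:
    """Shift traffic to the new version step by step; roll back on bad metrics."""
    history = []
    for percent in steps:
        rate = observe_error_rate(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            # Error budget exceeded: send all traffic back to the old version.
            return {"status": "rolled_back", "final_percent": 0, "history": history}
    return {"status": "completed", "final_percent": steps[-1], "history": history}


if __name__ == "__main__":
    healthy = canary_rollout(lambda percent: 0.001)
    print(healthy["status"])

    # A release that only misbehaves once it takes 25% of traffic or more.
    faulty = canary_rollout(lambda percent: 0.05 if percent >= 25 else 0.001)
    print(faulty["status"], faulty["history"])
```

The faulty release is caught while most users are still on the old version, which is why the table rates canary as the lowest-risk option.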
AWS Services
- CloudFormation: Infrastructure as Code
- AWS CDK: Code-first approach
- CodePipeline: CI/CD
- CodeDeploy: Deployment automation
- Systems Manager: Configuration management
5. Operate
Monitoring Hierarchy
Runbook vs Playbook
| Runbook | Playbook |
|---|---|
| Step-by-step procedures | Decision trees |
| "How to do X" | "When Y happens, do Z" |
| Routine tasks | Incident response |
| Can be automated | May need judgment |
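The distinction in the table can be sketched in code: a runbook is a fixed sequence of steps, while a playbook maps an observed situation to a response (often "run this runbook"). Everything below is hypothetical, including the step descriptions, the symptom names, and the escalation default; in practice the steps would call real automation such as SSM Automation documents.

```python
from typing import Callable, Dict, List


def run_runbook(steps: List[Callable[[], str]]) -> List[str]:
    """A runbook: execute fixed steps in order and record each result."""
    return [step() for step in steps]


def consult_playbook(playbook: Dict[str, str], symptom: str) -> str:
    """A playbook: map an observed symptom to a response, else escalate."""
    return playbook.get(symptom, "escalate to on-call engineer")


# Hypothetical routine task: rotate logs on a host.
rotate_logs_runbook = [
    lambda: "compressed old logs",
    lambda: "deleted logs older than 30 days",
    lambda: "verified free disk space",
]

# Hypothetical incident responses, keyed by symptom.
incident_playbook = {
    "high_cpu": "run runbook: scale out web tier",
    "disk_full": "run runbook: rotate logs",
}

if __name__ == "__main__":
    print(run_runbook(rotate_logs_runbook))
    print(consult_playbook(incident_playbook, "disk_full"))
    print(consult_playbook(incident_playbook, "unknown_error"))
```

The unknown symptom falls through to escalation, which reflects the "may need judgment" row: a playbook cannot anticipate every situation.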
AWS Services for Operations
- CloudWatch: Metrics, Logs, Alarms, Dashboards
- X-Ray: Distributed tracing
- AWS Config: Configuration tracking
- CloudTrail: API logging
- EventBridge: Event routing
- Systems Manager: Operational tasks
6. Evolve
Continuous Improvement Cycle
Metrics to Track
- Deployment frequency: How often deployments happen
- Lead time: Time from commit to production
- MTTR: Mean Time To Recovery
- Change failure rate: Percentage of deployments that fail
Game Days
- Simulate failures in controlled environment
- Test runbooks and procedures
- Build team confidence
- Example: Chaos engineering with AWS Fault Injection Service (FIS, formerly AWS Fault Injection Simulator)
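The core loop of a game day (inject a failure, then check that the recovery procedure copes) can be illustrated locally without any AWS service. The sketch below does not use AWS FIS; `with_fault`, `call_with_retry`, and the failure rate are hypothetical names and values chosen for the example.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the experiment to simulate a dependency failure."""


def with_fault(func: Callable[[], str], failure_rate: float,
               rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that it fails with the given probability."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func()
    return wrapped


def call_with_retry(func: Callable[[], str], attempts: int = 3,
                    fallback: str = "cached response") -> str:
    """Recovery procedure under test: retry, then fall back."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return fallback


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the experiment is repeatable
    flaky = with_fault(lambda: "live response", failure_rate=0.5, rng=rng)
    results = [call_with_retry(flaky) for _ in range(1000)]
    print("live:", results.count("live response"),
          "fallback:", results.count("cached response"))
```

The experiment passes if every call returns either a live or a fallback response and none raises, which is the kind of hypothesis a game day is meant to confirm or refute.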
7. Key AWS Services Summary
| Area | Services |
|---|---|
| Organization | Organizations, Control Tower, Service Catalog |
| Prepare | CloudFormation, CDK, CodePipeline, CodeDeploy |
| Operate | CloudWatch, X-Ray, Config, CloudTrail, SSM |
| Evolve | Well-Architected Tool, Trusted Advisor |
8. Common Exam Questions
Question Type 1: Tool Selection
A company needs to automate deployments with zero downtime. Which service should it use?
- Answer: CodeDeploy with Blue/Green deployment
Question Type 2: Best Practice
How do you ensure consistent configuration across environments?
- Answer: Infrastructure as Code (CloudFormation/CDK)
Question Type 3: Monitoring
You need to trace requests across microservices. Which service should you use?
- Answer: AWS X-Ray
9. Review Questions
- What does "Perform operations as code" mean?
  Answer: All infrastructure and operations are defined as code (Infrastructure as Code) using tools such as CloudFormation, CDK, or Terraform. Operational tasks such as patching, scaling, and incident response are also codified through SSM Automation, Lambda, and EventBridge rules. Benefits: the result is reproducible, version-controlled, testable, auditable, and peer-reviewable. This eliminates manual operations that are prone to human error.
- What is the difference between a Runbook and a Playbook?
  Answer: A runbook is a step-by-step guide for one specific routine operational task (restarting a service, cleaning up disk space, verifying a backup). A playbook is an incident response guide: when an incident occurs (production down, security breach), the playbook guides investigation and remediation. Runbooks cover proactive, planned work; playbooks cover reactive, unplanned work. Both should be automated where possible (for example with SSM Automation).
- What are the advantages of Blue/Green deployment?
  Answer: (1) Zero-downtime deployment: traffic is swapped from Blue (current) to Green (new) in a single cutover. (2) Fast rollback: if Green has problems, traffic can be switched back to Blue within seconds. (3) Green can be tested with a small percentage of real traffic before the full cutover. (4) Production validation without affecting users. AWS support: ALB listener rules, Route 53 weighted routing, CodeDeploy Blue/Green, Elastic Beanstalk. Cost: double the resources are required for the duration of the deployment.
- Which AWS service is used for distributed tracing?
  Answer: AWS X-Ray. You instrument applications to trace requests across microservices, Lambda, API Gateway, and DynamoDB. The X-Ray SDK generates trace IDs, segments, and subsegments. The Service Map visualizes dependencies and bottlenecks. X-Ray also surfaces cold starts, latency percentiles, and error rates per service. It can be combined with CloudWatch Container Insights for ECS/EKS observability. CloudWatch Application Insights can additionally detect anomalies automatically.
- What are Game Days used for?
  Answer: Chaos engineering and disaster simulations in a controlled environment. The team simulates failures (AZ outage, database failure, network partition, high CPU) to test whether: (1) recovery procedures and runbooks work correctly, (2) monitoring and alerts fire as expected, (3) team response time and coordination are adequate, (4) the infrastructure actually recovers as designed. Outcome: gaps are identified before a real production incident occurs. AWS Fault Injection Service (FIS) is the tool for running Game Days.
10. Hands-on Exercises
- Create a simple CloudFormation stack
- Set up a CloudWatch dashboard
- Create a CloudWatch alarm for EC2 CPU utilization
- Explore AWS Systems Manager features
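As a starting point for the alarm exercise, the sketch below builds the parameters for a CPU alarm as plain data. The helper name `cpu_alarm_params`, the 80% threshold, the 5-minute period, and the instance ID are assumptions for illustration; the keys follow the parameters of the boto3 CloudWatch `put_metric_alarm` call, which is shown only in a comment so the sketch runs without AWS credentials.

```python
def cpu_alarm_params(instance_id: str, threshold_percent: float = 80.0) -> dict:
    """Build keyword arguments for a CloudWatch alarm on EC2 CPU utilization."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per datapoint
        "EvaluationPeriods": 2,  # alarm after two consecutive breaching periods
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


if __name__ == "__main__":
    params = cpu_alarm_params("i-0123456789abcdef0")  # placeholder instance ID
    print(params)
    # With boto3 installed and credentials configured, the alarm could then be
    # created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the alarm definition in code means the same thresholds can be applied to every instance and reviewed like any other change.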
Next day: Security Pillar