Week 9 - Day 4: Route 53 Failover Patterns
Learning objectives
- Understand Route 53 Failover routing
- Apply health checks for automatic failover
- Distinguish Active-Passive from Active-Active
- Understand how DNS TTL affects failover speed
1. Failover Routing Recap
Active-Passive
- Primary + Secondary record
- Health check on primary
- If primary unhealthy → traffic to secondary
Architecture
2. Health Checks
3 types
1. Endpoint Health Check
- HTTP/HTTPS/TCP probe
- ~15 AWS health checkers globally
- Interval: 10s (fast) or 30s (standard)
- Failure threshold: 1-10 (default 3)
- Path, port, host configurable
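As a rough illustration of the settings above, the sketch below builds the `HealthCheckConfig` structure that the Route 53 `create_health_check` API accepts. The domain `app.example.com`, the `/health` path, and the helper names are assumptions made for this example; the boto3 call is wrapped in its own function because it needs AWS credentials.

```python
import uuid


def endpoint_health_check_config(domain, path="/health", port=443,
                                 interval=30, failure_threshold=3):
    """Build a HealthCheckConfig dict for an HTTPS endpoint health check."""
    if interval not in (10, 30):
        raise ValueError("interval must be 10 (fast) or 30 (standard) seconds")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure threshold must be between 1 and 10")
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": port,
        "ResourcePath": path,
        "RequestInterval": interval,
        "FailureThreshold": failure_threshold,
    }


def create_health_check(config):
    """Create the health check in Route 53 (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("route53")
    response = client.create_health_check(
        CallerReference=str(uuid.uuid4()),  # any string unique per request
        HealthCheckConfig=config,
    )
    return response["HealthCheck"]["Id"]


# Hypothetical endpoint, fast interval, default threshold of 3.
config = endpoint_health_check_config("app.example.com", interval=10)
```

Calling `create_health_check(config)` would return the health check ID that a failover record later refers to.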
2. Calculated Health Check
- Combine 2-256 child health checks
- AND, OR, NOT logic
- Use case: "Healthy if 3+ regions healthy"
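In the Route 53 API, the AND/OR/NOT combinations are expressed through a health threshold (how many children must be healthy) plus an optional inversion flag. The function below is a local model of that rule, not a call to AWS; the function and variable names are made up for this sketch.

```python
def calculated_status(child_statuses, health_threshold, inverted=False):
    """Model of a calculated health check.

    child_statuses   -- list of booleans, True meaning the child is healthy
    health_threshold -- minimum number of healthy children required
    inverted         -- flip the final result (the NOT case)
    """
    healthy_children = sum(1 for status in child_statuses if status)
    healthy = healthy_children >= health_threshold
    return not healthy if inverted else healthy


regions = [True, True, True, False]  # four regional checks, one failing

at_least_three = calculated_status(regions, health_threshold=3)          # "3+ regions healthy"
all_healthy = calculated_status(regions, health_threshold=len(regions))  # AND
any_healthy = calculated_status(regions, health_threshold=1)             # OR
```

With one of four regions down, the "3+ regions" and OR variants stay healthy while the AND variant does not.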
3. CloudWatch Alarm-based Health Check
- Health status follows the state of a CloudWatch alarm (unhealthy while the alarm is in ALARM)
- Use case: alarm on DynamoDB throttling → DNS failover
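A minimal sketch of the configuration for an alarm-based health check, using the `CLOUDWATCH_METRIC` type of the Route 53 `HealthCheckConfig` structure. The alarm name `dynamodb-orders-throttling` and the region are hypothetical values for illustration.

```python
def alarm_health_check_config(alarm_name, alarm_region,
                              insufficient_data_status="LastKnownStatus"):
    """Build a HealthCheckConfig dict that tracks a CloudWatch alarm."""
    allowed = {"Healthy", "Unhealthy", "LastKnownStatus"}
    if insufficient_data_status not in allowed:
        raise ValueError(f"status must be one of {sorted(allowed)}")
    return {
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {"Region": alarm_region, "Name": alarm_name},
        # What to report while the alarm has too little data to decide.
        "InsufficientDataHealthStatus": insufficient_data_status,
    }


# Hypothetical alarm that fires on DynamoDB throttling.
alarm_config = alarm_health_check_config("dynamodb-orders-throttling", "us-east-1")
```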
Cost
- Endpoint within AWS: first 50 free, then $0.50/check per month
- Endpoint outside AWS: $0.75/check per month
- Calculated: billed at the AWS-endpoint rate
- CloudWatch alarm-based: billed at the AWS-endpoint rate
3. Health Check Best Practices
Endpoint design
- Dedicated /health endpoint
- Returns 200 if healthy, 503 if not
- Check dependencies (DB connection, cache, external API)
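The endpoint design above could look roughly like the following standard-library sketch. The two dependency probes are placeholders that always succeed; a real service would replace them with actual connectivity checks. Port 8080 is an arbitrary choice.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Placeholder probe: replace with a real database connectivity check."""
    return True


def check_cache():
    """Placeholder probe: replace with a real cache connectivity check."""
    return True


DEPENDENCY_CHECKS = {"database": check_database, "cache": check_cache}


def health_response(checks):
    """Run every probe; return (200, body) if all pass, else (503, body)."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a crashing probe counts as unhealthy
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(DEPENDENCY_CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Running `serve()` exposes `/health` on the chosen port. Keeping the probes cheap matters, because the health checkers call the endpoint every 10 or 30 seconds from several locations.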
Frequency
- 10-second interval for critical apps (fast failover)
- 30-second for less critical (lower cost)
Failure threshold
- Higher (e.g., 5): fewer false positives, slower failover
- Lower (e.g., 2): faster failover, more false positives
- Recommended: 3-5
Multi-region health check
- Place health checkers in DIFFERENT regions than target
- Detect region-specific issues
4. DNS TTL and Failover Speed
Why TTL matters
- TTL = how long resolvers cache DNS response
- After failover, resolvers may serve OLD record until TTL expires
- → Effective failover time = health check detection + TTL expiry
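A back-of-the-envelope calculation of that formula, under the simplifying assumption that detection takes roughly interval × failure threshold. Real failover can be somewhat faster or slower, and browser or OS caches may add time on top.

```python
def detection_seconds(request_interval, failure_threshold):
    """Approximate time for a health check to turn unhealthy."""
    return request_interval * failure_threshold


def worst_case_failover_seconds(request_interval, failure_threshold, ttl):
    """Detection time plus the longest a resolver may keep the old answer."""
    return detection_seconds(request_interval, failure_threshold) + ttl


fast = worst_case_failover_seconds(10, 3, 60)       # 10*3 + 60 = 90 seconds
standard = worst_case_failover_seconds(30, 3, 300)  # 30*3 + 300 = 390 seconds
```

In this estimate the TTL dominates: moving from the standard to the fast interval saves 60 seconds, while lowering the TTL from 300s to 60s saves 240 seconds.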
Trade-off
- Low TTL (60 sec): fast failover, more DNS queries (cost)
- High TTL (24 hours): cheap, slow failover
Recommendation
- Stable apps: 300-3600s
- HA apps: 60s
- Before maintenance: lower the TTL about 24h ahead, so answers cached with the old, longer TTL have expired by the time work starts
ALIAS record
- AWS-managed TTL (usually 60 sec)
- Optimal for AWS resources
- Recommended over CNAME
5. Active-Passive Failover Setup
Steps
1. Create a health check on the primary ALB
2. Create the primary record:
   - Routing policy: Failover
   - Failover type: Primary
   - Health check ID: from step 1
   - Value: ALB-us-east-1 alias
3. Create the secondary record:
   - Routing policy: Failover
   - Failover type: Secondary
   - Health check: optional
   - Value: ALB-eu-west-1 alias
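The record-creation steps above can be sketched as a change batch for the Route 53 `change_resource_record_sets` API. Every concrete value here (record name, ALB DNS names, hosted zone IDs, health check ID) is a placeholder for illustration and must be replaced with real values.

```python
def failover_record(name, role, alias_dns_name, alias_zone_id,
                    health_check_id=None):
    """Build one UPSERT change for a failover alias A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),  # must be unique within the record pair
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,  # the ALB's zone, not your own zone
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary, secondary, health_check_id):
    """Primary carries the health check; the secondary's is optional."""
    return {
        "Comment": "Active-passive failover pair",
        "Changes": [
            failover_record(name, "PRIMARY",
                            health_check_id=health_check_id, **primary),
            failover_record(name, "SECONDARY", **secondary),
        ],
    }


batch = failover_change_batch(
    "app.example.com.",
    primary={"alias_dns_name": "alb-us-east-1.example.",   # placeholder
             "alias_zone_id": "ZPLACEHOLDERUSEAST1"},      # placeholder
    secondary={"alias_dns_name": "alb-eu-west-1.example.",  # placeholder
               "alias_zone_id": "ZPLACEHOLDEREUWEST1"},     # placeholder
    health_check_id="health-check-id-from-step-1",          # placeholder
)

# To apply (needs boto3 and AWS credentials):
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```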
Behavior
- DNS query → R53 returns Primary IF healthy
- If unhealthy → returns Secondary
- When primary recovers → R53 returns Primary again
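The behavior above amounts to a small decision rule, modeled locally below. The last branch reflects my understanding of the Route 53 documentation, that the primary is returned when both records are unhealthy; treat it as an assumption to verify for your setup.

```python
def failover_answer(primary_healthy, secondary_healthy=True):
    """Return which failover record answers a DNS query."""
    if primary_healthy:
        return "PRIMARY"
    if secondary_healthy:
        return "SECONDARY"
    # Assumption: with nothing healthy, the primary is returned anyway.
    return "PRIMARY"


normal = failover_answer(primary_healthy=True)
failed_over = failover_answer(primary_healthy=False)
recovered = failover_answer(primary_healthy=True)  # failback is automatic
```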
Combining with a DR strategy
- Pilot Light / Warm Standby: passive resources in DR region
- Health check fails → DNS failover + (optionally) auto-scale DR
6. Active-Active Failover
Pattern: Multivalue Answer or Weighted/Latency
- Both regions actively serving
- Each region has own health check
- If 1 region is unhealthy → DNS does not return its record
Weighted Routing with health checks
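A local model of weighted routing combined with health checks: unhealthy records are dropped and the remaining weights decide the split. The region names and weights are illustrative, weights are assumed to be positive, and the fallback to all records when nothing is healthy is an assumption about Route 53's behavior, not a verified detail.

```python
import random


def pick_weighted(records, rng=random):
    """Pick a region by weight, considering only healthy records."""
    candidates = [r for r in records if r["healthy"]]
    if not candidates:
        # Assumption: with nothing healthy, every record is considered.
        candidates = list(records)
    weights = [r["weight"] for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]["region"]


records = [
    {"region": "us-east-1", "weight": 70, "healthy": True},
    {"region": "eu-west-1", "weight": 30, "healthy": False},
]

# eu-west-1 is unhealthy, so every answer is us-east-1 despite the 70/30 split.
answer = pick_weighted(records)
```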
Latency-based với health checks
- Lowest-latency region returned
- Skip unhealthy regions
- Best for global apps
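The latency-based rule can be modeled the same way: filter out unhealthy regions, then take the lowest latency. The latency figures are invented for the example; Route 53 uses its own latency measurements rather than values supplied by the caller.

```python
def pick_lowest_latency(latency_ms, healthy):
    """Return the healthy region with the lowest latency, or None."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Invented latencies as seen from one client location.
latency_ms = {"us-east-1": 120, "eu-west-1": 25, "ap-southeast-1": 210}

all_up = pick_lowest_latency(
    latency_ms, {"us-east-1": True, "eu-west-1": True, "ap-southeast-1": True})
europe_down = pick_lowest_latency(
    latency_ms, {"us-east-1": True, "eu-west-1": False, "ap-southeast-1": True})
```

With every region healthy the client is sent to eu-west-1; when eu-west-1 fails its health check, the next-lowest latency (us-east-1) takes over.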
7. Multi-Region Failover Patterns
Pattern 1: Active-Passive with Aurora Global DB
Pattern 2: Active-Active with DynamoDB Global Tables
Pattern 3: Active-Active with Global Accelerator
8. Route 53 vs Global Accelerator
| | Route 53 Failover | Global Accelerator |
|---|---|---|
| Mechanism | DNS (TTL-dependent) | Anycast IP (TCP/UDP) |
| Failover time | Health check + TTL | < 1 minute (no DNS dependency) |
| Protocol | Any (DNS-resolved) | TCP, UDP |
| Static IP | No | Yes (2 anycast IPs) |
| Caching | DNS resolver | No |
| Cost | $0.50/hosted zone per month + queries | $0.025/hour per accelerator + data transfer |
When to use GA over Route 53
- Need static IPs for whitelisting
- Non-HTTP traffic (gaming, IoT)
- Faster failover (< 1 min vs TTL-dependent)
- Reduce DNS-related latency
9. Failover Testing
Test procedure
- Manually disable the primary (e.g., make the ALB target group report unhealthy)
- Observe DNS failover (use dig or an online DNS tool)
- Measure time: health check detection + TTL expiry
- Verify app works on secondary
- Re-enable primary, observe failback
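The "measure time" step of this procedure can be automated with a small poller like the sketch below. It asks the local resolver repeatedly and reports how long the answer took to change, so the result includes any caching done by the operating system or upstream resolver. The hostname and polling interval are placeholders.

```python
import socket
import time


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def wait_for_dns_change(hostname, resolve=resolve_ipv4, poll_seconds=5,
                        timeout_seconds=900, clock=time.monotonic,
                        sleep=time.sleep):
    """Poll until the DNS answer differs from the first answer seen.

    Returns (elapsed_seconds, old_answer, new_answer), or None on timeout.
    """
    baseline = resolve(hostname)
    start = clock()
    while clock() - start < timeout_seconds:
        current = resolve(hostname)
        if current != baseline:
            return clock() - start, baseline, current
        sleep(poll_seconds)
    return None
```

Start `wait_for_dns_change("app.example.com")` just before disabling the primary; the elapsed time it returns approximates detection plus TTL expiry as seen from that one machine.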
Tools
- dig +short example.com (check current DNS)
- AWS Console Route 53 health check status
- CloudWatch metrics on health check
Game day exercise
- Simulate region failure
- Document actual RTO/RPO
- Find gaps (e.g., manual steps that should be automated)
10. DNS Failover Limitations
Limitations
- DNS resolvers may cache (TTL)
- Browser DNS cache (separate from resolver, persistent across requests)
- CDN edge caches DNS too
- Some clients ignore TTL (rare)
Workarounds
- Use Global Accelerator (no DNS)
- Use CloudFront (origin failover at edge level)
- Use client-side retry logic with multiple endpoints
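The client-side retry workaround might look like this minimal sketch, which tries each regional endpoint in order and returns the first success. The URLs are hypothetical, and the 3-second timeout is an arbitrary choice.

```python
import urllib.request


def http_get(url, timeout=3):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(urls, fetch=http_get):
    """Try each endpoint in order; return (url, body) from the first success."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # covers URLError, timeouts, connection errors
            errors.append((url, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


# Hypothetical per-region endpoints, primary first.
ENDPOINTS = [
    "https://us-east-1.api.example.com/status",
    "https://eu-west-1.api.example.com/status",
]
```

Because the client holds both endpoint names, it does not have to wait for a cached DNS answer to expire before reaching the healthy region.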
11. CloudFront Origin Failover
Definition
CloudFront Origin Failover = a primary and a secondary origin, grouped in an origin group within a CloudFront distribution.
Setup
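As a hedged sketch of what the setup involves, the function below builds one origin group entry in the shape used by the `OriginGroups` section of a CloudFront distribution config. The group and origin IDs are hypothetical, and the origins themselves must already be defined in the same distribution.

```python
def origin_group(group_id, primary_origin_id, secondary_origin_id,
                 failover_status_codes=(500, 502, 503, 504)):
    """Build one origin group: primary first, secondary as the fallback."""
    codes = list(failover_status_codes)
    return {
        "Id": group_id,
        "FailoverCriteria": {
            # Responses from the primary that trigger a retry on the secondary.
            "StatusCodes": {"Quantity": len(codes), "Items": codes},
        },
        "Members": {
            "Quantity": 2,
            "Items": [
                {"OriginId": primary_origin_id},
                {"OriginId": secondary_origin_id},
            ],
        },
    }


# Hypothetical origin IDs for two regional ALBs.
group = origin_group("web-failover", "alb-us-east-1", "alb-eu-west-1")
origin_groups = {"Quantity": 1, "Items": [group]}
```

A cache behavior then targets the group ID (`web-failover` here) instead of a single origin.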
Behavior
- Edge tries primary, if fails → secondary
- Sub-second failover (no DNS)
- Use case: HTTP/HTTPS apps with multi-region origins
vs Route 53 Failover
- CloudFront: faster failover, only HTTP/HTTPS, requires CloudFront
- Route 53: any protocol, slower (TTL-dependent)
12. Common Patterns
Pattern 1: Web app multi-region failover
Pattern 2: API endpoint with static IP
Pattern 3: Database failover
Review questions
- What does Route 53 Failover require for the primary record?
  Answer: The primary record must have a health check associated with it (an alias record can rely on Evaluate Target Health instead). Route 53 probes the endpoint (HTTP/HTTPS/TCP) periodically: every 30s by default, or every 10s with a "Fast" health check. When the primary health check fails (3 consecutive failures by default), Route 53 returns the secondary record instead. A health check on the secondary record is optional, but having one is best practice. A health check can also monitor the state of a CloudWatch alarm.
- How does DNS TTL affect failover time?
  Answer: A higher TTL means slower failover. After Route 53 starts returning the new IP, clients that cached the old answer keep using the old IP until the TTL expires. For example, with a TTL of 300s, users may need up to 5 minutes after the switch to see the failover. A low TTL (60s or less) gives faster failover but generates more DNS queries, which raises cost. For planned maintenance: lower the TTL beforehand, do the maintenance with fast failover, then raise the TTL again.
- How does Global Accelerator failover differ from Route 53 failover?
  Answer: Global Accelerator fails over at the network layer: the anycast IPs are static and routing changes within seconds (under 1 minute, independent of DNS TTL), so clients never need to change IP. Route 53 fails over through DNS, so its speed depends on the TTL and DNS propagation (minutes). Global Accelerator suits RTO requirements measured in seconds; Route 53 failover suits RTOs measured in minutes. GA costs more (about $18 per accelerator per month plus data fees).
- At which layer does CloudFront Origin Failover operate?
  Answer: Layer 7 (HTTP). CloudFront automatically retries the origin request against the secondary origin when the primary origin returns HTTP 500, 502, 503, or 504 (the status codes are configurable). This is transparent to clients, which never see the failover. Setup: an Origin Group with a primary and a secondary origin. CloudFront tries the primary and, if it fails, tries the secondary. It suits active-passive origins that need automatic content failover.
- How does failover speed compare between Active-Active and Active-Passive?
  Answer: Active-Active: failover is nearly immediate because both endpoints are already serving traffic, so the routing layer (DNS or load balancer) only has to stop sending traffic to the failed endpoint. RTO is seconds, although with DNS-based routing cached answers still have to expire. Active-Passive: the passive region has to be promoted first (launch instances, scale up, update DNS), so RTO ranges from minutes to hours depending on the strategy. Active-Active gives a better RTO but needs real-time data synchronization between regions, which is more complex and more expensive.
Hands-on exercises
- Set up Failover routing: primary + secondary with a health check
- Test: disable the primary, measure failover time
- Set up Latency-based routing with health checks (active-active)
- Set up Global Accelerator with 2 regions
- Set up CloudFront Origin Failover
Official references
Next: Aurora Global Database