Week 10 - Day 3: Spot Instance Strategies
Learning objectives
- Understand Spot pricing and the interruption mechanism
- Apply Spot Fleet with allocation strategies
- Build fault-tolerant Spot architectures
- Combine Spot + On-Demand in an ASG
1. Spot Recap
Spot Instances = unused EC2 capacity at up to a 90% discount; AWS can reclaim them with a 2-minute warning.
When to use
- Fault-tolerant workloads (can interrupt + retry)
- Stateless or checkpoint-based
- Big data, ML training, image rendering
- CI/CD workers, dev/test environments
When NOT to use
- Database (stateful)
- Mission-critical apps
- Workloads that cannot be replaced mid-execution
2. Spot Pricing
Pricing model
- AWS sets price based on capacity supply/demand
- You can optionally set a max price you are willing to pay (it defaults to the On-Demand price)
- If Spot price ≤ max → instance runs (as long as capacity is available)
- If Spot price > max → instance interrupted
Spot Price History
- View past 90 days
- Identify stable instance types/AZs
- Avoid volatile pools
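One way to act on the price history is to rank pools by how much their price moved over the window. The following is a minimal sketch, assuming the history has already been exported (for example from `describe-spot-price-history`) into (instance type, AZ, price) tuples. The stability metric and the sample figures are assumptions made for illustration, not AWS definitions or real prices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_pools_by_stability(history):
    """Return pools sorted from most stable to most volatile.

    history: iterable of (instance_type, az, price) samples.
    Stability is measured as the coefficient of variation
    (standard deviation / mean); this choice is an assumption.
    """
    prices = defaultdict(list)
    for instance_type, az, price in history:
        prices[(instance_type, az)].append(price)

    scored = []
    for pool, samples in prices.items():
        scored.append((pstdev(samples) / mean(samples), pool))
    return [pool for _, pool in sorted(scored)]


# Hypothetical samples, not real Spot prices.
SAMPLE_HISTORY = [
    ("m5.large", "us-east-1a", 0.030),
    ("m5.large", "us-east-1a", 0.031),
    ("m5.large", "us-east-1a", 0.030),
    ("c5.large", "us-east-1b", 0.020),
    ("c5.large", "us-east-1b", 0.045),
    ("c5.large", "us-east-1b", 0.028),
]

if __name__ == "__main__":
    for pool in rank_pools_by_stability(SAMPLE_HISTORY):
        print(pool)
```

Pools near the end of the list are the volatile ones to leave out of a fleet definition.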
Best practice
- Set max price = On-Demand price (this keeps instances from being reclaimed because of price)
- Most Spot interruptions are caused by capacity, not price
3. Spot Interruption Behaviors
3 options
- Terminate (default): instance destroyed
- Stop: stop instance, keep EBS, restart later if Spot available
- Hibernate: save RAM to EBS, fast resume
Stop and Hibernate require
- EBS-backed
- Specific instance types
Use case for each
- Terminate: stateless, can replace
- Stop: large state, OK to pause
- Hibernate: warm cache, fast resume needed
4. Spot Interruption Notice
Workflow
- AWS decides to reclaim instance
- 2-minute warning sent via instance metadata
- App should:
- Save state (S3, DynamoDB)
- Drain connections (deregister from ELB)
- Checkpoint progress
- After 2 min: instance terminated
Check warning
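# Inside the instance (IMDSv2: request a session token first)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/spot/instance-action
# Returns JSON with the action and its time if an interruption is coming
# Returns HTTP 404 when no interruption is scheduled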
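The same check can be run in a loop from application code. The sketch below assumes the endpoint answers with JSON of the form `{"action": ..., "time": ...}` when an interruption is pending and with HTTP 404 otherwise. The function names are hypothetical, and the 5-second interval is simply the polling period suggested in the review questions.

```python
import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the instance-action body, or None when no notice exists."""
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        IMDS + "/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_until_interruption(body, now=None):
    """Parse the notice and return (action, seconds left, never negative)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return notice["action"], max(0.0, (when - now).total_seconds())


def wait_for_notice(fetch=fetch_instance_action, interval=5.0, sleep=time.sleep):
    """Poll until a notice appears, then return (action, seconds left)."""
    while True:
        body = fetch()
        if body is not None:
            return seconds_until_interruption(body)
        sleep(interval)
```

Calling `wait_for_notice()` in a background thread gives the application a point at which to start saving state and draining connections. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a real instance.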
EventBridge integration
- Spot interruption event → Lambda → handle
- E.g., update database, notify monitoring
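The EventBridge route can be sketched as a Lambda handler. This assumes a rule that matches the `EC2 Spot Instance Interruption Warning` detail type from source `aws.ec2`, with the instance ID and action carried in the `detail` object. The `notify` callback is a hypothetical stand-in for whatever the team actually does (update a database, page the monitoring system).

```python
INTERRUPTION_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def log_notice(instance_id, action):
    """Default notifier: just print. Replace with a real integration."""
    print(f"Spot interruption: {instance_id} will {action}")


def handler(event, context=None, notify=log_notice):
    """Lambda entry point for Spot interruption events from EventBridge."""
    if event.get("detail-type") != INTERRUPTION_DETAIL_TYPE:
        # Ignore anything the rule was not meant to deliver.
        return {"handled": False}

    detail = event.get("detail", {})
    instance_id = detail["instance-id"]
    action = detail.get("instance-action")
    notify(instance_id, action)
    return {"handled": True, "instance_id": instance_id, "action": action}
```

Lambda only passes `event` and `context`, so the `notify` default is what runs in production; the extra parameter is there to make the handler easy to test locally.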
5. Spot Allocation Strategies (Spot Fleet & ASG)
Why important
- Pool = combination (instance type + AZ + OS + tenancy)
- Spread across pools = less interruption
Strategies
lowestPrice
- Cheapest pool
- Higher interruption risk
- Use: cost-sensitive batch
diversified
- Spread evenly across all pools
- Better availability
- Use: stateless workloads
capacityOptimized (recommended)
- Pools with the largest available capacity
- Lowest interruption rate
- Use: most workloads, default recommendation
priceCapacityOptimized (newest, often best)
- Balance lowest price + lowest interruption
- Use: cost + reliability balance
Decision
- capacity-optimized: priority is uptime
- price-capacity-optimized: balance
- lowest-price: priority is cost (more risk)
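The decision list can be written down as a small lookup. Note that the same strategy is spelled differently depending on the API: Auto Scaling groups use hyphenated names, while Spot Fleet uses camelCase. The `priority` keys below ("uptime", "balance", "cost") are labels invented for this sketch.

```python
# priority label -> (Auto Scaling spelling, Spot Fleet spelling)
STRATEGY_NAMES = {
    "uptime": ("capacity-optimized", "capacityOptimized"),
    "balance": ("price-capacity-optimized", "priceCapacityOptimized"),
    "cost": ("lowest-price", "lowestPrice"),
}


def choose_allocation_strategy(priority, api="asg"):
    """Return the strategy name for a priority, spelled for the given API.

    api: "asg" for EC2 Auto Scaling, "spot-fleet" for Spot Fleet.
    """
    if priority not in STRATEGY_NAMES:
        raise ValueError(f"unknown priority: {priority!r}")
    if api not in ("asg", "spot-fleet"):
        raise ValueError(f"unknown api: {api!r}")
    asg_name, fleet_name = STRATEGY_NAMES[priority]
    return asg_name if api == "asg" else fleet_name


if __name__ == "__main__":
    print(choose_allocation_strategy("balance"))
    print(choose_allocation_strategy("uptime", api="spot-fleet"))
```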
6. Spot Fleet
Definition
Spot Fleet = managed group of Spot (+ optional On-Demand) instances meeting target capacity.
Features
- Mix multiple instance types/families
- Multiple Spot pools
- Optional On-Demand baseline
- Auto-replace terminated instances
Configuration
Spot Fleet:
Target capacity: 100 instances (or vCPUs or memory units)
On-Demand portion: 20 (always running)
Spot portion: 80 (price-sensitive)
Allocation strategy: capacity-optimized
Instance types pool:
- m5.large
- m5.xlarge
- m5a.large
- c5.large
- c5a.large
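The configuration above maps fairly directly onto a Spot Fleet request. The sketch below only builds the request dictionary; it assumes a launch template already exists and that the fleet should maintain its capacity. The role ARN and template name are placeholders, not values from this lesson.

```python
def build_spot_fleet_config(
    fleet_role_arn,
    launch_template_name,
    instance_types,
    total_capacity=100,
    on_demand_capacity=20,
    strategy="capacityOptimized",
):
    """Build a SpotFleetRequestConfig dictionary (no AWS call is made)."""
    if on_demand_capacity > total_capacity:
        raise ValueError("On-Demand portion cannot exceed the target capacity")
    return {
        "IamFleetRole": fleet_role_arn,
        "Type": "maintain",  # replace interrupted instances automatically
        "TargetCapacity": total_capacity,  # total, including On-Demand
        "OnDemandTargetCapacity": on_demand_capacity,
        "AllocationStrategy": strategy,
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # One override per instance type widens the pool selection.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            }
        ],
    }


if __name__ == "__main__":
    config = build_spot_fleet_config(
        "arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder
        "example-template",  # placeholder
        ["m5.large", "m5.xlarge", "m5a.large", "c5.large", "c5a.large"],
    )
    print(config["TargetCapacity"], config["OnDemandTargetCapacity"])
```

With boto3 installed and credentials configured, the dictionary would be passed as `SpotFleetRequestConfig` to the EC2 client's `request_spot_fleet` call.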
7. EC2 Auto Scaling with Mixed Instances
Mixed Instances Policy
- ASG launches multiple instance types
- Mix Spot + On-Demand
- Specify base capacity (On-Demand) + percentage (Spot)
Example
ASG:
Min: 2, Desired: 10, Max: 50
On-Demand base: 2 (always)
On-Demand percentage above base: 20% (2 of every 10 above base)
Spot percentage: 80%
Instance types:
- m5.large
- m5a.large
- m5n.large
- m4.large
Spot allocation strategy: capacity-optimized
Result
- 2 baseline On-Demand
- Above 2: 20% On-Demand, 80% Spot
- E.g., at desired=10: 2 base + 20% of the remaining 8 (1.6, rounded up to 2) = 4 On-Demand + 6 Spot
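The split can be checked with a few lines of arithmetic. This sketch assumes that a fractional On-Demand count above the base is rounded up in favor of On-Demand, which is the behavior the example numbers imply; verify it against the current Auto Scaling documentation before relying on it.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) counts for a desired capacity.

    Assumption: fractional On-Demand counts above the base round up.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Integer ceiling of above_base * pct / 100, avoiding float error.
    extra_on_demand = -(-above_base * on_demand_pct_above_base // 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


if __name__ == "__main__":
    for desired in (2, 10, 50):
        print(desired, split_capacity(desired, 2, 20))
```

For the example group, `split_capacity(10, 2, 20)` gives 4 On-Demand and 6 Spot, matching the result above.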
8. Spot Best Practices
1. Design for interruption
- Stateless or checkpoint state
- Idempotent operations
- Use SQS for work distribution (resume if interrupted)
2. Diversify
- Multiple instance types, AZs
- Reduce dependency on single pool
3. Use capacity-optimized
- Lowest interruption
- Often default best choice
4. Combine with On-Demand
- Spot for elasticity (scale out/in)
- On-Demand + Spot blend
5. Test interruption handling
- Simulate via a mocked metadata endpoint or AWS Fault Injection Service (formerly Fault Injection Simulator)
- Verify app saves state, fails gracefully
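The "design for interruption" practice comes down to recording progress often enough that a replacement instance can pick up where the old one stopped. Below is a minimal sketch of that idea. It uses a local JSON file as the checkpoint store purely for illustration; on Spot the checkpoint would live somewhere durable such as S3 or DynamoDB. The function and parameter names are hypothetical.

```python
import json
import os


def process_with_checkpoint(items, checkpoint_path, handle, should_stop=lambda: False):
    """Process items in order, saving the next index after each one.

    Returns True when everything is done, False when interrupted.
    handle() should be idempotent: if the instance dies between
    handle() and the checkpoint write, that item is processed again.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for index in range(start, len(items)):
        if should_stop():  # e.g. an interruption notice was seen
            return False
        handle(items[index])
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
    return True


if __name__ == "__main__":
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    finished = process_with_checkpoint(["a", "b", "c"], path, print)
    print("finished:", finished)
```

Running the function a second time with the same checkpoint path skips the items that were already handled, which is the resume behavior a Spot worker needs.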
9. Use Cases
Big Data (Spark, EMR)
- EMR managed Spot + On-Demand cluster
- Master + core: On-Demand (stable)
- Task nodes: Spot (fault-tolerant compute)
CI/CD
- Jenkins agents on Spot
- Multiple instance types
- Fallback to On-Demand if Spot unavailable
Batch processing
- AWS Batch with Spot pricing
- Auto-retry on interruption
Containerized workloads
- ECS / EKS with Spot
- Spot nodes for stateless services
- Persistent state in RDS/EFS
Image/video rendering
- Render farm on Spot
- Each job checkpoints its progress
- Resume if interrupted
10. Spot vs On-Demand Cost Example
Scenario: 100 m5.large for batch (4 hours/day)
Pure On-Demand
- 100 × $0.096/hr × 4 hr × 30 days = $1,152/month
Spot (70% discount, ~10% interruption overhead)
- 100 × $0.029/hr × 4 hr × 30 days × 1.1 = $383/month
- Saving: ~$769/month (67%)
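The comparison is easy to reproduce and to rerun with other rates. The sketch below uses the scenario's own figures; the hourly prices are the example values from this section, not current AWS prices.

```python
def monthly_cost(instances, hourly_rate, hours_per_day, days=30, overhead=0.0):
    """Monthly cost in dollars; overhead is extra runtime from retries."""
    return instances * hourly_rate * hours_per_day * days * (1 + overhead)


if __name__ == "__main__":
    on_demand = monthly_cost(100, 0.096, 4)
    spot = monthly_cost(100, 0.029, 4, overhead=0.10)
    saving = on_demand - spot
    print(f"On-Demand: ${on_demand:,.2f}")
    print(f"Spot:      ${spot:,.2f}")
    print(f"Saving:    ${saving:,.2f} ({saving / on_demand:.0%})")
```

With these inputs the script reports roughly $1,152 for On-Demand, $383 for Spot, and a saving of about 67%.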
11. Common Architecture Patterns
Pattern 1: Web app baseline + Spot scale
ASG:
- 5 On-Demand baseline (always)
- Auto-scale up to 50 with mix:
- 20% On-Demand (reliability)
- 80% Spot (cost)
- capacity-optimized strategy
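Pattern 1 corresponds to a Mixed Instances Policy on the Auto Scaling group. The sketch below builds only the policy dictionary. The launch template name and the instance type list are placeholders chosen for illustration.

```python
def build_mixed_instances_policy(
    launch_template_name,
    instance_types,
    on_demand_base=5,
    on_demand_pct_above_base=20,
    spot_strategy="capacity-optimized",
):
    """Build a MixedInstancesPolicy dictionary (no AWS call is made)."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Each override adds one more instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": spot_strategy,
        },
    }


if __name__ == "__main__":
    policy = build_mixed_instances_policy(
        "example-web-template",  # placeholder name
        ["m5.large", "m5a.large", "m5n.large", "m4.large"],
    )
    print(policy["InstancesDistribution"])
```

With boto3, the dictionary would be passed as `MixedInstancesPolicy` to the Auto Scaling client's `create_auto_scaling_group` call, together with `MinSize`, `MaxSize=50`, and the subnets for the group.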
Pattern 2: Big data pipeline
EMR cluster:
- Master + Core: On-Demand
- Task nodes: Spot Fleet (mix m5, c5, r5)
- Checkpoint to S3 frequently
Pattern 3: ML training
SageMaker Training:
- Use Managed Spot Training
- Save model checkpoints to S3
- Resume from checkpoint if interrupted
- 70-90% saving
Pattern 4: CI/CD farm
Jenkins or GitHub Actions runners:
- Auto-launch Spot instance per job
- Self-host runner on Spot
- Reduce idle cost vs always-on
12. Spot Limits
Service quotas
- Default: 256 vCPUs per Region for Spot (the actual default can differ by account and instance family; check Service Quotas)
- Request increase if needed
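Because the quota is counted in vCPUs rather than instances, it helps to add up a planned fleet before launching it. A minimal sketch follows; the vCPU table covers only the instance types used in this lesson, and the quota value is passed in by the caller rather than read from AWS.

```python
# vCPUs per instance type (only the types used in this lesson).
VCPUS = {
    "m5.large": 2,
    "m5.xlarge": 4,
    "m5a.large": 2,
    "c5.large": 2,
    "c5a.large": 2,
}


def fits_quota(fleet, quota_vcpus):
    """Return (vCPUs needed, whether the fleet fits under the quota).

    fleet: mapping of instance type -> instance count.
    """
    needed = sum(VCPUS[itype] * count for itype, count in fleet.items())
    return needed, needed <= quota_vcpus


if __name__ == "__main__":
    print(fits_quota({"m5.large": 100}, 256))
    print(fits_quota({"m5.xlarge": 100}, 256))
```

One hundred m5.large instances need 200 vCPUs and fit under a 256-vCPU quota, while one hundred m5.xlarge instances need 400 and would require a quota increase.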
Region/AZ availability
- Some instance types not in all regions
- Availability changes (capacity)
Cannot use Spot for
- Dedicated Hosts (host tenancy is not supported for Spot; dedicated tenancy is)
- Capacity Reservations
- Some specific instance types
Review questions
- How long is the Spot interruption warning before termination?
Answer: 2 minutes (120 seconds). AWS sends the Spot Instance interruption notice through instance metadata (/latest/meta-data/spot/termination-time) and EventBridge. The application has to checkpoint its work, drain connections, and save state within those 2 minutes. Best practice: poll the metadata every 5 seconds and configure ASG lifecycle hooks so the instance is drained gracefully from the load balancer before it terminates.
- Which allocation strategy gives the lowest interruption rate?
Answer: The capacity-optimized strategy. AWS picks the pools with the most available capacity, which are therefore the least likely to be interrupted. This differs from lowest-price (the default), which picks the cheapest pool. capacity-optimized-prioritized combines both ideas: AWS favors the pools you prioritize while still optimizing for capacity. For workloads where uptime matters more than the lowest possible cost, such as batch ETL or HPC jobs, use capacity-optimized.
- Can a Spot Instance be stopped or hibernated?
Answer: Yes. Instead of being terminated, a Spot Instance can be stopped (EBS-backed only; the instance state is kept on EBS) or hibernated (RAM is dumped to EBS, the instance is stopped, and it later resumes in the same state). The interruption behavior is configured when you make the request. Hibernate suits workloads that need their memory state preserved (for example, ML inference with models already loaded). It requires an EBS volume at least as large as the RAM, and an instance type and OS that support hibernation.
- When should you NOT use Spot?
Answer: Do not use Spot for: (1) databases (stateful; an interruption can cause data loss or corruption), (2) stateful web servers without proper drain/failover, (3) long-running jobs without checkpoints (all progress is lost on interruption), (4) compliance workloads that need guaranteed availability, (5) primary instances in HA architectures with no On-Demand fallback. Spot needs workloads that are fault-tolerant, stateless, or able to checkpoint.
- What is a Mixed Instances Policy in an ASG?
Answer: A Mixed Instances Policy lets an ASG use multiple instance types from the same base Launch Template. It combines On-Demand and Spot in one ASG: On-Demand for baseline capacity (stable), Spot for burst capacity (lower cost). You configure the On-Demand base capacity and the On-Demand percentage above it, the list of instance types (AWS picks those with the best availability), and the allocation strategy. Diversifying across many pools reduces Spot interruptions.
Hands-on exercises
- Launch a Spot Instance with max price = On-Demand
- Listen for the interruption notice in instance metadata
- Create an ASG with a Mixed Instances Policy (50% Spot, 50% On-Demand)
- Test Spot Fleet with 3 instance types
- Run a sample batch job on EMR with Spot task nodes
Next: Decoupled Architectures