Week 7 - Day 4: AWS Step Functions
Learning objectives
- Understand Step Functions: serverless workflow orchestration
- Distinguish Standard vs Express workflows
- Know the state types and error handling
- Apply it to ETL, ML pipelines, microservices
1. Tổng quan Step Functions
AWS Step Functions is a serverless visual workflow service that orchestrates AWS services into business processes.
Features
- Visual workflow editor + JSON definition (ASL - Amazon States Language)
- 200+ AWS service integrations
- Built-in error handling (retry, catch)
- State management (Step Functions tracks state, so you don't need to write your own)
- Audit trail (full execution history)
- Long-running: up to 1 year (Standard workflow)
Use cases
- Order processing workflows
- ETL pipelines
- ML training/inference pipelines
- Microservice coordination
- Approval workflows (with human task)
- Saga pattern (distributed transactions)
2. Workflow Types
Standard Workflow
- Long-running: up to 1 year
- Exactly-once execution
- Audit history retained 90 days
- Pricing: per state transition ($25/M)
- Use case: long-running business processes, batch ETL
Express Workflow
- Short-running: up to 5 minutes
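- At-least-once execution for async Express (a step may run more than once); at-most-once for sync Express
- No execution history stored in Step Functions; enable CloudWatch Logs to inspect executions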
- Pricing: per execution + duration (much cheaper than Standard)
- Use case: high-volume short workflows (IoT, mobile, streaming)
Express variants
- Sync Express: caller waits for result (like REST)
- Async Express: fire-and-forget (like SNS)
Comparison
| | Standard | Express |
|---|---|---|
| Duration | Up to 1 year | Up to 5 min |
| Execution rate | 2K/sec | 100K/sec |
| Pricing | $25/M transitions | $1/M executions + duration (about $0.06/GB-hour at the first tier) |
| History | 90 days | CloudWatch logs |
| Use case | Long workflows | High-volume short |
3. State Types
Task
- Execute single unit of work
- Integrations: Lambda, ECS task, AWS SDK, Activity (worker)
Pass
- Pass input to output (transformation)
- Useful for testing, transforming data
Wait
- Wait for time period or until timestamp
Choice
- Conditional branching (if/else)
Parallel
- Execute multiple branches in parallel
- Wait for all to complete
Map
- Execute task for each item in array (parallel)
- Like a for-each loop
Succeed / Fail
- Terminate workflow with success/failure
4. ASL Example
Simple Lambda task
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111:function:ValidateOrder",
      "Next": "ChargePayment"
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111:function:ChargePayment",
      "End": true
    }
  }
}
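A definition like the one above can be sanity-checked locally before deploying. The sketch below rebuilds the same two-state workflow as a Python dict and looks for transitions that point at undefined states. The helper name `find_dangling_targets` is hypothetical, and the check covers only `StartAt`, `Next`, `Catch`, and `Choice` targets; it is not a full ASL validator.

```python
import json

# The two-step order workflow from the example, as a Python dict.
# The account ID "111" is the placeholder used in the notes.
definition = {
    "Comment": "Order processing workflow",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111:function:ValidateOrder",
            "Next": "ChargePayment",
        },
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111:function:ChargePayment",
            "End": True,
        },
    },
}


def find_dangling_targets(definition):
    """Return the sorted names of states that are referenced but not defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
        if "Default" in state:
            targets.add(state["Default"])
    return sorted(targets - set(states))


print(find_dangling_targets(definition))  # [] means every target exists
asl_text = json.dumps(definition, indent=2)  # JSON text ready to deploy
```

Serializing with `json.dumps` also turns Python's `True` into the JSON `true` that ASL expects.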
With error handling
{
  "ChargePayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:111:function:ChargePayment",
    "Retry": [{
      "ErrorEquals": ["Lambda.ServiceException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }],
    "Catch": [{
      "ErrorEquals": ["PaymentDeclined"],
      "Next": "NotifyCustomer"
    }],
    "Next": "ShipOrder"
  }
}
5. Error Handling
Retry
- Retry on specific errors
- Configurable: interval, max attempts, backoff rate
- Common errors:
  - States.ALL: catch-all wildcard
  - States.Timeout
  - Lambda.ServiceException
  - Custom error names from Lambda
Catch
- On error, transition to specific state
- Like try-catch in code
- Can chain catches for different error types
Pattern
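The usual pattern is to retry transient errors first and fall back to a Catch route only once retries are exhausted or the error is not retryable. The sketch below models that control flow in plain Python, using the values from the ChargePayment example (IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.0). The function names are hypothetical, errors are modelled as `RuntimeError(error_name)`, and the `States.ALL` wildcard is not handled.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait in seconds before each retry: interval * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def run_with_retry_and_catch(task, retry_on, max_attempts, catch_routes, success_next):
    """Run `task`, retry listed errors, then fall back to the Catch routes.

    `task` is a zero-argument callable that returns normally or raises
    RuntimeError(error_name). Returns the name of the next state.
    There is no real sleeping; only the control flow is modelled.
    """
    retries_left = max_attempts
    while True:
        try:
            task()
            return success_next
        except RuntimeError as exc:
            error_name = str(exc)
            if error_name in retry_on and retries_left > 0:
                retries_left -= 1
                continue
            if error_name in catch_routes:
                return catch_routes[error_name]
            raise  # unhandled error: the execution fails


def declined():
    raise RuntimeError("PaymentDeclined")


print(retry_delays(2, 3, 2.0))  # [2.0, 4.0, 8.0]
print(run_with_retry_and_catch(
    declined,
    retry_on={"Lambda.ServiceException"},
    max_attempts=3,
    catch_routes={"PaymentDeclined": "NotifyCustomer"},
    success_next="ShipOrder",
))  # NotifyCustomer
```

MaxAttempts counts retries, so a task configured this way can run up to four times in total: the first attempt plus three retries.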
6. Service Integration Patterns
Request-Response (default)
- Step Functions calls the service and moves on as soon as it gets the HTTP response; it does not wait for any underlying job to finish
- E.g., Lambda invoke, S3 operations
Run a Job (.sync)
- Step Functions submits job, waits for completion
- E.g., ECS task, EMR step, Glue job, SageMaker training
Wait for a Callback (.waitForTaskToken)
- Step Functions pauses, waits for external system to callback
- E.g., Human approval, manual review
- Token returned via the SendTaskSuccess or SendTaskFailure API
Example: Human approval
{
  "ApprovalRequired": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
    "Parameters": {
      "TopicArn": "arn:aws:sns:...",
      "Message": {
        "TaskToken.$": "$$.Task.Token",
        "OrderId.$": "$.orderId"
      }
    },
    "Next": "ProcessApproval"
  }
}
Human approver gets SNS, clicks link → app calls SendTaskSuccess(token, output) → workflow resumes.
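The callback side of this flow might look like the sketch below. It assumes the client object behaves like `boto3.client("stepfunctions")`, whose `send_task_success` and `send_task_failure` methods take the keyword arguments shown. The client is passed in as a parameter so the function can be tried against a stub. `complete_approval`, `RecordingClient`, and the error name `ApprovalRejected` are hypothetical names used for illustration.

```python
import json


def complete_approval(sfn_client, task_token, approved, reviewer):
    """Resume a workflow that is paused on .waitForTaskToken."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name
            cause=f"Rejected by {reviewer}",
        )


class RecordingClient:
    """Stand-in for the Step Functions client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("send_task_success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("send_task_failure", kwargs))


client = RecordingClient()
complete_approval(client, "example-token", approved=True, reviewer="alice")
print(client.calls[0][0])  # send_task_success
```

In a real handler the token would come from the link the approver clicked, and a rejection would surface in the workflow as an error that a Catch block can route.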
7. Map State (Iteration)
Inline Map
- Process array in parallel (within workflow definition)
- Up to 40 concurrent iterations
- Use case: modest-size arrays held in the state input; more than 40 items are allowed, but only 40 run at a time
Distributed Map (2022+)
- Process millions of items in parallel
- Source: S3 objects, CSV/JSON list, inventory
- Up to 10,000 concurrent
- Use case: process large dataset
Example: Process objects under an S3 prefix
{
  "ProcessFiles": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "input/"
      }
    },
    "MaxConcurrency": 1000,
    "ItemProcessor": {
      "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "STANDARD" },
      "StartAt": "ProcessOneFile",
      "States": {
        "ProcessOneFile": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:...:function:ProcessFile",
          "End": true
        }
      }
    },
    "End": true
  }
}
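To get a feel for what MaxConcurrency does, the sketch below emulates a Map state locally with a thread pool. It is an analogy rather than how Step Functions runs iterations. `run_map` is a hypothetical helper that mirrors two behaviours: a value of 0 means no explicit limit, and results come back in the same order as the input items.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map(items, process_item, max_concurrency=0):
    """Apply process_item to every item, at most max_concurrency at a time."""
    if not items:
        return []
    # 0 means "no explicit limit": give every item its own worker.
    workers = max_concurrency if max_concurrency > 0 else len(items)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, whichever item finishes first.
        return list(pool.map(process_item, items))


keys = ["input/a.csv", "input/b.csv", "input/c.csv"]  # hypothetical object keys
print(run_map(keys, str.upper, max_concurrency=2))
# ['INPUT/A.CSV', 'INPUT/B.CSV', 'INPUT/C.CSV']
```

Setting `max_concurrency=1` makes the same call run the items one after another, which is the sequential case.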
8. Input/Output Processing
Filtering
- InputPath: select subset of input
- Parameters: construct new JSON for task
- ResultSelector: select subset of task output
- ResultPath: where to put task output in state's output
- OutputPath: select subset of state's output for next state
Example
State input: { "user": "alice", "order": { "id": 123 } }
InputPath: $.order
→ Task receives: { "id": 123 }
Task returns: { "status": "OK" }
ResultPath: $.result
→ State output: { "user": "alice", "order": {...}, "result": { "status": "OK" } }
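The same example can be traced in a few lines of Python. This is a simplified model that handles only plain dotted paths such as `$.order`, and `get_path` and `apply_state_io` are hypothetical helper names. Note that ResultPath is applied to the original state input, not to the InputPath-filtered value.

```python
import copy


def get_path(document, path):
    """Resolve a simple '$.a.b' reference path ('$' is the whole document)."""
    node = document
    for key in path.split(".")[1:]:
        node = node[key]
    return node


def apply_state_io(state_input, task, input_path="$", result_path="$"):
    """Run `task` with InputPath and ResultPath applied (dotted paths only)."""
    task_input = get_path(state_input, input_path)
    result = task(task_input)
    if result_path == "$":
        return result  # default: the result replaces the whole input
    output = copy.deepcopy(state_input)
    node = output
    keys = result_path.split(".")[1:]
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


state_input = {"user": "alice", "order": {"id": 123}}
output = apply_state_io(
    state_input,
    task=lambda order: {"status": "OK"},  # stands in for the real task
    input_path="$.order",
    result_path="$.result",
)
print(output)
# {'user': 'alice', 'order': {'id': 123}, 'result': {'status': 'OK'}}
```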
9. Activities (Worker Pattern)
Definition
Activity = task processed by external worker (EC2, on-prem) instead of Lambda.
Workflow
- State Machine has Activity task
- Worker polls Step Functions for tasks (GetActivityTask)
- Worker processes the task
- Worker calls SendTaskSuccess/SendTaskFailure
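The worker steps above can be sketched as a single poll-process-report cycle. The code assumes a client that behaves like `boto3.client("stepfunctions")`, whose `get_activity_task` returns a dict with `taskToken` and `input` (no token when the long poll finds no work). `poll_once`, `StubClient`, and the activity ARN are hypothetical and exist only so the sketch runs without AWS access.

```python
import json


def poll_once(sfn_client, activity_arn, handler, worker_name="worker-1"):
    """Fetch at most one activity task, process it, and report the outcome.

    Returns True if a task was processed, False if the poll came back empty.
    """
    response = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName=worker_name
    )
    token = response.get("taskToken")
    if not token:  # long poll ended without any work
        return False
    try:
        result = handler(json.loads(response["input"]))
    except Exception as exc:  # report any handler error back to the workflow
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True


class StubClient:
    """Hypothetical in-memory stand-in for the Step Functions client."""

    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.calls = []

    def get_activity_task(self, activityArn, workerName):
        return self.tasks.pop(0) if self.tasks else {}

    def send_task_success(self, taskToken, output):
        self.calls.append(("success", taskToken, output))

    def send_task_failure(self, taskToken, error, cause):
        self.calls.append(("failure", taskToken, error, cause))


client = StubClient([{"taskToken": "token-1", "input": '{"orderId": 42}'}])
arn = "arn:aws:states:us-east-1:111:activity:Example"  # hypothetical ARN
poll_once(client, arn, lambda data: {"processed": data["orderId"]})
print(client.calls)
```

A real worker would call `poll_once` in a loop, and a long-running handler would typically also send heartbeats so the task does not time out.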
Use case
- Long-running tasks (> 15 min Lambda limit)
- On-prem workers
- Custom worker pool
10. Common Patterns
Pattern 1: Order processing
Validate Order → Charge Payment → Ship Order (on PaymentDeclined → Notify Customer)
Pattern 2: ML pipeline
Prepare Data (Glue) → Train Model (SageMaker .sync)
→ Deploy Endpoint → Test → Promote to production
Pattern 3: ETL with Map
Trigger by S3 event → List files in S3 → Map (parallel process each file)
→ Aggregate results → Load to Redshift
Pattern 4: Saga (distributed transactions)
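In a saga, each step has a compensating action, and a failure triggers the compensations of the steps that already completed, in reverse order. In Step Functions this is typically expressed with a Catch on each Task that routes to compensation states. The sketch below models only the control flow in plain Python. `run_saga` and the step names ReserveInventory and ShipOrder are hypothetical; ChargePayment is borrowed from the earlier example.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, run the compensations of the steps that already
    completed, newest first, and return a log of what happened.
    """
    log = []
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"{name}:failed")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"{done_name}:compensated")
            return log
        log.append(f"{name}:ok")
        completed.append((name, compensation))
    return log


def fail_shipping():
    raise RuntimeError("carrier unavailable")


steps = [
    ("ReserveInventory", lambda: None, lambda: None),
    ("ChargePayment", lambda: None, lambda: None),
    ("ShipOrder", fail_shipping, lambda: None),
]
print(run_saga(steps))
# ['ReserveInventory:ok', 'ChargePayment:ok', 'ShipOrder:failed',
#  'ChargePayment:compensated', 'ReserveInventory:compensated']
```

The failed step itself is not compensated, since it never completed; only the work that succeeded before it is undone.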
11. Security
IAM
- State machine has IAM role
- Role permissions for invoking targets (Lambda, ECS, etc.)
- Cross-account: the target account must grant access (for example through a resource-based policy or a role the state machine can assume)
Encryption
- Execution history: encrypted at rest by default with AWS owned keys
- Can use a customer managed KMS key instead
VPC
- Tasks (Lambda, ECS) in VPC if needed
- Step Functions itself is a regional service that does not run inside your VPC; it can be reached privately through an interface VPC endpoint
12. Step Functions vs Lambda Choreography
Choreography (Lambdas calling Lambdas)
- Each Lambda decides next step
- Pros: simple for small flow
- Cons: hard to debug, scattered logic, no central view
Orchestration (Step Functions)
- Central workflow definition
- Pros: visual flow, centralized error handling, easy to debug
- Cons: extra cost (state transitions)
Decision
- Few steps, simple flow: Lambda chain
- Complex flow, error handling, audit: Step Functions
- Long-running, human task: Step Functions
13. Pricing
Standard
- $25 per million state transitions
- Free tier: 4K transitions/month
Express
- $1 per million executions
- Duration: about $0.06 per GB-hour at the first pricing tier, billed by memory and run time (check the current pricing page)
- Cheaper for high-volume short workflows
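A quick calculation makes the difference concrete. The sketch below uses only the figures quoted in these notes: $25 per million transitions with 4,000 free per month for Standard, and $1 per million requests for Express. The Express duration charge is left out, so the Express number is a lower bound. The workload of one million executions with five transitions each is a made-up example.

```python
STANDARD_PRICE_PER_MILLION_TRANSITIONS = 25.0  # USD, from the notes
STANDARD_FREE_TRANSITIONS_PER_MONTH = 4_000    # free tier, from the notes
EXPRESS_PRICE_PER_MILLION_REQUESTS = 1.0       # USD, from the notes


def standard_monthly_cost(executions, transitions_per_execution):
    """Monthly Standard workflow charge in USD after the free tier."""
    transitions = executions * transitions_per_execution
    billable = max(0, transitions - STANDARD_FREE_TRANSITIONS_PER_MONTH)
    return billable * STANDARD_PRICE_PER_MILLION_TRANSITIONS / 1_000_000


def express_request_cost(executions):
    """Per-request part of the Express charge in USD (duration is billed on top)."""
    return executions * EXPRESS_PRICE_PER_MILLION_REQUESTS / 1_000_000


# Hypothetical workload: 1 million executions, 5 state transitions each.
print(standard_monthly_cost(1_000_000, 5))  # 124.9
print(express_request_cost(1_000_000))      # 1.0
```

Retries count as extra state transitions in Standard workflows, so a workload with frequent retries costs more than the nominal transition count suggests.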
Review questions
- How do Standard and Express workflows differ?
  Answer: Standard: executions up to 1 year, exactly-once execution, audit history in the console, billed per state transition ($0.025 per 1,000 transitions). Suited to long-running business processes. Express: up to 5 minutes, at-least-once (async) or at-most-once (sync), billed per duration plus invocations ($1 per million invocations). Suited to high-volume event processing, IoT, streaming. Express is much cheaper for short workloads.
- Which state type is used for conditional branching?
  Answer: The Choice state. It evaluates conditions on the input and routes to different states (if/else/switch). It cannot time out or be retried. The other state types: Task (calls a resource), Wait (delay), Pass (transform), Parallel (concurrent branches), Map (iterate over an array), Succeed, Fail. Choice, Parallel, and Map are the three most important states for workflow design.
- What use case is .waitForTaskToken for?
  Answer: It pauses the workflow and waits for an external callback. Step Functions generates a task token and includes it in the task input. When the external system (human approval, third-party API, legacy system) finishes, it calls SendTaskSuccess or SendTaskFailure with the token to resume the workflow. Typical use cases: human approval workflows (send an email, wait for approve/reject) and integration with on-premises systems that need a manual step.
- What is the Map state used for?
  Answer: The Map state iterates over an array and runs the same sub-workflow for each item, in parallel or sequentially depending on MaxConcurrency. Examples: process 100 files, send 50 emails, validate an array of records. MaxConcurrency=0 means no explicit limit on parallelism; MaxConcurrency=1 means sequential. The result is an array in matching order. This cuts batch processing time considerably compared with a sequential Lambda.
- When should you use Step Functions instead of Lambda chaining?
  Answer: Use Step Functions when: (1) the workflow is complex, with many states, branches, and error handling, so that Lambda chaining code becomes unmanageable; (2) you need visibility and an audit trail to view execution history and debug a failed step; (3) you need per-step retry logic with exponential backoff; (4) the workflow runs longer than 15 minutes (the Lambda limit); (5) you need a human approval step (.waitForTaskToken); (6) you need Parallel and Map processing. Lambda chaining suits simple linear flows of 2-3 steps.
Hands-on exercises
- Create a Standard workflow: a 3-step Lambda chain
- Add Retry + Catch to one task
- Create a Parallel state that runs 2 Lambdas concurrently
- Test a Map state with 10 items
- Set up .waitForTaskToken: SNS + Lambda callback
- Create an Express workflow and test high-frequency execution
Next: API Gateway