Week 7 - Day 4: AWS Step Functions

Learning objectives

  • Understand Step Functions as a serverless workflow orchestration service
  • Distinguish Standard and Express workflows
  • Know the state types and error handling
  • Apply it to ETL, ML pipelines, and microservices

1. Tổng quan Step Functions

AWS Step Functions is a serverless visual workflow service that orchestrates AWS services into business processes.

Key features

  • Visual workflow editor + JSON definition (ASL - Amazon States Language)
  • 200+ AWS service integrations
  • Built-in error handling (retry, catch)
  • State management (Step Functions tracks execution state, so you do not have to build your own tracking)
  • Audit trail (full execution history)
  • Long-running: up to 1 year (Standard workflow)

Use cases

  • Order processing workflows
  • ETL pipelines
  • ML training/inference pipelines
  • Microservice coordination
  • Approval workflows (with human task)
  • Saga pattern (distributed transactions)

2. Workflow Types

Standard Workflow

  • Long-running: up to 1 year
  • Exactly-once execution
  • Audit history retained 90 days
  • Pricing: per state transition ($25/M)
  • Use case: long-running business processes, batch ETL

Express Workflow

  • Short-running: up to 5 minutes
  • At-least-once execution for Async Express (a step may run more than once); at-most-once for Sync Express
  • Execution history goes to CloudWatch Logs (no built-in audit history)
  • Pricing: per execution + duration (much cheaper than Standard)
  • Use case: high-volume short workflows (IoT, mobile, streaming)

Express variants

  • Sync Express: caller waits for result (like REST)
  • Async Express: fire-and-forget (like SNS)
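
The two variants correspond to different API calls. The following is a minimal sketch, assuming a boto3 Step Functions client (for example `boto3.client("stepfunctions")`) is passed in as `sfn_client`; the helper names `run_sync` and `run_async` are hypothetical, while `start_sync_execution` and `start_execution` are the boto3 method names.

```python
import json


def run_sync(sfn_client, state_machine_arn, payload):
    """Start a Sync Express execution and block until it finishes."""
    response = sfn_client.start_sync_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    # The result comes back in the same call, much like a REST response.
    if response["status"] != "SUCCEEDED":
        raise RuntimeError(f"execution ended with status {response['status']}")
    return json.loads(response["output"])


def run_async(sfn_client, state_machine_arn, payload):
    """Start an execution and return immediately with its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )
    # No result is returned here; an Async Express run reports through CloudWatch Logs.
    return response["executionArn"]
```

Because the client is injected, the same functions can be exercised locally with a stub object before pointing them at a real state machine.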

Comparison

Feature          Standard             Express
Duration         Up to 1 year         Up to 5 minutes
Execution rate   2K/sec               100K/sec
Pricing          $25/M transitions    $1/M executions + duration (from about $0.06/GB-hour, tiered)
History          90 days              CloudWatch Logs
Use case         Long workflows       High-volume short workflows

3. State Types

Task

  • Execute single unit of work
  • Integrations: Lambda, ECS task, AWS SDK, Activity (worker)

Pass

  • Pass input to output (transformation)
  • Useful for testing, transforming data

Wait

  • Wait for time period or until timestamp

Choice

  • Conditional branching (if/else)

Parallel

  • Execute multiple branches in parallel
  • Wait for all to complete

Map

  • Execute task for each item in array (parallel)
  • Like for-each loop

Succeed / Fail

  • Terminate workflow with success/failure
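
To see several state types together, here is a small sketch that writes a definition as a Python dict and checks that every transition points at a state that exists. The state names, the `$.amount` field, and the `check_transitions` helper are illustrative assumptions; the field names (`Choices`, `Default`, `Seconds`, `Next`, `Catch`) are standard ASL.

```python
import json

# Hypothetical workflow: route on an amount, wait, then end in Succeed or Fail.
definition = {
    "StartAt": "CheckAmount",
    "States": {
        "CheckAmount": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "CoolDown"}
            ],
            "Default": "Approved",
        },
        "CoolDown": {"Type": "Wait", "Seconds": 10, "Next": "Rejected"},
        "Approved": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "AmountTooHigh", "Cause": "Over limit"},
    },
}


def check_transitions(definition):
    """Return a sorted list of transition targets that are not defined states."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    for state in states.values():
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:
            targets.append(state["Default"])
        for rule in state.get("Choices", []):
            targets.append(rule["Next"])
        for catcher in state.get("Catch", []):
            targets.append(catcher["Next"])
    return sorted({target for target in targets if target not in states})


missing = check_transitions(definition)       # empty list when all targets exist
asl_json = json.dumps(definition, indent=2)   # JSON text of the definition
```

This is only a local sanity check on top-level states; it does not look inside Parallel branches or Map processors, and it does not replace the validation Step Functions performs when you create a state machine.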

4. ASL Example

Simple Lambda task

{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111:function:ValidateOrder",
      "Next": "ChargePayment"
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111:function:ChargePayment",
      "End": true
    }
  }
}

With error handling

{
  "ChargePayment": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:111:function:ChargePayment",
    "Retry": [{
      "ErrorEquals": ["Lambda.ServiceException"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }],
    "Catch": [{
      "ErrorEquals": ["PaymentDeclined"],
      "Next": "NotifyCustomer"
    }],
    "Next": "ShipOrder"
  }
}

5. Error Handling

Retry

  • Retry on specific errors
  • Configurable: interval, max attempts, backoff rate
  • Common errors:
    • States.ALL: catch all
    • States.Timeout
    • Lambda.ServiceException
    • Custom error names from Lambda
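
To make the three Retry settings concrete, this sketch computes the wait before each retry using the values from the ChargePayment example (IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.0). The function name `retry_delays` is hypothetical, and the sketch ignores optional fields such as MaxDelaySeconds and jitter.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry attempt, in order.

    The first retry waits interval_seconds; every later retry multiplies
    the previous wait by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
# delays == [2.0, 4.0, 8.0], so the task is retried after 2 s, 4 s, and 8 s
total_wait = sum(delays)  # 14.0 seconds of waiting before the error reaches Catch
```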

Catch

  • On error, transition to specific state
  • Like try-catch in code
  • Can chain catches for different error types

Pattern

Try Task → Retry (transient errors) → Catch (business errors, handled differently)

6. Service Integration Patterns

Request-Response (default)

  • Step Functions calls service, gets response immediately
  • E.g., Lambda invoke, S3 operations

Run a Job (.sync)

  • Step Functions submits job, waits for completion
  • E.g., ECS task, EMR step, Glue job, SageMaker training

Wait for a Callback (.waitForTaskToken)

  • Step Functions pauses, waits for external system to callback
  • E.g., Human approval, manual review
  • Token returned via SendTaskSuccess or SendTaskFailure API

Example: Human approval

{
  "ApprovalRequired": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
    "Parameters": {
      "TopicArn": "arn:aws:sns:...",
      "Message": {
        "TaskToken.$": "$$.Task.Token",
        "OrderId.$": "$.orderId"
      }
    },
    "Next": "ProcessApproval"
  }
}

The approver receives the SNS notification and clicks a link; the app then calls SendTaskSuccess(token, output) and the workflow resumes.
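
The callback side of that flow might look like the sketch below. It assumes a boto3 Step Functions client is passed in as `sfn_client`; the function name `handle_decision`, the `ApprovalRejected` error name, and the output fields are hypothetical. `send_task_success` and `send_task_failure` are the boto3 names for the SendTaskSuccess and SendTaskFailure APIs.

```python
import json


def handle_decision(sfn_client, task_token, approved, reviewer):
    """Report a human decision back to the paused execution."""
    if approved:
        # The output becomes the result of the waiting Task state.
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
        return "resumed"
    # A failure can be matched by a Catch on the waiting Task state.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",
        cause=f"Rejected by {reviewer}",
    )
    return "failed"
```

The task token is the one the workflow placed in the SNS message through `$$.Task.Token`, so the approval app has to carry it from the notification to this call.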

7. Map State (Iteration)

Inline Map

  • Process array in parallel (within workflow definition)
  • Up to 40 concurrent iterations
  • Use case: small arrays passed in the state input (larger arrays still work, but only 40 iterations run at a time)
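
For comparison with the Distributed Map example further down, here is a sketch of an Inline Map state built as a Python dict. The helper name `inline_map_state`, the `$.items` path, the `Aggregate` state name, and the Lambda ARN are hypothetical; `ItemsPath`, `MaxConcurrency`, `ItemProcessor`, and `ProcessorConfig` are ASL field names.

```python
import json


def inline_map_state(items_path, processor_arn, max_concurrency, next_state):
    """Build an Inline Map state that runs one Task per array item."""
    return {
        "Type": "Map",
        "ItemsPath": items_path,            # where the array lives in the state input
        "MaxConcurrency": max_concurrency,  # upper bound on parallel iterations
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ProcessItem",
            "States": {
                "ProcessItem": {"Type": "Task", "Resource": processor_arn, "End": True}
            },
        },
        "Next": next_state,
    }


state = inline_map_state(
    items_path="$.items",
    processor_arn="arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
    max_concurrency=10,
    next_state="Aggregate",
)
state_json = json.dumps(state, indent=2)  # JSON text for the state
```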

Distributed Map (2022+)

  • Process millions of items in parallel
  • Source: S3 objects, CSV/JSON list, inventory
  • Up to 10,000 concurrent
  • Use case: process large dataset

Example: Process S3 objects under a prefix

{
  "ProcessFiles": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "input/"
      }
    },
    "MaxConcurrency": 1000,
    "ItemProcessor": {
      "ProcessorConfig": { "Mode": "DISTRIBUTED" },
      "StartAt": "ProcessOneFile",
      "States": {
        "ProcessOneFile": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:...:function:ProcessFile",
          "End": true
        }
      }
    },
    "End": true
  }
}

8. Input/Output Processing

Filtering

  • InputPath: select subset of input
  • Parameters: construct new JSON for task
  • ResultSelector: select subset of task output
  • ResultPath: where to put task output in state's output
  • OutputPath: select subset of state's output for next state

Example

State input: { "user": "alice", "order": { "id": 123 } }

InputPath: $.order
→ Task receives: { "id": 123 }

Task returns: { "status": "OK" }

ResultPath: $.result
→ State output: { "user": "alice", "order": {...}, "result": { "status": "OK" } }
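
The same example can be traced in code. This is a simplified local simulation, not how Step Functions is implemented: the helpers `select` and `apply_result_path` are hypothetical and only understand plain dotted paths such as `$.order`, not full JSONPath.

```python
import copy


def select(document, path):
    """Resolve a simple dotted path such as '$' or '$.order.id'."""
    node = document
    for key in path.split(".")[1:]:  # skip the leading '$'
        node = node[key]
    return node


def apply_result_path(state_input, result_path, result):
    """Insert the task result into a copy of the state input at result_path."""
    if result_path == "$":
        return result  # the result replaces the whole input
    output = copy.deepcopy(state_input)
    keys = result_path.split(".")[1:]
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


state_input = {"user": "alice", "order": {"id": 123}}
task_input = select(state_input, "$.order")   # InputPath: the task sees {"id": 123}
task_result = {"status": "OK"}                # what the task returns
state_output = apply_result_path(state_input, "$.result", task_result)  # ResultPath
```

`state_output` keeps `user` and `order` and gains a `result` key, which is why ResultPath is the usual way to add a task's output without losing the original input.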

9. Activities (Worker Pattern)

Definition

Activity = task processed by external worker (EC2, on-prem) instead of Lambda.

Workflow

  1. State Machine has Activity task
  2. Worker polls Step Functions for tasks (GetActivityTask)
  3. Worker processes
  4. Worker calls SendTaskSuccess / SendTaskFailure
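
The four steps above can be sketched as a single polling iteration. This assumes a boto3 Step Functions client is passed in as `sfn_client`; the function name `poll_once` and the `handler` callback are hypothetical, while `get_activity_task`, `send_task_success`, and `send_task_failure` are boto3 method names.

```python
import json


def poll_once(sfn_client, activity_arn, worker_name, handler):
    """Fetch at most one activity task, run handler on it, and report the outcome."""
    task = sfn_client.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended with no work available
    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Report the failure so the state machine's Retry/Catch can react.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

A real worker would call `poll_once` in a loop on the EC2 instance or on-premises host; keeping one iteration in its own function makes the worker easy to test with a stub client.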

Use case

  • Long-running tasks (> 15 min Lambda limit)
  • On-prem workers
  • Custom worker pool

10. Common Patterns

Pattern 1: Order processing

Validate → Authorize Payment → Ship → Send Confirmation
  Validate:          Catch → Notify
  Authorize Payment: Catch (declined) → Refund
  Ship:              Catch (out of stock) → Backorder

Pattern 2: ML pipeline

Prepare Data (Glue) → Train Model (SageMaker .sync)
                     → Deploy Endpoint → Test → Promote to production

Pattern 3: ETL with Map

Trigger by S3 event → List files in S3 → Map (parallel process each file)
                                       → Aggregate results → Load to Redshift

Pattern 4: Saga (distributed transactions)

Step 1: Book hotel → Step 2: Book flight → Step 3: Charge payment
                     on failure → rollback   on failure → rollback
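
The control flow of a saga can be illustrated in plain Python. This is not Step Functions code: in a state machine the same idea is expressed with Catch transitions that lead to compensating Task states. The `run_saga` function and the step names are hypothetical.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones if a step fails."""
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Compensate in reverse order of completion.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensation))
    return True, log


def decline():
    raise RuntimeError("card declined")


ok, log = run_saga([
    ("hotel", lambda: None, lambda: None),
    ("flight", lambda: None, lambda: None),
    ("payment", decline, lambda: None),
])
# ok is False; log records the flight and hotel bookings being undone, newest first
```

Undoing in reverse order matters because later steps may depend on earlier ones, for example a flight booking that references the hotel reservation.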

11. Security

IAM

  • State machine has IAM role
  • Role permissions for invoking targets (Lambda, ECS, etc.)
  • Cross-account: target accounts grant access

Encryption

  • Execution history: encrypted at rest by default with an AWS owned key
  • Can use a customer managed KMS key (CMK) instead

VPC

  • Tasks (Lambda, ECS) in VPC if needed
  • Step Functions itself is a regional service and does not run inside your VPC

12. Step Functions vs Lambda Choreography

Choreography (Lambdas calling Lambdas)

  • Each Lambda decides next step
  • Pros: simple for small flow
  • Cons: hard to debug, scattered logic, no central view

Orchestration (Step Functions)

  • Central workflow definition
  • Pros: visual flow, centralized error handling, easy to debug
  • Cons: extra cost (state transitions)

Decision

  • Few steps, simple flow: Lambda chain
  • Complex flow, error handling, audit: Step Functions
  • Long-running, human task: Step Functions

13. Pricing

Standard

  • $25 per million state transitions
  • Free tier: 4K transitions/month

Express

  • $1 per million executions
  • Duration charge per GB-hour of memory, tiered (from about $0.06 per GB-hour; check the pricing page for your Region)
  • Cheaper for high-volume short workflows
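
A rough cost comparison shows why. The sketch below uses the Standard price and free tier from this section; the Express duration rate is passed in by the caller because it depends on Region and usage tier, and the 0.06 used in the example is an assumed first-tier rate. The workload figures (1 million executions, 5 transitions, 0.2 s at 64 MB) are made up, and billing roundings are ignored.

```python
FREE_TRANSITIONS_PER_MONTH = 4_000
STANDARD_PRICE_PER_TRANSITION = 25 / 1_000_000   # $25 per million transitions
EXPRESS_PRICE_PER_EXECUTION = 1 / 1_000_000      # $1 per million executions


def standard_cost(executions, transitions_per_execution):
    """Monthly Standard cost in dollars after the free tier."""
    transitions = executions * transitions_per_execution
    billable = max(0, transitions - FREE_TRANSITIONS_PER_MONTH)
    return billable * STANDARD_PRICE_PER_TRANSITION


def express_cost(executions, seconds_per_execution, memory_mb, gb_hour_rate):
    """Monthly Express cost in dollars: request charge plus duration charge."""
    gb_hours = executions * seconds_per_execution / 3600 * memory_mb / 1024
    return executions * EXPRESS_PRICE_PER_EXECUTION + gb_hours * gb_hour_rate


ASSUMED_GB_HOUR_RATE = 0.06  # assumption; look up the current rate before relying on it

standard = standard_cost(executions=1_000_000, transitions_per_execution=5)
express = express_cost(
    executions=1_000_000,
    seconds_per_execution=0.2,
    memory_mb=64,
    gb_hour_rate=ASSUMED_GB_HOUR_RATE,
)
```

Under these assumptions the Standard workflow costs about $124.90 a month and the Express workflow about $1.21, so the gap comes almost entirely from paying per transition versus per execution.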

Review questions

  1. How do Standard and Express workflows differ?

    Answer: Standard workflows run for up to 1 year, execute exactly once, keep an audit history in the console, and are billed per state transition ($0.025 per 1,000 transitions). They suit long-running business processes. Express workflows run for up to 5 minutes, execute at-least-once (async) or at-most-once (sync), and are billed per execution plus duration ($1 per million executions). They suit high-volume event processing, IoT, and streaming. Express is much cheaper for short workloads.

  2. Which state type is used for conditional branching?

    Answer: The Choice state. It evaluates conditions against the input and routes to different states (if/else/switch). It cannot have a timeout or a Retry. The other state types are Task (call a resource), Wait (delay), Pass (transform), Parallel (concurrent branches), Map (iterate over an array), Succeed, and Fail. Choice, Parallel, and Map are the three most important states for workflow design.

  3. What is .waitForTaskToken used for?

    Answer: It pauses the workflow and waits for an external callback. Step Functions generates a task token and includes it in the task input. When the external system (human approval, third-party API, legacy system) finishes, it calls SendTaskSuccess or SendTaskFailure with the token to resume the workflow. Typical use cases: a human approval workflow (send an email, wait for someone to approve or reject) and integration with on-premises systems that need a manual step.

  4. What is the Map state used for?

    Answer: The Map state iterates over an array and runs the same sub-workflow for each item, either in parallel or sequentially, controlled by MaxConcurrency. Examples: process 100 files, send 50 emails, validate an array of records. MaxConcurrency=0 sets no limit on parallelism; MaxConcurrency=1 runs items sequentially. The result is an array in the same order as the input. This cuts batch processing time considerably compared with calling Lambda sequentially.

  5. When should you use Step Functions instead of Lambda chaining?

    Answer: Use Step Functions when: (1) the workflow is complex, with many states, branches, and error handling, so that Lambda chaining code becomes unmanageable; (2) you need visibility and an audit trail, to view execution history and debug a failed step; (3) you need per-step retry logic with exponential backoff; (4) the workflow runs longer than 15 minutes (the Lambda limit); (5) you need a human approval step (.waitForTaskToken); (6) you need Parallel and Map processing. Lambda chaining suits simple linear flows of 2-3 steps.

Hands-on exercises

  • Create a Standard workflow: a 3-step Lambda chain
  • Add Retry + Catch to one task
  • Create a Parallel state that runs 2 Lambdas concurrently
  • Test a Map state with 10 items
  • Set up .waitForTaskToken: SNS + Lambda callback
  • Create an Express workflow and test high-frequency execution
