Note: The section Wrapping Up links to the complete project codebase. You can always return to this guide for a more detailed walkthrough of the project design and infrastructure.
Overview
In this post, we will create a Python 3-based daily scraper that gathers key performance indicators and metrics for actively listed Exchange-Traded Funds (ETFs) using the Alpha Vantage and Yahoo Finance APIs. The scraper leverages several AWS services for automated, hands-off data collection and storage.
At a high level, the scraper involves the following resources and tools:
AWS Services and Tools
- AWS CloudFormation: Automates and templatizes the provisioning of AWS resources.
- Terraform: An alternative to CloudFormation for infrastructure as code.
- Amazon VPC: Isolates the compute resources within a logically isolated virtual network.
- AWS Fargate: Runs the containerized scraping application code.
- AWS Lambda: Triggers the Fargate task to run the scraper.
- Amazon EventBridge: Schedules the daily execution of the Lambda function.
- Amazon ECR: Stores Docker images used by the AWS Fargate tasks.
- Amazon S3: Stores the scraped ETF data as well as the Lambda function source code.
- AWS IAM: Creates roles and policies for AWS principals to interact with each other.
- Amazon CloudWatch: Logging for Lambda and Fargate tasks.
Development and Deployment Tools
- Poetry: Manages the dependencies of the project.
- Docker: Containerizes the application code.
- Boto3: AWS SDK for Python to interact with AWS services.
- GitHub Actions: Automates the deployment processes to ECR and Lambda directly from the GitHub repository.
API Setup
The Alpha Vantage API offers a wide range of financial data, including stock time series, technical and economic indicators, and intelligence capabilities. To access the hosted endpoints, we need to claim a free API key from Alpha Vantage’s website. This key will be used as an environment variable in our scraper code.
The Yahoo Finance API, accessed via the yfinance Python package, provides a simple interface to obtain key performance indicators and metrics for ETFs. The package is not an official Yahoo Finance API but is widely used for financial data extraction.
Important Considerations
Alpha Vantage API: The free tier allows up to 25 requests per day. More details can be found in the support section of Alpha Vantage’s website. In this project, we will use the Listing & Delisting Status endpoint, which returns a list of active or delisted US stocks and ETFs as of the latest trading day.
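As a rough sketch, the listing-status request can be made with plain `requests`; the helper name and the CSV column names used for filtering are assumptions to verify against the endpoint documentation:

```python
import csv
import io
import os

import requests

ALPHA_VANTAGE_URL = "https://www.alphavantage.co/query"


def fetch_active_etfs(api_key: str) -> list[dict]:
    """Fetch the latest listing status and keep only active ETFs.

    The endpoint returns a CSV payload; column names such as `assetType`
    and `ipoDate` should be verified against the Alpha Vantage docs.
    """
    response = requests.get(
        ALPHA_VANTAGE_URL,
        params={"function": "LISTING_STATUS", "state": "active", "apikey": api_key},
        timeout=30,
    )
    response.raise_for_status()
    reader = csv.DictReader(io.StringIO(response.text))
    return [row for row in reader if row.get("assetType") == "ETF"]


if __name__ == "__main__":
    etfs = fetch_active_etfs(os.environ["API_KEY"])
    print(f"Found {len(etfs)} active ETFs")
```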
Yahoo Finance API: There are no officially documented usage limits (that I am aware of). However, to avoid triggering Yahoo’s blocker, the package author recommends respecting the rate-limiter as documented in the Smarter Scraping section of the readme.
For this project, we will limit our requests to 60 per minute, which is sufficient to gather data for thousands of ETFs within a reasonable time frame (roughly one hour). The throttling strategy can be fine-tuned based on the specific requirements of the project.
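A minimal sketch of such a rate-limited session, following the pattern recommended in the yfinance readme (it assumes the `requests-cache`, `requests-ratelimiter`, and `pyrate-limiter` packages are installed; newer yfinance releases may handle sessions differently):

```python
import yfinance as yf
from pyrate_limiter import Duration, Limiter, RequestRate
from requests import Session
from requests_cache import CacheMixin, SQLiteCache
from requests_ratelimiter import LimiterMixin, MemoryQueueBucket


class CachedLimiterSession(CacheMixin, LimiterMixin, Session):
    """Session that caches responses and rate-limits outgoing requests."""


# At most 60 requests per minute, with a local SQLite cache to avoid
# refetching identical responses during the same run.
session = CachedLimiterSession(
    limiter=Limiter(RequestRate(60, Duration.MINUTE)),
    bucket_class=MemoryQueueBucket,
    backend=SQLiteCache("yfinance.cache"),
)

spy = yf.Ticker("SPY", session=session)
print(spy.info.get("previousClose"))
```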
Infrastructure Setup
To keep the resource creation process organized, we will use CloudFormation yaml templates or Terraform modules to break the AWS resources into manageable components that can be easily deployed and torn down as logical units. The following diagram depicts the entire infrastructure for the ETF KPIs scraper on AWS cloud:
Virtual Private Cloud
AWS Fargate requires a virtual private cloud to function properly. Every AWS account comes with a default VPC with a public subnet in each availability zone. For this project, we create a new VPC using either the private subnets template (link) or the public subnets template (link).
Public Subnets
In this setup, the AWS Fargate task that runs the containerized application is placed in public subnets, which allows the application code to directly access the internet via the internet gateway. The table below summarizes the key components from the template:
| Component | Description |
|---|---|
| VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
| Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
| Public Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) with public IP addresses across two availability zones. These subnets are routable to the internet through the internet gateway. |
| Route Table | `RouteTable` handles routes from the public subnets to the internet gateway. |
| Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
The diagram below depicts the infrastructure in the `us-east-1` region:
Private Subnets
In this stack, the AWS Fargate task is placed in private subnets. The table below summarizes the key components from the template:
| Component | Description |
|---|---|
| VPC | A virtual private network with a CIDR block of `10.0.0.0/16`, providing 65,536 IP addresses. |
| Internet Gateway | Connects the VPC to the internet, defined by the `InternetGateway` and `AttachGateway` resources. Each VPC can have only one internet gateway. |
| Public Subnets | Two subnets (`10.0.1.0/24` and `10.0.2.0/24`) with public IP addresses across two availability zones. |
| Private Subnets | Two subnets (`10.0.3.0/24` and `10.0.4.0/24`) without public IP addresses across two availability zones. |
| NAT Gateways | Located in the public subnets, NAT gateways (`NATGateway1` in `PublicSubnet1` and `NATGateway2` in `PublicSubnet2`) allow instances in the private subnets to connect to the internet. Each NAT gateway is in a different availability zone to ensure robust internet access, even in the event of an outage in an availability zone. |
| Elastic IPs | Public IP addresses associated with the NAT gateways. Each NAT gateway must have an Elastic IP for internet connectivity. |
| Route Tables | `RouteTablePublic` handles routes for the public subnets to the internet gateway, while `RouteTablePrivate1` and `RouteTablePrivate2` manage routes for the private subnets to the NAT gateways. |
| Security Groups | The `SecurityGroup` resource for this project allows all outbound traffic for API calls but does not allow any inbound traffic. |
This infrastructure can be visualized as follows:
Which Subnet Setup Should We Choose?
When deciding between public and private subnets, consider the following definitions from the official documentation:
- Public Subnet: The subnet has a direct route to an internet gateway. Resources in a public subnet can access the public internet.
- Private Subnet: The subnet does not have a direct route to an internet gateway. Resources in a private subnet require a NAT device to access the public internet.
As we shall see, the ETF KPIs scraper code only needs outbound traffic to make API calls, and no inbound traffic is expected. Therefore, the two setups are functionally equivalent for this project. Regardless of whether the AWS Fargate task is deployed in public subnets or private subnets, we can specify a security group with rules that prevent any inbound traffic and allow only outbound traffic.
Still, while the public subnets setup may seem simpler, it increases the risk of misconfiguration at the security group level if any inbound traffic is in fact required. The best practice, in general, is to deploy in private subnets and use NAT gateways to access the internet.
S3 & Elastic Container Registry
Two critical resources are:
- S3 Bucket: Stores the ETF data scraped by the application, the packaged Lambda function source code, and the environment file for the Fargate task.
- ECR Repository: Stores the Docker image used by the AWS Fargate task to run the scraper.
The CloudFormation template (link) is parameterized with the following inputs from the user:
- `S3BucketName` (String): The name of the S3 bucket to be created.
- `ECRRepoName` (String): The name of the ECR repository to be created.
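As a point of reference, the equivalent resources can be sketched with `boto3` (purely illustrative; in this project they are created by the CloudFormation template or Terraform, and the names below are placeholders for the two parameters):

```python
import boto3

# Placeholder values; in this project these are the CloudFormation
# parameters S3BucketName and ECRRepoName.
S3_BUCKET_NAME = "etf-kpis-scraper-bucket"
ECR_REPO_NAME = "etf-kpis-scraper"

s3 = boto3.client("s3", region_name="us-east-1")
ecr = boto3.client("ecr", region_name="us-east-1")

# us-east-1 needs no CreateBucketConfiguration; other regions require
# CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=S3_BUCKET_NAME)

ecr.create_repository(
    repositoryName=ECR_REPO_NAME,
    imageScanningConfiguration={"scanOnPush": True},
)
```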
IAM Roles and Policies
To run this ETF data scraper, we need to set up various IAM roles and policies to give principals (i.e., AWS services like ECS and Lambda) permissions to interact with each other. In addition, we need to create a role with the necessary permissions for workflows to automate the deployment tasks. The following CloudFormation template (link) defines these roles and policies for Lambda, ECS, and GitHub Action workflows.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket created earlier.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
- `ECRRepoArn` (String): The ARN of the ECR repository created earlier.
- `GithubUsername` (String): The GitHub username.
- `GithubRepoName` (String): The GitHub repository name.

The last two parameters ensure that only GitHub Actions from the specified repository (and the `main` branch) can assume the role with permissions to update the Lambda function code and push Docker images to ECR.
Compared to using an IAM user with long-term credentials stored as repository secrets, creating roles assumable by workflows with short-term credentials is a more secure method, and it is the approach recommended by AWS for automating deployment tasks. To learn more, consider exploring the AWS and GitHub documentation on OpenID Connect (OIDC) authentication for workflows.
Lambda Execution Role
The Lambda execution role allows Lambda to interact with other AWS services.
- Role Name: `${AWS::StackName}_lambda_execution_role`
- Policies:
- LambdaLogPolicy: Allows Lambda to write logs to CloudWatch.
- LambdaECSPolicy: Allows Lambda to run ECS tasks.
- LambdaIAMPolicy: Allows Lambda to pass the ECS execution role and task role to ECS; this policy is useful for restricting the Lambda function to only pass specified roles to ECS.
ECS Execution Role
The ECS execution role allows ECS to interact with other AWS services.
- Role Name: `${AWS::StackName}_ecs_execution_role`
- Policies:
- ECSExecutionPolicy: Allows ECS to log in to and pull images from ECR, write logs to CloudWatch, and get environment files from S3.
ECS Task Role
The ECS task role allows the Fargate task to interact with S3, enabling the application code to upload the scraped data. The task role should contain all permissions required by the application code running in the container. It is separate from the ECS execution role, which is used by ECS to manage the task and not by the task itself.
- Role Name: `${AWS::StackName}_ecs_task_role`
- Policies:
- ECSTaskPolicy: Allows the Fargate task to upload and get objects from S3.
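To make the separation concrete, here is a minimal sketch of the kind of S3 call the task role must authorize from inside the container; the helper name, Parquet choice, and object key layout are illustrative rather than taken from the project's actual modules:

```python
import io
import os

import boto3
import pandas as pd


def upload_scraped_data(df: pd.DataFrame, key: str) -> None:
    """Upload a DataFrame of scraped KPIs to the project bucket.

    s3:PutObject (and s3:GetObject for reads) on this bucket is all the
    task role needs to grant for this call to succeed.
    """
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # requires pyarrow in the image
    boto3.client("s3").put_object(
        Bucket=os.environ["S3_BUCKET"],
        Key=key,
        Body=buffer.getvalue(),
    )
```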
GitHub Actions Role
We enable workflows to authenticate with AWS through GitHub's OIDC provider, facilitating secure and direct interactions with AWS services without needing to store long-term credentials as secrets.
- Role Name: `${AWS::StackName}_github_actions_role`
- Trust Relationship:
  - Establish a trust relationship with GitHub's OIDC provider, allowing it to assume this role when authenticated via OIDC.
  - Access is restricted to actions triggered by a push to the `main` branch of the specified GitHub repository, ensuring that only authorized code changes can initiate AWS actions.
- Policies:
- GithubActionsPolicy: Allows the workflows that assume this role to update the Lambda function, push Docker images to ECR, and interact with S3.
Outputs
The template outputs the ARNs of the roles, which can then be accessed from the console:
- LambdaExecutionRoleArn: ARN of the Lambda execution role.
- ECSExecutionRoleArn: ARN of the ECS execution role.
- ECSTaskRoleArn: ARN of the ECS task role.
- GithubActionsRoleArn: ARN of the GitHub Actions role.
AWS Fargate
The CloudFormation template (link) for AWS Fargate requires the following parameters:
The IAM role ARNs:
- `ECSExecutionRoleArn` (String): The ARN of the ECS execution role exported from the IAM template.
- `ECSTaskRoleArn` (String): The ARN of the ECS task role exported from the IAM template.
The Task Definition Parameters:
- `CpuArchitecture` (String): The CPU architecture of the task. Default is `X86_64`. Important: Ensure this is compatible with the architecture for which the Docker image is built if multi-platform builds are not used.
- `OperatingSystemFamily` (String): The operating system family of the task. Default is `LINUX`.
- `Cpu` (Number): The hard limit of CPU units for the task. Default is `1024` (i.e., 1 vCPU).
- `Memory` (Number): The hard limit of memory (in MiB) to reserve for the container. Default is `2048` (i.e., 2 GB).
- `SizeInGiB` (Number): The amount of ephemeral storage (in GiB) to reserve for the container. Default is `21`.
Other parameters:
- `EnvironmentFileS3Arn` (String): The S3 ARN of the environment file for the container. This file contains the environment variables required by the application code. More details on the environment file are in the Application Code section below.
- `ECRRepoName` (String): The name of the ECR repository created earlier.
Cluster
An AWS Fargate task is typically run in a cluster, which is a logical grouping of tasks. The template linked above creates an ECS cluster with the following properties:
- ClusterSettings: Enables container insights for the cluster, which automatically collects usage metrics for CPU, memory, disk, and network.
- CapacityProviders: Specifies `FARGATE` and `FARGATE_SPOT` (i.e., interruption-tolerant tasks at a discounted rate relative to on-demand) as capacity providers to optimize cost and availability.
- DefaultCapacityProviderStrategy: Distributes tasks evenly between `FARGATE` and `FARGATE_SPOT`.
- Configuration: Enables `ExecuteCommandConfiguration` with `DEFAULT` logging using `awslogs`, which uses the logging configurations defined in the container definition.
Task & Container Definitions
The task definition specifies the IAM roles and compute resources for the task, while the container definition specifies the Docker image, the environment file location, and the logging configuration for the container.
Important: AWS Fargate requires the `awsvpc` network mode, which provides each task with its own elastic network interface, improving isolation and security. In our Lambda function code (link), we use the `boto3` library to run the AWS Fargate task, specifying the subnets and security group to attach to the network interface. In addition, ensure that the `awslogs-create-group: "true"` option is set in the container definition so that a log group is created for the container.
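As a rough sketch (not the repository's exact `lambda_function.py`), the `boto3` call might look like the following, with the cluster, task definition, container, subnet, and security group values read from the Lambda environment variables described later in this guide:

```python
import os

import boto3

ecs = boto3.client("ecs")


def run_scraper_task(env: str = "prod") -> dict:
    """Launch the Fargate task with an awsvpc network configuration."""
    return ecs.run_task(
        cluster=os.environ["ECS_CLUSTER_NAME"],
        taskDefinition=os.environ["ECS_TASK_DEFINITION"],
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": [os.environ["SUBNET_1"], os.environ["SUBNET_2"]],
                "securityGroups": [os.environ["SECURITY_GROUP"]],
                # ENABLED for public subnets, DISABLED for private subnets
                "assignPublicIp": os.environ.get("ASSIGN_PUBLIC_IP", "DISABLED"),
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": os.environ["ECS_CONTAINER_NAME"],
                    "environment": [{"name": "ENV", "value": env}],
                }
            ]
        },
    )
```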
Outputs
The template outputs three values:
- ECSFargateClusterName: The name of the ECS cluster.
- ECSFargateTaskDefinitionFamily: The name of the task definition family.
- ECSFargateContainerName: The name of the container within the task definition.
All the above will be used as environment variables in the Lambda function to properly trigger the AWS Fargate task.
Lambda & EventBridge
The last CloudFormation template (link) sets up an AWS Lambda function and an Amazon EventBridge rule to automate the execution of our ETF KPIs scraper.
The template requires the following parameters:
- `S3BucketName` (String): The name of the S3 bucket where the Lambda function code is stored.
- `EventBridgeScheduleExpression` (String): The schedule expression for the EventBridge rule (e.g., `rate(1 day)`), which defines how frequently the Lambda function is triggered.
- `LambdaExecutionRoleArn` (String): The ARN of the Lambda execution role, which grants the Lambda function the necessary permissions to interact with other AWS services.
- `Architectures` (String): The architecture of the Lambda function (`x86_64` or `arm64`). Default is `x86_64`.
- `Runtime` (String): The runtime environment for the Lambda function. Default is `python3.11`.
- `Timeout` (Number): The timeout duration for the Lambda function in seconds. Default is `30` seconds. Since the Lambda function simply triggers the AWS Fargate task and does not perform the scraping itself, the timeout can be set to a lower value.
Lambda Function
The Lambda function is responsible for invoking the AWS Fargate task via `boto3`, which runs the ETF KPIs scraper application code. The following properties are important to note:

- Handler: The handler method within the Lambda function's code. For this project, it is `lambda_function.lambda_handler`. In general, it should match the file name and the method name in the source code.
- Runtime: The runtime environment for the Lambda function, set to `python3.11` to match the Python version specified in `pyproject.toml`.
- Code: The location in S3 where the Lambda function's deployment package (ZIP file) is stored.
EventBridge Rule
The EventBridge rule triggers the Lambda function on a predefined schedule. For this project, the expression is set to `cron(00 22 ? * MON-FRI *)`, which triggers the Lambda function at 10 PM UTC (i.e., 5 PM EST or 4 PM CST) from Monday to Friday, after the market closes.
EventBridge allows us to create a serverless, time-based trigger for our Lambda function. This means we can automate the scraping task to run at regular intervals (e.g., daily), ensuring timely data collection without manual scheduling.
Important: To allow EventBridge to invoke the Lambda function, we need to grant this principal the necessary permission to invoke the function.
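In the CloudFormation template this is expressed as a Lambda permission resource; for intuition, the equivalent one-off `boto3` call might look roughly like this (the function name and rule ARN are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Grant the EventBridge service principal permission to invoke the function.
lambda_client.add_permission(
    FunctionName="etf-kpis-scraper-trigger",
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/etf-kpis-daily",
)
```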
Cost Considerations
The cost of running this project on AWS will depend on the frequency of data collection, the number of ETFs scraped, and the AWS services usage and infrastructure decisions.
All estimates below are generated using the AWS Pricing Calculator.
Regardless, to optimize costs and ensure efficient resource usage, it's essential to fine-tune the resources to match the actual requirements of the project. The Container Insights feature in CloudWatch is helpful for monitoring the performance of the Fargate tasks. This feature is already enabled in the ECS Cluster template, allowing us to track metrics such as CPU and memory usage, network traffic, and task health.
Potentially Non-Negligible Costs
VPC
NAT Gateway:
- Gateway usage: \(730\) hours/month \(\times\) \(\$0.045\)/hour \(= \$32.85\)
- Data processing: \(3\) GB/month \(\times\) \(\$0.045\)/GB \(= \$0.14\) (this may vary depending on the data processed)
- Total NAT Gateway cost: \(\$32.99\)/month
Note: This cost may be avoided by using a public subnet setup or VPC endpoints.
Public IPv4 Address:
- \(1\) address \(\times\) \(730\) hours/month \(\times\) \(\$0.005\)/hour \(= \$3.65\)
- Total Public IPv4 Address cost: \(\$3.65\)/month
Potentially Negligible Costs & Free Tier
Fargate
Assuming \(21\) trading days per month, the cost of running the Fargate task once per trading day with the following resources is:
- vCPU hours:
- \(21\) tasks \(\times\) \(1\) vCPU \(\times\) \(0.67\) hours \(\times\) \(\$0.04048\)/vCPU-hour \(= \$0.57\)
- GB hours:
- \(21\) tasks \(\times\) \(2.00\) GB \(\times\) \(0.67\) hours \(\times\) \(\$0.004445\)/GB-hour \(= \$0.13\)
- Ephemeral storage:
- \(20\) GB (no additional charge)
- Total Fargate cost: \(\$0.70\)/month
Lambda
Assuming the Lambda function is triggered \(21\) times per month, the cost of running the Lambda function falls within the free tier:
- Memory allocation:
- \(128\) MB (\(0.125\) GB)
- Ephemeral storage:
- \(512\) MB (\(0.5\) GB)
- Compute time:
- \(21\) requests \(\times\) \(2{,}000\) ms \(= 42\) seconds (\(5.25\) GB-s)
- Free tier:
- \(400,000\) GB-s and \(1,000,000\) requests
- Billable GB-s and requests:
- \(0\) GB-s, \(0\) requests
- Total Lambda cost: \(\$0.00\)/month
S3 and ECR
For this project, the number of ETFs scraped is in the thousands and the resulting files are on the order of kilobytes. The cost of storing the data in S3 per month is negligible:
- Storage:
- \(1\) GB \(\times\) \(\$0.023\)/GB-month \(= \$0.02\)
- Total S3 cost: \(\$0.02\)/month
Similarly, the size of the Docker image for this project is \(\sim 142\) MB:
- Storage:
- \(142.16\) MB \(\times\) \(0.0009765625\) GB/MB \(= 0.1388\) GB/month, which at \(\$0.10\)/GB-month comes to about \(\$0.0139\)
- Total ECR cost: \(\$0.0139\)/month
EventBridge Scheduler
- Invocations:
- \(21\) invocations (first \(14,000,000\) free)
- Total EventBridge cost: \(\$0.00\)/month
CloudWatch
Assuming we are only storing logs for the Lambda function and Fargate task, the cost of CloudWatch is very much negligible:
- Total CloudWatch cost: \(\$0.00\)/month
Total Estimated Monthly Cost
Service | Monthly Cost |
---|---|
NAT Gateway | \(\$32.99\)/month |
Public IPv4 Address | \(\$3.65\)/month |
Fargate | \(\$0.70\)/month |
Lambda | \(\$0.00\)/month |
S3 | \(\$0.02\)/month |
ECR | \(\$0.0139\)/month |
EventBridge Scheduler | \(\$0.00\)/month |
CloudWatch | \(\$0.00\)/month |
Total | \(\$37.3739\)/month |
This breakdown provides an estimate of the monthly costs associated with the project. The biggest cost driver is the NAT Gateway, which can be avoided by using a public subnet setup. The costs of Fargate, Lambda, S3, ECR, EventBridge, and CloudWatch are all within the free tier or negligible for this project.
GitHub Actions Workflows
To automate the deployment processes, we use two workflows:
- `ecr_deployment.yaml` (link): Builds and pushes the Docker image to Amazon ECR.
- `lambda_deployment.yaml` (link): Zips the Lambda source code, uploads it to S3, and reflects the changes in the Lambda function via an `update` operation.
The workflows require the following GitHub secrets:
- AWS_GITHUB_ACTIONS_ROLE_ARN: The ARN of the GitHub Actions role created in the IAM template.
- AWS_REGION: The AWS region where the resources are deployed.
- ECR_REPOSITORY: The name of the ECR repository for storing Docker images (for `ecr_deployment.yaml`).
- S3_BUCKET: The name of the S3 bucket where the Lambda function code is stored (for `lambda_deployment.yaml`).
- LAMBDA_FUNCTION: The name of the Lambda function to update (for `lambda_deployment.yaml`).
Both workflows are triggered on a push to the `main` branch if certain pre-specified files are modified. The `ecr_deployment.yaml` workflow is triggered if any of the following files are modified:

- `Dockerfile`: Changes to the Dockerfile
- `src/**`: Changes to the source code
- `!src/deploy_stack.py`: Exclude changes to the deployment script since it is not part of the Docker image
- `main.py`: Changes to the entry point of the application
- `pyproject.toml` & `poetry.lock`: Changes to dependencies

The `lambda_deployment.yaml` workflow is triggered only if the `lambda_function.py` file is modified.
Application Code
The application code consists of the following files:
├── main.py
└── src
    ├── __init__.py
    ├── api.py
    └── utils.py
Modules
The `main.py` script (link) serves as the entry point for the scraper application. Its primary function is to orchestrate the overall scraping process, which includes querying ETF data from external APIs, processing the data, and writing the results to an S3 bucket. The following environment variables are required:

- `API_KEY` (String): The Alpha Vantage API key.
- `S3_BUCKET` (String): The name of the S3 bucket where the scraped data should be stored.
- `IPO_DATE` (String): The cutoff date for the IPO status check. The scraper will only fetch data for ETFs that were listed on or after this date. The format is `YYYY-MM-DD`.
- `MAX_ETFS` (Int): The maximum number of ETFs to scrape. This can be set to a lower value for testing purposes.
- `PARQUET` (String): Whether to save the scraped data as Parquet files. If set to the string value `True`, the data is saved in Parquet format; otherwise, it is saved as a CSV flat file.
In addition, the following environment variable controls the execution of the scraper:

- `ENV` (String): The environment in which the scraper is running. Set to `dev` by default in `main.py` and can be overridden to `prod` when the Lambda function is triggered.
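For illustration, a hedged sketch of how `main.py` might read these variables (the defaults shown here are placeholders, not necessarily the project's):

```python
import os

# Placeholder defaults for illustration; the actual defaults live in main.py.
API_KEY = os.environ["API_KEY"]
S3_BUCKET = os.environ["S3_BUCKET"]
IPO_DATE = os.getenv("IPO_DATE", "2010-01-01")
MAX_ETFS = int(os.getenv("MAX_ETFS", "10"))
# PARQUET arrives as a string, so compare against the literal value "True"
SAVE_AS_PARQUET = os.getenv("PARQUET", "False") == "True"
# Defaults to dev locally; the Lambda trigger overrides it to prod
ENV = os.getenv("ENV", "dev")
```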
Important: Rather than creating separate `dev` and `prod` environments with distinct resources (e.g., S3 buckets, ECR repositories, Lambda functions), this approach uses a single environment. Differentiation is achieved by setting the `env` variable to `prod` within the Lambda environment variables and configuring the `ecs.run_task` method in `lambda_function.py` with `"environment": [{"name": "ENV", "value": env}]`. This can be overridden by passing `event = {"env": "dev"}` to the `lambda_handler` function in `lambda_function.py`.
The `api.py` module (link) contains the `query_etf_data` function, which is responsible for fetching ETF data from the Alpha Vantage and Yahoo Finance APIs.
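As a sketch of what such a query can look like, the snippet below pulls a handful of fields from the `yfinance` `info` dictionary; the helper name and the exact key names (e.g., `navPrice`, `beta3Year`) are assumptions to verify against the actual `api.py`:

```python
import yfinance as yf

# Assumed yfinance `info` keys backing the KPIs described in the next section;
# verify these against the project's api.py module.
KPI_FIELDS = [
    "previousClose", "navPrice", "trailingPE", "volume", "averageVolume",
    "bid", "bidSize", "ask", "askSize", "category", "beta3Year",
    "ytdReturn", "threeYearAverageReturn", "fiveYearAverageReturn",
]


def query_etf_kpis(symbol: str, session=None) -> dict:
    """Return a flat dict of KPIs for a single ETF symbol."""
    info = yf.Ticker(symbol, session=session).info
    return {"symbol": symbol, **{field: info.get(field) for field in KPI_FIELDS}}
```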
Key Performance Indicators
The key performance indicators (KPIs) and metrics fetched and processed by the scraper include:
- Previous Close: The last closing price of the ETF, useful for understanding recent prices of the ETF.
- NAV Price: Net Asset Value price, which represents the value of each share’s portion of the fund’s underlying assets and cash at the end of the trading day.
- Trailing P/E: Trailing price-to-earnings ratio, indicating the ETF’s valuation relative to its earnings.
- Volume: The total number of shares traded during the last trading day.
- Average Volume: The average number of shares traded over a specified period, e.g., 30 days.
- Bid and Ask Prices: The highest price a buyer is willing to pay (bid) and the lowest price a seller is willing to accept (ask), along with their respective sizes. More details on bid size can be found here.
- Category: The classification of the ETF, providing context on the type of assets it holds.
- Beta (Three-Year): A volatility measure of the ETF relative to the market, typically proxied by the S&P 500, over the past three years.
- YTD Return: Year-to-date return, measuring the ETF’s performance since the first trading day of the current calendar year.
- Three-Year and Five-Year Average Returns: The average returns over the past three and five years, respectively, providing long-term performance insights. These are derived from the compound annual growth rate (CAGR) formula below, where \(n\) is the number of years:
\[\begin{align*} \text{CAGR} = \left( \frac{\text{Ending Value}}{\text{Beginning Value}} \right)^{\frac{1}{n}} - 1 \end{align*}\]
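For concreteness, a small helper implementing this formula, where `years` corresponds to \(n\):

```python
def cagr(ending_value: float, beginning_value: float, years: float) -> float:
    """Compound annual growth rate over `years` years."""
    return (ending_value / beginning_value) ** (1 / years) - 1


# A fund growing from $100 to $133.10 over 3 years has a CAGR of ~10% per year.
print(round(cagr(133.10, 100.0, 3), 4))  # 0.1
```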
Dockerfile
The application code is containerized using Docker to be deployed on the AWS cloud. The following `Dockerfile` (link) takes a multi-stage approach to build the image efficiently.
- Base Stage (`python-base`):
  - Sets up a lightweight Python environment using a base image.
  - Sets essential environment variables for optimized performance and dependency management.
- Builder Stage (`builder`):
  - Installs system dependencies and Poetry.
  - Copies the `pyproject.toml` and `poetry.lock` files onto the container and installs the project dependencies in a virtual environment.
- Production Stage (`production`):
  - Copies the project directory with dependencies from the builder stage.
  - Copies the application code onto the container.
  - Sets the working directory and specifies the command to run the application using the Python interpreter from the virtual environment created during the builder stage.
This `Dockerfile` is adapted from a discussion in the Poetry GitHub repository.
Deployment
There are two options for deploying the AWS resources:
- Using the AWS Console:
- Use an IAM entity with the necessary permissions, e.g., a user with administrator access.
- This is straightforward as everything is accomplished via a graphical user interface.
- Using Terraform:
- The Terraform modules (link) can be used to deploy the resources via the Terraform CLI (link).
- The AWS CLI must be installed (link) and configured as documented (link).
  - The default `profile` in the configured credentials file must correspond to an IAM user with either administrator access or a more granular permission set.
  - In this project, the `admin` profile is referenced as an example. This can be changed to the desired profile name. See the documentation for setting up single sign-on (SSO) profiles with the AWS CLI.
Steps to Deploy Via the AWS Console
When creating the stacks using the console, follow the steps in the order specified below:
- VPC Stack: Create the VPC stack.
- S3 & ECR Stack: Create the S3 & ECR stack.
- IAM Stack: Create the IAM stack.
- Add Secrets to GitHub: See details on how to add secrets to GitHub.
  - `AWS_GITHUB_ACTIONS_ROLE_ARN`: The ARN of the GitHub Actions role created in the IAM template.
  - `AWS_REGION`: The AWS region where the resources are deployed.
  - `ECR_REPOSITORY`: The name of the ECR repository created in the S3 & ECR template.
  - `S3_BUCKET`: The name of the S3 bucket created in the S3 & ECR template.
- Trigger the `ecr_deployment.yaml` Workflow: This workflow builds the Docker image and pushes it to ECR.
- Run the `scripts/upload_env_to_s3.sh` (link) and `scripts/zip_lambda_to_s3.sh` (link) Scripts: These scripts upload the `.env` environment file and the Lambda function code to the S3 bucket we created, respectively. If the AWS CLI is not installed, manual upload to the S3 bucket using the web UI also works. Add execute permissions to the scripts using `chmod +x scripts/upload_env_to_s3.sh scripts/zip_lambda_to_s3.sh`.
- AWS Fargate Stack: Create the AWS Fargate stack, which has dependencies on the ECS execution role, ECS task role, ECR repository name, and the ARN of the environment file in S3 from previous steps.
- Lambda & EventBridge Stack: Create the Lambda & EventBridge stack.
- Add One More Secret to GitHub:
  - `LAMBDA_FUNCTION`: The name of the Lambda function created in the previous step.
- Add Environment Variables to Lambda: See details on configuring environment variables in Lambda. The environment variables required by the Lambda function are:
  - `ASSIGN_PUBLIC_IP`: Set to `ENABLED` when using public subnets and `DISABLED` when using private subnets.
  - `ECS_CLUSTER_NAME`: Output from the AWS Fargate stack.
  - `ECS_CONTAINER_NAME`: Output from the AWS Fargate stack.
  - `ECS_TASK_DEFINITION`: Output from the AWS Fargate stack.
  - `SECURITY_GROUP`: Output from the VPC stack.
  - `SUBNET_1`: Output from the VPC stack.
  - `SUBNET_2`: Output from the VPC stack.
  - `env`: Set to `prod`, but can be overridden to `dev` by passing `{'env': 'dev'}` as the test `event`.
Steps to Deploy Via Terraform
Install AWS CLI
The AWS command line interface is a tool for managing AWS services from the command line. Follow the installation instructions here to install the tool for your operating system. Verify the installation by running the following commands:
$ which aws
$ aws --version
There are several ways to configure the AWS CLI and credentials in an enterprise setting to enhance security.
As of 2023, AWS recommends managing access centrally using the IAM Identity Center. While it is still possible to manage access using traditional IAM methods (i.e., with long-term credentials), current AWS documentation encourages transitioning to IAM Identity Center for improved security and efficiency.
The steps in this guide apply regardless of whether we are using the traditional IAM method or the IAM Identity Center. As long as we have a user, either IAM or IAM Identity Center based, with the necessary permissions, the outlined steps can be followed.
For simplicity, though it violates the principle of least privilege, all resources can be provisioned using an administrator-level user. However, it’s important to remain vigilant about IAM and resource access management best practices, particularly in enterprise environments where security and access control are critical.
Terraform Modules
The Terraform modules (link) are organized as follows:
terraform
├── ecs_fargate
│ ├── backend.hcl
│ ├── ecs_fargate.tf
│ ├── main.tf
│ ├── outputs.tf
│ ├── variables.tf
│ └── variables.tfvars
├── iam
│ ├── backend.hcl
│ ├── iam.tf
│ ├── main.tf
│ ├── outputs.tf
│ ├── variables.tf
│ └── variables.tfvars
├── lambda_eventbridge
│ ├── backend.hcl
│ ├── lambda_eventbridge.tf
│ ├── main.tf
│ ├── variables.tf
│ └── variables.tfvars
├── s3_ecr
│ ├── backend.hcl
│ ├── ecr.tf
│ ├── main.tf
│ ├── outputs.tf
│ ├── s3.tf
│ ├── variables.tf
│ └── variables.tfvars
├── vpc_private
│ ├── backend.hcl
│ ├── main.tf
│ ├── outputs.tf
│ ├── variables.tf
│ ├── variables.tfvars
│ └── vpc.tf
└── vpc_public
├── backend.hcl
├── main.tf
├── outputs.tf
├── variables.tf
├── variables.tfvars
└── vpc.tf
The order for deploying the resources is the same as the order outlined in the AWS Console section above.
VPC: No dependencies on other modules.
Output:
- Public or private subnet IDs
- Security group ID
S3 & ECR: No dependencies on other modules.
- Build and push the Docker image to ECR.
- Manually upload the environment file and Lambda function code to S3, or use shell scripts to automate this step. This must be done initially and only once, before triggering the GitHub Actions workflows.
Output:
- ECR repository name and ARN
- S3 bucket name and ARN
IAM: Depends on the S3 & ECR module outputs.
- Add the following secrets to GitHub: `AWS_GITHUB_ACTIONS_ROLE_ARN`, `AWS_REGION`, `ECR_REPOSITORY`, `S3_BUCKET`.
- All subsequent deployments of the Lambda code and ECR images can be automated via the GitHub Actions workflows.
Output:
- Lambda execution role ARN
- ECS execution role ARN
- ECS task role ARN
- GitHub Actions role ARN
AWS Fargate: Depends on the IAM module and S3 & ECR module outputs.
Output:
- ECS Fargate cluster name
- ECS Fargate task definition family
- ECS Fargate container name
Lambda & EventBridge: Depends on all previous module outputs (IAM, S3 & ECR, VPC, AWS Fargate).
- Add the following secret to GitHub: `LAMBDA_FUNCTION`.
Configuration Files
main.tf
The `main.tf` files in each module define the main set of configurations. In this project, they specify the provider, backend, and data sources required for each deployment.
The `terraform_remote_state` data source retrieves root module output values from a previous deployment's state file stored on S3. For example, the `iam` module uses outputs like the ECR and S3 ARNs from the `s3_ecr` module to configure IAM roles and policies for the Lambda function and Fargate task.
variables.tf & variables.tfvars
These are the variable declaration and definition files, respectively. They specify the input variables required for the deployment and their default values. The `variables.tfvars` files contain the actual values for the variables. Example `variables.tfvars.examples` files are provided in each module directory.
outputs.tf
The `outputs.tf` files define the output values that are exported from the module. These outputs can be referenced in other modules or used to configure resources outside of Terraform.
backend.hcl
The `backend.hcl` files define the Terraform backend configuration to store state files in an S3 bucket created and managed separately from this project. This separation ensures state files are stored in a single location isolated from the resources they track, maintaining clear boundaries between resource provisioning (e.g., S3 bucket, IAM roles/policies) and state management.
+--------------------------+      +--------------------------+
|   State Management       |      |   Terraform Managed      |
|   Bucket                 |      |   Resources              |
|                          |      |                          |
| +----------------------+ |      | +----------------------+ |
| | terraform/           | |      | | VPC                  | |
| | vpc.tfstate          | |      | | (network resources)  | |
| +----------------------+ |      | +----------------------+ |
| +----------------------+ |      | +----------------------+ |
| | terraform/           | |      | | IAM                  | |
| | iam.tfstate          | |      | | (roles, policies)    | |
| +----------------------+ |      | +----------------------+ |
| +----------------------+ |      | +----------------------+ |
| | terraform/           | |      | | Lambda & EventBridge | |
| | lambda_eventbridge.  | |      | | (apps and rules)     | |
| | tfstate              | |      | |                      | |
| +----------------------+ |      | +----------------------+ |
| +----------------------+ |      | +----------------------+ |
| | terraform/           | |      | | S3/ECR               | |
| | s3_ecr.tfstate       | |      | | (storage resources)  | |
| +----------------------+ |      | +----------------------+ |
| +----------------------+ |      | +----------------------+ |
| | terraform/           | |      | | AWS Fargate          | |
| | ecs_fargate.tfstate  | |      | | (compute resources)  | |
| +----------------------+ |      | +----------------------+ |
+--------------------------+      +--------------------------+
Deploying the Terraform Modules
Initialize the Terraform Modules:

- Navigate to the module directory, e.g., `cd terraform/vpc_public`.
- Run `terraform init -backend-config=backend.hcl` to initialize the module with the specified backend configuration.
- Validate or format the configuration files using `terraform validate` or `terraform fmt`.

Plan the Deployment:

- Run `terraform plan -var-file=variables.tfvars` to preview the changes that Terraform will make to the infrastructure.

Apply the Changes:

- Run `terraform apply -var-file=variables.tfvars` to apply the changes and deploy the resources.

Destroy the Resources:

- To destroy the resources, run `terraform destroy -var-file=variables.tfvars`. This will remove all resources created by Terraform.
Test Trigger the Lambda Function
To test the Lambda function, we can manually trigger it from the AWS console. The logs from the Lambda function and the Fargate task can both be viewed in CloudWatch.
- Configure a test event with a payload consisting of `{'env': 'dev'}`:
- View container logs in CloudWatch:
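The same test can also be run programmatically with `boto3`; the function name and log group below are placeholders:

```python
import json

import boto3

lambda_client = boto3.client("lambda")
logs = boto3.client("logs")

# Invoke the function with the dev override (placeholder function name).
response = lambda_client.invoke(
    FunctionName="etf-kpis-scraper-trigger",
    Payload=json.dumps({"env": "dev"}),
)
print(json.loads(response["Payload"].read()))

# Tail recent container log events written by the Fargate task
# (placeholder log group name).
events = logs.filter_log_events(logGroupName="/ecs/etf-kpis-scraper", limit=20)
for event in events["events"]:
    print(event["message"])
```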
Wrapping Up
ETFs are a great, beginner-friendly way to build diversified portfolios before we gain the confidence to manage our own portfolios more actively.
By automating the data collection and storage with AWS services and Python, we can ensure up-to-date and accurate information with minimal manual effort. This allows us to focus on analyzing the data and making informed investment decisions.
Finally, all source files are available in the following repository.