How to provision an ECS cluster and deploy a webapp on it with load-balanced Docker containers, using Ansible

I wrote a suite of Ansible playbooks to provision an ECS (Elastic Container Service) cluster on AWS, running a webapp deployed on Docker containers in the cluster and load balanced from an ALB (Application Load Balancer), with the Docker image for the app pulled from an ECR (Elastic Container Registry) repository.

This is a follow-up to my project/article “How to use Ansible to provision an EC2 instance with an app running in a Docker container”, which explains how to get a containerised Docker app running on a regular EC2 instance, using Docker Hub as the image repo. That could work well as a simple Staging environment, but for Production it’s desirable to be able to cluster and scale the containers behind a load balancer, so I came up with this solution for provisioning/deploying on ECS, which is well suited to that kind of flexibility. (To quote AWS: “Amazon ECS is a fully managed container orchestration service that makes it easy for you to deploy, manage, and scale containerized applications”.) This solution also uses Amazon’s own ECR for Docker images, rather than Docker Hub.

Overview

Firstly, a Docker image is built locally and pushed to a private ECR repository, then the EC2 SSH key and Security Groups are created. Next, a Target Group and corresponding ALB (the Application Load Balancer type of ELB) are provisioned, and an ECS container instance is launched on EC2 for the ECS cluster. Finally, the ECS cluster itself is provisioned, an ECS task definition is created to pull and launch containers from the Docker image in ECR, and an ECS Service is provisioned to run the webapp task on the cluster as per the Service definition.

This is an Ansible framework to serve as a basis for building Docker images for your webapp and deploying them as containers on Amazon ECS. It can be expanded in multiple ways, the most obvious being to increase the number of running containers and ECS instances, either with manual scaling or ideally by adding auto-scaling. (Have a look at my article “How to use Ansible for automated AWS provisioning” to see how to auto-scale EC2 instances with Ansible. To scale the containers, it’s simply a case of increasing the desired container count in the ECS Service definition; the rest is handled automatically via dynamic port mappings. See the provision_production.yml playbook below to learn more.)
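
For example, once the Service exists, you can scale out either by changing desired_count in provision_production.yml and re-running it, or ad hoc with the AWS CLI. The command below is a standard AWS CLI call rather than part of the playbooks, and assumes the default app name of simple-webapp:

# Ad hoc scale-out to two containers; the playbook's desired_count remains the source of truth
aws ecs update-service --cluster simple-webapp --service simple-webapp --desired-count 2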

CentOS 7 is used for the Docker container, but this can be changed to a different Linux distro if desired. Amazon Linux 2 is used for the ECS cluster instances on EC2.

I created a very basic Python webapp to use as an example for the deployment here, but you can replace that with your own webapp should you so wish.

N.B. Until you’ve tested this and honed it to your needs, run it in a completely separate environment for safety reasons; otherwise there is potential here for accidental destruction of parts of existing environments. Create a separate VPC specifically for this, or even use an entirely separate AWS account.

GitHub files

The playbooks and supporting files can be found in this repository on my GitHub.

Installation/setup

  1. You’ll need an AWS account with a VPC set up, and with a DNS domain set up in Route 53.
  2. Install and configure the latest version of the AWS CLI. The settings in the AWS CLI configuration files are needed by the Ansible modules in these playbooks. Also, the Ansible AWS modules aren’t perfect, so there are a few tasks which need to run the AWS CLI as a local external command. If you’re using a Mac, I’d recommend Homebrew as the simplest way of installing and managing the AWS CLI (see the example commands after this list).
  3. If you don’t already have it, you’ll need Python 3. You’ll also need the boto and boto3 Python modules (for Ansible modules and dynamic inventory) which can be installed via pip.
  4. Ansible needs to be installed and configured. Again, if you’re on a Mac, using Homebrew for this is probably best.
  5. Docker needs to be installed and running. For this it’s probably best to refer to the instructions on the Docker website.
  6. ECR Docker Credential Helper needs to be installed so that the local Docker daemon can authenticate with Elastic Container Registry in order to push images to a repository there. Follow the link for installation instructions (on a Mac, as usual, I’d recommend the Homebrew method).
  7. Copy etc/variables_template.yml to etc/variables.yml and update the static variables at the top for your own environment setup.
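
On a Mac with Homebrew, the prerequisite installation might look roughly like the following. This is a sketch rather than a definitive recipe, as package names and the pip invocation may vary on your system:

# AWS CLI, Ansible and the ECR Docker Credential Helper via Homebrew
brew install awscli ansible docker-credential-helper-ecr

# Python modules needed by the Ansible AWS modules and dynamic inventory
pip3 install boto boto3

# Configure the AWS CLI with your access key, secret key and default region
aws configure

# Copy the variables template and edit the static variables for your environment
cp etc/variables_template.yml etc/variables.yml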

Configuring ECR Docker Credential Helper

The method which worked best for me was to add a suitable “credHelpers” section to my ~/.docker/config.json file:

"credHelpers": {
    "000000000000.dkr.ecr.eu-west-2.amazonaws.com": "ecr-login"
}

(I’ve replaced my AWS account ID with zeros, but otherwise this is correct.)

So, for me, the whole ~/.docker/config.json ended up looking like this. Yours may not be quite the same but hopefully it clarifies how to add the “credHelpers” section near the end:

{
    "auths": {
        "000000000000.dkr.ecr.eu-west-2.amazonaws.com": {},
        "https://index.docker.io/v1/": {
            "auth": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
        }
    },
    "credHelpers": {
        "000000000000.dkr.ecr.eu-west-2.amazonaws.com": "ecr-login"
    }
}

If your AWS credentials are also set correctly, you should now have no trouble pushing Docker images to ECR repositories.

Usage

These playbooks are run in the standard way, i.e.:

ansible-playbook PLAYBOOK_NAME.yml

To deploy your own webapp instead of my basic Python app, you’ll need to modify build_push.yml so it pulls your own app from your repo, then you can edit the variables as needed in etc/variables.yml.
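
For reference, the variable names below are the ones referenced throughout the playbooks; the values are placeholders, and your copy of etc/variables.yml (created from etc/variables_template.yml) may differ, so treat this purely as an illustrative sketch. The dynamic variables at the bottom are filled in automatically by the playbooks:

# Static variables: edit these for your own environment
app_name: simple-webapp
vpc_id: vpc-00000000000000000
vpc_subnet_id_1: subnet-00000000000000000
vpc_subnet_id_2: subnet-11111111111111111
my_ip: 203.0.113.10
route53_zone: yourdomain.com
ec2_ecs_image_id: ami-00000000000000000

# Dynamic variables: updated automatically by the playbooks
ecr_repo:
ec2_sg_lb_id:
ec2_sg_app_id:
elb_dns:
elb_zone_id: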

Playbooks for build/provisioning/deployment

There are comments at key points in the playbooks to help further explain certain aspects of what is going on.

1. build_push.yml

Pulls the webapp from GitHub, builds a Docker image using docker/Dockerfile which runs the webapp, and pushes the image to a private ECR repository:

---
- name: Build Docker image and push to ECR repository
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Get app from GitHub
    git:
      repo: "https://github.com/mattbrock/simple-webapp.git"
      dest: "docker/{{ app_name }}"
      force: yes

  - name: Create Amazon ECR repository
    ecs_ecr:
      name: "{{ app_name }}"
    register: ecr_repo

  - name: Update variables file with repo URI
    lineinfile:
      path: etc/variables.yml
      regex: '^ecr_repo:'
      line: "ecr_repo: {{ ecr_repo.repository.repositoryUri }}"

  - name: Build Docker image and push to AWS ECR repository
    docker_image:
      build:
        path: ./docker
      name: "{{ ecr_repo.repository.repositoryUri }}:latest"
      push: yes
      source: build
      force_source: yes
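
If you want to confirm that the image has landed in ECR after running this playbook, an optional manual check (not part of the playbooks, and assuming the default app name of simple-webapp) is:

aws ecr list-images --repository-name simple-webapp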

2. provision_key_sg.yml

Provisions an EC2 SSH key, and Security Groups for ECS container instances and ELB:

---
- name: Provision SSH key, Security Groups and Application Load Balancer
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Create EC2 SSH key
    ec2_key:
      name: "{{ app_name }}"
    register: ec2_key

  - name: Save EC2 SSH key to file
    copy:
      content: "{{ ec2_key.key.private_key }}"
      dest: etc/ec2_key.pem
      mode: 0600
    when: ec2_key.changed

  - name: Create Security Group for Application Load Balancer
    ec2_group:
      name: Application Load Balancer
      description: EC2 VPC Security Group for Application Load Balancer
      vpc_id: "{{ vpc_id }}"
      rules:
      - proto: tcp
        ports: 80
        cidr_ip: 0.0.0.0/0
        rule_desc: Allow app access from everywhere
    register: ec2_sg_lb

  - name: Update variables file with Security Group ID
    lineinfile:
      path: etc/variables.yml
      regex: '^ec2_sg_lb_id:'
      line: "ec2_sg_lb_id: {{ ec2_sg_lb.group_id }}"
    when: ec2_sg_lb.changed

  - name: Create Security Group for ECS container instances
    ec2_group:
      name: ECS Container Instances
      description: EC2 VPC Security Group for ECS container instances
      vpc_id: "{{ vpc_id }}"
      rules:
      - proto: tcp
        ports: 0-65535
        group_id: "{{ ec2_sg_lb.group_id }}"
        rule_desc: Allow ELB access to containers
      - proto: tcp
        ports: 8080
        cidr_ip: "{{ my_ip }}/32"
        rule_desc: Allow direct app access from my IP
      - proto: tcp
        ports: 22
        cidr_ip: "{{ my_ip }}/32"
        rule_desc: Allow SSH from my IP
    register: ec2_sg_app

  - name: Update variables file with Security Group ID
    lineinfile:
      path: etc/variables.yml
      regex: '^ec2_sg_app_id:'
      line: "ec2_sg_app_id: {{ ec2_sg_app.group_id }}"
    when: ec2_sg_app.changed
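
As an optional manual check after running this playbook (not something the playbooks themselves do), you can confirm the key file was saved and look up the newly created Security Groups by name:

ls -l etc/ec2_key.pem
aws ec2 describe-security-groups --filters 'Name=group-name,Values=Application Load Balancer,ECS Container Instances' --query 'SecurityGroups[*].[GroupId,GroupName]' --output text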

3. provision_production.yml

Provisions a Target Group and associated ALB (the Application Load Balancer type of ELB) for load balancing the containers, sets up the IAM role for the ECS instances, launches an ECS container instance on EC2, provisions the ECS cluster, and creates the ECS task definition and Service so the webapp containers deploy on the cluster using the Docker image in ECR:

---
- name: Provision ECS cluster, task definition and service with Docker container, including Target Group + ALB, and ECS container instances
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Create Target Group
    elb_target_group:
      name: "{{ app_name }}"
      protocol: http
      port: 80
      vpc_id: "{{ vpc_id }}"
      state: present
      modify_targets: no
    register: target_group

  - name: Create Application Load Balancer
    elb_application_lb:
      name: "{{ app_name }}"
      security_groups: "{{ ec2_sg_lb_id }}"
      subnets:
      - "{{ vpc_subnet_id_1 }}"
      - "{{ vpc_subnet_id_2 }}"
      listeners:
      - Protocol: HTTP
        Port: 80
        DefaultActions:
        - Type: forward
          TargetGroupName: "{{ app_name }}"
        Rules:
        - Conditions:
          - Field: host-header
            Values:
            - "{{ route53_zone }}"
          Priority: '1'
          Actions:
          - Type: redirect
            RedirectConfig:
              Host: "www.{{ route53_zone }}"
              Protocol: "#{protocol}"
              Port: "#{port}"
              Path: "/#{path}"
              Query: "#{query}"
              StatusCode: "HTTP_301"
    register: load_balancer

  - name: Update variables file with ELB DNS
    lineinfile:
      path: etc/variables.yml
      regex: '^elb_dns:'
      line: "elb_dns: {{ load_balancer.dns_name }}"

  - name: Update variables file with ELB hosted zone ID
    lineinfile:
      path: etc/variables.yml
      regex: '^elb_zone_id:'
      line: "elb_zone_id: {{ load_balancer.canonical_hosted_zone_id }}"

  - name: Create ECS Instance Role for EC2 Production instances
    iam_role:
      name: ecsInstanceRole
      assume_role_policy_document: "{{ lookup('file','etc/ecs_instance_role_policy.json') }}"
      managed_policies: 
      - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      create_instance_profile: yes
      state: present
    register: ecs_instance_role
    
  # Couldn't find any way with Ansible plugins to link the role to the instance profile
  # so we do it like this. It's messy but it works
  - name: Link ECS Instance Role to Instance Profile
    command: aws iam add-role-to-instance-profile --role-name ecsInstanceRole --instance-profile-name ecsInstanceRole
    ignore_errors: yes

  # Specify the ECS Instance Role and add the User Data so ECS knows
  # to use this instance for the ECS cluster
  - name: Launch an ECS container instance on EC2 for the cluster to run tasks on
    ec2_instance:
      name: ECS
      key_name: "{{ app_name }}"
      vpc_subnet_id: "{{ vpc_subnet_id_1 }}"
      instance_type: t2.micro
      instance_role: "{{ ecs_instance_role.role_name }}"
      security_group: "{{ ec2_sg_app_id }}"
      network:
        assign_public_ip: true
      image_id: "{{ ec2_ecs_image_id }}"
      tags:
        Environment: Production
      user_data: |
        #!/bin/bash
        echo ECS_CLUSTER={{ app_name }} >> /etc/ecs/ecs.config
      wait: yes

  - name: Provision ECS cluster
    ecs_cluster:
      name: "{{ app_name }}"
      state: present

  # Set hostPort to 0 to enable dynamic port mappings from load balancer
  #
  # force_create ensures new revision when app has changed in repo
  # and causes service to redeploy as rolling deployment with new task revision
  - name: Create ECS task definition with dynamic port mappings from load balancer (setting hostPort to 0 to enable this)
    ecs_taskdefinition:
      family: "{{ app_name }}"
      containers:
      - name: "{{ app_name }}"
        image: "{{ ecr_repo }}:latest"
        memory: 128
        portMappings: 
        - containerPort: 8080
          hostPort: 0
      launch_type: EC2
      network_mode: default
      state: present
      force_create: yes

  - name: Pause is necessary before provisioning service, possibly for AWS to finish creating service-linked IAM role for ECS
    pause:
      seconds: 30

  - name: Provision ECS service
    ecs_service:
      name: "{{ app_name }}"
      cluster: "{{ app_name }}"
      task_definition: "{{ app_name }}"
      desired_count: 1
      launch_type: EC2
      scheduling_strategy: REPLICA
      load_balancers:
      - targetGroupArn: "{{ target_group.target_group_arn }}"
        containerName: "{{ app_name }}"
        containerPort: 8080
      state: present
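
Once this playbook has finished, it can take a minute or two for the container instance to register with the cluster and for the first task to start. As an optional manual check (again assuming the default app name of simple-webapp):

aws ecs list-container-instances --cluster simple-webapp
aws ecs list-tasks --cluster simple-webapp --service-name simple-webapp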

4. provision_dns.yml

Provisions the DNS in Route 53 for the ALB; note that it may take a few minutes for the DNS to propagate before it becomes usable:

---
- name: Provision DNS
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Add an alias record for root
    route53:
      state: present
      zone: "{{ route53_zone }}"
      record: "{{ route53_zone }}"
      type: A
      value: "{{ elb_dns }}"
      alias: yes
      alias_hosted_zone_id: "{{ elb_zone_id }}"
      alias_evaluate_target_health: yes
      overwrite: yes

  - name: Add an alias record for www.domain
    route53:
      state: present
      zone: "{{ route53_zone }}"
      record: "www.{{ route53_zone }}"
      type: A
      value: "{{ elb_dns }}"
      alias: yes
      alias_hosted_zone_id: "{{ elb_zone_id }}"
      alias_evaluate_target_health: yes
      overwrite: yes
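
To see whether the new records have propagated yet, a simple lookup of the two aliases created above will do (replace yourdomain.com with your own domain); once they resolve to the ALB, the site can be tested via the domain:

dig +short yourdomain.com
dig +short www.yourdomain.com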

Running order and outcome

The playbooks depend on each other, so running a later playbook without having run the earlier ones will fail due to missing components, variables and so on. Running all four playbooks in succession will set up the entire infrastructure from start to finish.
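
In other words, a complete build from nothing is just the four playbooks run in order:

ansible-playbook build_push.yml
ansible-playbook provision_key_sg.yml
ansible-playbook provision_production.yml
ansible-playbook provision_dns.yml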

Once everything is built successfully, the ECS service will attempt to run a task to deploy the webapp containers in the cluster. Below are instructions for how to check the service event log to see task deployment progress.

Redeployment

Once the environment is up and running, any changes to the app can be rebuilt and redeployed by running Steps 1 and 3 again. This makes use of the rolling deployment mechanism within ECS for a smooth automated transition to the new version of the app.
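
That is, after pushing your app changes to its repo:

ansible-playbook build_push.yml
ansible-playbook provision_production.yml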

Playbooks for deprovisioning

1. destroy_all.yml

Destroys the entire AWS infrastructure:

---
- name: Destroy entire infrastructure
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Delete DNS record for root
    route53:
      state: absent
      zone: "{{ route53_zone }}"
      record: "{{ route53_zone }}"
      type: A
      value: "{{ elb_dns }}"
      alias: yes
      alias_hosted_zone_id: "{{ elb_zone_id }}"
      alias_evaluate_target_health: yes

  - name: Delete DNS record for www.
    route53:
      state: absent
      zone: "{{ route53_zone }}"
      record: "www.{{ route53_zone }}"
      type: A
      value: "{{ elb_dns }}"
      alias: yes
      alias_hosted_zone_id: "{{ elb_zone_id }}"
      alias_evaluate_target_health: yes

  - name: Delete ECS service
    ecs_service:
      name: "{{ app_name }}"
      cluster: "{{ app_name }}"
      state: absent
      force_deletion: yes

  # Ansible AWS plugins didn't seem to offer a way of removing all revisions of a task definition
  # so we have to do it like this
  - name: Deregister all ECS task definitions
    shell: for taskdef in $(aws ecs list-task-definitions --query 'taskDefinitionArns[*]' --output text | grep {{ app_name }}) ; do aws ecs deregister-task-definition --task-definition $taskdef ; done

  - name: Delete Application Load Balancer
    elb_application_lb:
      name: "{{ app_name }}"
      state: absent

  - name: Delete Target Group
    elb_target_group:
      name: "{{ app_name }}"
      state: absent

  - name: Terminate all EC2 instances
    ec2_instance:
      state: absent
      filters:
        instance-state-name: running
        tag:Name: ECS
      wait: yes

  - name: Delete ECS cluster
    ecs_cluster:
      name: "{{ app_name }}"
      state: absent

  # Ansible AWS plugins apparently can't force-remove a repository, i.e.
  # remove a repository containing images, so we have to do it like this
  - name: Delete ECR repository
    shell: aws ecr delete-repository --repository-name {{ app_name }} --force
    ignore_errors: yes

  - name: Delete Security Group for ECS container instances
    ec2_group:
      group_id: "{{ ec2_sg_app_id }}"
      state: absent

  - name: Delete Security Group for load balancer
    ec2_group:
      group_id: "{{ ec2_sg_lb_id }}"
      state: absent

  - name: Delete EC2 SSH key
    ec2_key:
      name: "{{ app_name }}"
      state: absent

  - name: Delete ecsInstanceRole
    iam_role:
      name: ecsInstanceRole
      state: absent

  - name: Delete service-linked IAM role for ECS
    command: aws iam delete-service-linked-role --role-name AWSServiceRoleForECS
    ignore_errors: yes

  - name: Delete service-linked IAM role for ELB
    command: aws iam delete-service-linked-role --role-name AWSServiceRoleForElasticLoadBalancing
    ignore_errors: yes

2. delete_all.yml

Clears all dynamic variables in the etc/variables.yml file, deletes the local EC2 SSH key file, removes the local Docker image, and deletes the local webapp repo in the docker directory:

---
- name: Delete dynamic variables, SSH key file, local Docker image and local app repo
  hosts: localhost
  connection: local
  tasks:

  - name: Import variables
    include_vars: etc/variables.yml

  - name: Remove ELB DNS from variables file
    lineinfile:
      path: etc/variables.yml
      regex: '^elb_dns:'
      line: "elb_dns:"

  - name: Remove ELB Zone ID from variables file
    lineinfile:
      path: etc/variables.yml
      regex: '^elb_zone_id:'
      line: "elb_zone_id:"

  - name: Remove app Security Group from variables file
    lineinfile:
      path: etc/variables.yml
      regex: '^ec2_sg_app_id:'
      line: "ec2_sg_app_id:"

  - name: Remove LB Security Group from variables file
    lineinfile:
      path: etc/variables.yml
      regex: '^ec2_sg_lb_id:'
      line: "ec2_sg_lb_id:"

  - name: Remove ECR repo from variables file
    lineinfile:
      path: etc/variables.yml
      regex: '^ecr_repo:'
      line: "ecr_repo:"

  - name: Delete SSH key file
    file:
      path: etc/ec2_key.pem
      state: absent

  - name: Remove local Docker image
    docker_image:
      name: "{{ ecr_repo }}"
      state: absent
      force_absent: yes

  - name: Delete local app repo folder
    file:
      path: "./docker/{{ app_name }}"
      state: absent

Destruction/deletion notes

USE destroy_all.yml WITH EXTREME CAUTION! If you’re not operating in a completely separate environment, or if your shell is configured for the wrong AWS account, you could potentially cause serious damage with this. Always check before running that you are working in the correct isolated environment and that you are absolutely 100 percent sure you want to do this. Don’t say I didn’t warn you!
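
A quick way to confirm which AWS account and region your shell is currently pointed at before pulling the trigger (standard AWS CLI calls, not part of the playbooks):

aws sts get-caller-identity
aws configure get region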

Once everything has been fully destroyed, it’s safe to run the delete_all.yml playbook to clear out the variables file. Do not run this until you are sure everything has been fully destroyed, because the SSH key file cannot be recovered once it has been deleted.

Checking the Docker image in a local container

After building the Docker image in Step 1, if you want to run a local container from the image for initial testing purposes, you can use standard Docker commands for this:

docker run -d --name simple-webapp -p 8080:8080 $(grep ecr_repo etc/variables.yml | cut -d" " -f2):latest

You should then be able to make a request to the local container at:

http://localhost:8080/

To check the logs:

docker logs simple-webapp

To stop the container:

docker stop simple-webapp

To remove it:

docker rm simple-webapp

Checking deployment status, logs, etc.

To check the state of the deployment and see events in the service log (change “simple-webapp” to the name of your app, if different):

aws ecs describe-services --cluster simple-webapp --services simple-webapp --output text

This should show what’s happening on the cluster in terms of task deployment, and hopefully you’ll eventually see that the process successfully starts, registers on the load balancer, and completes deployment, at which point it should reach a “steady state”:

EVENTS  2022-02-23T13:04:39.900000+00:00        3a087c70-aaa3-47d5-ae31-040db688155a    (service simple-webapp) has reached a steady state.
EVENTS  2022-02-23T13:04:39.899000+00:00        c0785dae-154d-440b-b315-f948901d48fb    (service simple-webapp) (deployment ecs-svc/4617274246689568181) deployment completed.
EVENTS  2022-02-23T13:04:20.239000+00:00        c60ce4fa-e7a6-4776-907b-b931a166109a    (service simple-webapp) registered 1 targets in (target-group arn:aws:elasticloadbalancing:eu-west-2:000000000000:targetgroup/simple-webapp/2ec4fbc39edca3aa)
EVENTS  2022-02-23T13:03:50.185000+00:00        2e2c4570-2bb3-45f3-83e6-84b61b9c63bb    (service simple-webapp) has started 1 tasks: (task 8b8f8d2258a74885b58e610fbf19a2cc).

Check the webapp via the ALB (ELB):

curl http://$(grep elb_dns etc/variables.yml | cut -d" " -f2)

Check the webapp using DNS (once the DNS has propagated, and replacing yourdomain.com with the domain you are using):

curl http://www.yourdomain.com/

Get the container logs from running instances:

ansible -i etc/inventory.aws_ec2.yml -u ec2-user --private-key etc/ec2_key.pem tag_Environment_Production -m shell -a "docker ps | grep simple-webapp | cut -d\" \" -f1 | xargs docker logs"

You can also use that method to run ad hoc Ansible commands on the instances, e.g. uptime:

ansible -i etc/inventory.aws_ec2.yml -u ec2-user --private-key etc/ec2_key.pem tag_Environment_Production -m shell -a "uptime"
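
These ad hoc commands use the etc/inventory.aws_ec2.yml dynamic inventory file from the repo. For reference, an aws_ec2 inventory plugin configuration that produces tag-based groups such as tag_Environment_Production looks roughly like this (a sketch, not necessarily identical to the file in the repo):

plugin: aws_ec2
regions:
- eu-west-2
keyed_groups:
- key: tags
  prefix: tag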

If you need to SSH to the instance (assuming there’s only one instance):

ssh -i etc/ec2_key.pem ec2-user@$(aws ec2 describe-instances --filters "Name=tag:Environment,Values=Production" --query "Reservations[*].Instances[*].PublicDnsName" --output text)

For multiple instances, list the public DNS names as follows, then SSH to each individually as needed:

aws ec2 describe-instances --filters "Name=tag:Environment,Values=Production" --query "Reservations[*].Instances[*].PublicDnsName"

Final thoughts

I hope this is a helpful guide for building and running containerised Docker apps on ECS using Ansible. If you need help with any of the issues raised in this article, or with any other infrastructure, automation, DevOps or SysAdmin projects or tasks, don’t hesitate to get in touch regarding the freelance services I offer.