Production EKS with Terraform

Philip Borlin · Published in risertech · 9 min read · Jan 15, 2020

It is a tired tale: 15 websites, blogs, Stack Overflow questions, etc. later and you still haven’t pieced it together. This is where I found myself, but I don’t want you to go through that same pain. Here it is: the guide to getting EKS working for real, in production.

The examples in this post are written in Terraform 0.12.

What is EKS?

The Elastic Kubernetes Service (EKS) is a managed Kubernetes service. A Kubernetes installation has two parts: a control plane and a number of nodes.

The Control Plane

From the documentation:

The various parts of the Kubernetes Control Plane, such as the Kubernetes Master and kubelet processes, govern how Kubernetes communicates with your cluster. The Control Plane maintains a record of all of the Kubernetes Objects in the system, and runs continuous control loops to manage those objects’ state. At any given time, the Control Plane’s control loops will respond to changes in the cluster and work to make the actual state of all the objects in the system match the desired state that you provided.

Nodes

From the documentation:

The nodes in a cluster are the machines (VMs, physical servers, etc) that run your applications and cloud workflows. The Kubernetes master controls each node; you’ll rarely interact with nodes directly.

EKS provides you with a managed Control Plane. The machine(s) that make up the Control Plane are not visible to the owner of the cluster and cannot be reached or interacted with except through the kubectl command.

The nodes are set up by you and show up as AWS resources. You can attach security policies, control the networking, assign them to subnets, and generally have the same controls you have with any other EC2 resource. You can see and modify these resources through the CLI, API, and console just like any other EC2 resource. Once you have them set up, most of your interaction with them will be indirect: you issue API commands to the master and let Kubernetes use the nodes efficiently.

Assumptions

I assume you know how to work with Terraform to create AWS resources. I assume you have a VPC, subnets, an internet gateway, etc. already created in Terraform scripts. This tutorial is designed to help you with the EKS part. I provide a complete explanation of how to use Terraform’s Kubernetes provider so no prior knowledge is needed there. I also assume that you are familiar with creating pods and deploying services to Kubernetes.

Setting up EKS is a two-step process: first we create the cluster, which is the managed Kubernetes control plane, and second we create the nodes.

Setting up the managed Control Plane (EKS Cluster)

Before creating the cluster we first need to set up the role and security group.

Cluster security role

The role is pretty simple: it just states that EKS is allowed to assume it.
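A minimal sketch of what that role might look like (the resource and role names here are placeholders, not anything mandated by AWS):

```
resource "aws_iam_role" "eks_cluster" {
  name = "eks-cluster-role" # hypothetical name

  # Allow the EKS service to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "sts:AssumeRole"
        Principal = {
          Service = "eks.amazonaws.com"
        }
      }
    ]
  })
}
```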

Role policy attachments

These attachments grant the cluster the permissions it needs to take care of itself.
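As a sketch, the two AWS managed policies the cluster role typically needs are AmazonEKSClusterPolicy and AmazonEKSServicePolicy:

```
resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
  role       = aws_iam_role.eks_cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
}

resource "aws_iam_role_policy_attachment" "eks_service_policy" {
  role       = aws_iam_role.eks_cluster.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
}
```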

Next we need the security group that the cluster is going to run under.

Cluster security group

This sets the VPC the cluster will run under, gives it unfettered egress access, and limits ingress to the specified internal subnets and the VPN subnet.
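A sketch of that security group, assuming the VPC id and the internal/VPN CIDR blocks come from variables you already have (var.vpc_id, var.internal_subnet_cidrs, and var.vpn_subnet_cidr are placeholders) and that the cluster API is reached over 443:

```
resource "aws_security_group" "eks_cluster" {
  name   = "eks-cluster-sg"
  vpc_id = var.vpc_id

  # Unfettered egress
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Ingress only from the internal subnets and the VPN subnet
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = concat(var.internal_subnet_cidrs, [var.vpn_subnet_cidr])
  }
}
```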

Now we are ready to actually create the cluster.

Creates the cluster

You’ll notice that we reference the role and security group that we created above. We also restate the internal subnets referred to in our security group. Lastly we give the cluster a private IP address and disable public IP addresses. This means that DNS lookups from inside the VPC (whether from an EC2 box, a Docker container deployed on EKS, a machine on our VPN, etc.) will resolve to the private IP and everything will work correctly.
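A sketch of the cluster resource under those assumptions (the cluster name and version are placeholders, and var.internal_subnet_ids is assumed to hold your internal subnet ids):

```
resource "aws_eks_cluster" "main" {
  name     = "production"
  version  = "1.14" # hypothetical; pin to the version you intend to run
  role_arn = aws_iam_role.eks_cluster.arn

  vpc_config {
    security_group_ids      = [aws_security_group.eks_cluster.id]
    subnet_ids              = var.internal_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  # Make sure the permissions exist before the control plane tries to use them
  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_policy,
    aws_iam_role_policy_attachment.eks_service_policy,
  ]
}
```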

Setting up the worker nodes

Now that you have a fully functioning cluster up and running, it is time to spin up some worker nodes.

The pattern is going to start out the same. First we need to create a role that the worker nodes are going to assume.

Node security role

This looks very similar to the previous role, but this time it is EC2 rather than EKS that is allowed to assume it.
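A sketch of the node role, again with placeholder names:

```
resource "aws_iam_role" "eks_node" {
  name = "eks-node-role" # hypothetical name

  # Allow EC2 (the worker instances) to assume this role
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = "sts:AssumeRole"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}
```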

Role policy attachments

Here are the policy attachments for our node security role. You’ll notice there is a reference to “aws_iam_policy.alb-ingress.arn” which we haven’t set up yet. We’ll get to that when we start talking about the ALB ingress controller.
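A sketch of those attachments. The three AWS managed policies are the usual ones for self-managed EKS workers, and the last attachment points at the alb-ingress policy defined later:

```
resource "aws_iam_role_policy_attachment" "node_worker_policy" {
  role       = aws_iam_role.eks_node.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
}

resource "aws_iam_role_policy_attachment" "node_cni_policy" {
  role       = aws_iam_role.eks_node.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
}

resource "aws_iam_role_policy_attachment" "node_ecr_policy" {
  role       = aws_iam_role.eks_node.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
}

resource "aws_iam_role_policy_attachment" "node_alb_ingress" {
  role       = aws_iam_role.eks_node.name
  policy_arn = aws_iam_policy.alb-ingress.arn # defined in the ALB ingress section below
}
```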

Instance Profile

We need to wrap this role in an instance profile. You’ll notice that when we set up the launch configuration below, it takes an instance profile instead of a role.
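A sketch of that wrapper:

```
resource "aws_iam_instance_profile" "eks_node" {
  name = "eks-node-instance-profile" # hypothetical name
  role = aws_iam_role.eks_node.name
}
```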

Next we are going to set up our security group.

Node Security Group

This reinforces the VPC we are using and opens up egress to anywhere on the internet. If this were an internal EKS cluster we could limit the egress if needed.
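A sketch, with the same placeholder variables as before; the kubernetes.io/cluster tag is what lets EKS recognize resources that belong to this cluster:

```
resource "aws_security_group" "eks_node" {
  name   = "eks-node-sg"
  vpc_id = var.vpc_id

  # Unrestricted egress; tighten this for an internal-only cluster
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    "kubernetes.io/cluster/${aws_eks_cluster.main.name}" = "owned"
  }
}
```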

Security Group Rule

Our first security group rule is designed to open the ingress needed for the worker nodes to communicate with each other. In this case we leave all ports and protocols open but limit communication to our internal subnets. Notice we do not open this communication up to our VPN.
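A sketch of that rule, assuming var.internal_subnet_cidrs holds your internal CIDR blocks:

```
resource "aws_security_group_rule" "node_ingress_self" {
  description       = "Allow worker nodes to communicate with each other"
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1" # all ports and protocols
  cidr_blocks       = var.internal_subnet_cidrs
  security_group_id = aws_security_group.eks_node.id
}
```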

Security Group Rule

In this case we open up ingress so that the EKS control plane can talk to the workers. At this point in time AWS does not give us access to the IP ranges of the EKS control plane, so we open one port to the world. This open port may bother the security conscious, but remember that to authenticate with the service running on this port an attacker would need the private key used to encrypt the data.
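A sketch of that rule; the single open port is assumed here to be 443:

```
resource "aws_security_group_rule" "node_ingress_cluster" {
  description       = "Allow the EKS control plane to reach the workers"
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"] # AWS does not publish the control plane IP ranges
  security_group_id = aws_security_group.eks_node.id
}
```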

Next we are actually going to set up the nodes. This is going to be a four-step process. First we have to create the magic incantation that needs to be run the first time a new node comes up so it can join the EKS cluster. We are going to store this in a local for later use.

Magical incantation for joining EKS clusters
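A sketch of that local. It assumes you are using the EKS-optimized AMI, which ships /etc/eks/bootstrap.sh for joining the cluster on first boot:

```
locals {
  node_userdata = <<-USERDATA
    #!/bin/bash
    set -o xtrace
    /etc/eks/bootstrap.sh \
      --apiserver-endpoint '${aws_eks_cluster.main.endpoint}' \
      --b64-cluster-ca '${aws_eks_cluster.main.certificate_authority[0].data}' \
      '${aws_eks_cluster.main.name}'
  USERDATA
}
```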

Second, we set up a filter that searches for the latest AMI for the particular cluster version we are using.

Finds latest EKS AMI
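A sketch of that lookup; 602401143452 is the account that publishes the EKS-optimized AMIs in most commercial regions:

```
data "aws_ami" "eks_node" {
  most_recent = true
  owners      = ["602401143452"] # Amazon EKS AMI account (most regions)

  filter {
    name   = "name"
    values = ["amazon-eks-node-${aws_eks_cluster.main.version}-v*"]
  }
}
```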

After that we set up a launch configuration. Notice how we use the AMI id we found above as the image_id and pass the magical incantation to the user_data_base64 parameter. Setting the lifecycle to create_before_destroy protects us from the nightmare scenario of having too many worker nodes deleted before new ones are spun up. Feel free to change the instance_type to support your workload.

Launch Configuration
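A sketch of the launch configuration; the instance type is a placeholder:

```
resource "aws_launch_configuration" "eks_node" {
  name_prefix                 = "eks-node-"
  image_id                    = data.aws_ami.eks_node.id
  instance_type               = "m5.large" # placeholder; size for your workload
  iam_instance_profile        = aws_iam_instance_profile.eks_node.name
  security_groups             = [aws_security_group.eks_node.id]
  user_data_base64            = base64encode(local.node_userdata)
  associate_public_ip_address = false

  lifecycle {
    create_before_destroy = true
  }
}
```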

Lastly we set up an autoscaling group. Feel free to play with the numbers in the desired_capacity, max_size, and min_size parameters to support your use case.
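A sketch of the autoscaling group; the capacity numbers are placeholders, and the propagated tag is again what lets the cluster claim these instances:

```
resource "aws_autoscaling_group" "eks_node" {
  name                 = "eks-node-asg"
  launch_configuration = aws_launch_configuration.eks_node.id
  vpc_zone_identifier  = var.internal_subnet_ids
  desired_capacity     = 3
  min_size             = 3
  max_size             = 6

  tag {
    key                 = "kubernetes.io/cluster/${aws_eks_cluster.main.name}"
    value               = "owned"
    propagate_at_launch = true
  }
}
```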

Setting up the ALB Ingress Controller

Kubernetes does not provide a packaged way for traffic from outside the cluster to reach containers inside the cluster, but it does provide an interface that allows others to write services that provide this functionality: the Ingress Controller. There are a number of Ingress Controllers available, but since we are in the AWS world we are going to set up the ALB Ingress Controller. It has tight integration with the AWS security model and creates an ALB to handle reverse proxying.

ALB Ingress IAM Policy

Wow this is long. This is a Terraformed version of the policy file that can be found at https://raw.githubusercontent.com/kubernetes-sigs/aws-alb-ingress-controller/master/docs/examples/iam-policy.json. This is the example given in the ALB Ingress package. Feel free to check this file in case there are updates in the future.
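Rather than reproduce the whole policy inline here, a shorter sketch is to save the upstream iam-policy.json next to your configuration and load it with file() (the file name and location are assumptions):

```
resource "aws_iam_policy" "alb-ingress" {
  name   = "alb-ingress-controller"
  policy = file("${path.module}/iam-policy.json") # the JSON from the ALB ingress repo
}
```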

Up until now we have been using Terraform’s AWS provider and the setup has been AWS specific. Notice that we are now starting to use Terraform’s Kubernetes provider. At this point we are in Kubernetes land and managing it directly through Terraform. These are all Terraformed versions of the YAML files you would normally work with in the Kubernetes ecosystem.

Before we start using the Kubernetes provider we need to set it up.

Kubernetes Provider

You’ll notice that we don’t have to deal with files or statically defined credentials like the Terraform documentation suggests we should use. We can get everything right out of the aws_eks_cluster resource we created above. The load_config_file = false line is critical so the provider does not start looking for a config file on our file system.
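A sketch of the provider block; the endpoint and CA come from the aws_eks_cluster resource, and the token comes from the aws_eks_cluster_auth data source:

```
data "aws_eks_cluster_auth" "main" {
  name = aws_eks_cluster.main.name
}

provider "kubernetes" {
  host                   = aws_eks_cluster.main.endpoint
  cluster_ca_certificate = base64decode(aws_eks_cluster.main.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.main.token
  load_config_file       = false
}
```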

ALB Ingress Cluster Role

The first thing we need to do is to create a cluster role. Remember this is a Kubernetes role and not an AWS role. We include two rules, each of which sets up a set of privileges for a set of resources.
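A sketch of the cluster role, modeled on the RBAC example that ships with the ALB ingress controller (the exact verbs and resources may change between releases, so check the upstream manifest):

```
resource "kubernetes_cluster_role" "alb_ingress" {
  metadata {
    name = "alb-ingress-controller"
  }

  rule {
    api_groups = ["", "extensions"]
    resources  = ["configmaps", "endpoints", "events", "ingresses", "ingresses/status", "services"]
    verbs      = ["create", "get", "list", "update", "watch", "patch"]
  }

  rule {
    api_groups = ["", "extensions"]
    resources  = ["nodes", "pods", "secrets", "services", "namespaces"]
    verbs      = ["get", "list", "watch"]
  }
}
```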

ALB Ingress Cluster Role Binding

Next we bind the cluster role to the ingress controller’s service account in the kube-system namespace.
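A sketch of that binding:

```
resource "kubernetes_cluster_role_binding" "alb_ingress" {
  metadata {
    name = "alb-ingress-controller"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = kubernetes_cluster_role.alb_ingress.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = "alb-ingress-controller"
    namespace = "kube-system"
  }
}
```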

ALB Ingress Service Account

Next we create the service account. As of this writing automount_service_account_token doesn’t work correctly, but I left it in in case it begins working in the future.
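A sketch of the service account:

```
resource "kubernetes_service_account" "alb_ingress" {
  metadata {
    name      = "alb-ingress-controller"
    namespace = "kube-system"
  }

  automount_service_account_token = true
}
```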

ALB Ingress Deployment

Lastly we actually deploy the ALB ingress. The most important parts are the image, whose version you may want to update from time to time; the args, which should stay static; and the volume mount. The volume mount is supposed to automount based on your settings above, but here is how to set it up if automount does not get fixed.
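A sketch of the deployment. The image tag and the var.aws_region and var.vpc_id variables are assumptions; the explicit volume/volume_mount pair is the manual fallback for the automount issue mentioned above:

```
resource "kubernetes_deployment" "alb_ingress" {
  metadata {
    name      = "alb-ingress-controller"
    namespace = "kube-system"
    labels = {
      "app.kubernetes.io/name" = "alb-ingress-controller"
    }
  }

  spec {
    selector {
      match_labels = {
        "app.kubernetes.io/name" = "alb-ingress-controller"
      }
    }

    template {
      metadata {
        labels = {
          "app.kubernetes.io/name" = "alb-ingress-controller"
        }
      }

      spec {
        service_account_name = kubernetes_service_account.alb_ingress.metadata[0].name

        container {
          name  = "alb-ingress-controller"
          image = "docker.io/amazon/aws-alb-ingress-controller:v1.1.4" # check for newer tags

          args = [
            "--ingress-class=alb",
            "--cluster-name=${aws_eks_cluster.main.name}",
            "--aws-vpc-id=${var.vpc_id}",
            "--aws-region=${var.aws_region}",
          ]

          # Manual token mount in case automount_service_account_token is not honored
          volume_mount {
            name       = kubernetes_service_account.alb_ingress.default_secret_name
            mount_path = "/var/run/secrets/kubernetes.io/serviceaccount"
            read_only  = true
          }
        }

        volume {
          name = kubernetes_service_account.alb_ingress.default_secret_name

          secret {
            secret_name = kubernetes_service_account.alb_ingress.default_secret_name
          }
        }
      }
    }
  }
}
```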

Setting up Kubernetes Ingress

This is the Terraformed version of a Kubernetes ingress file. As of this writing every kubernetes_ingress resource you create will create an ALB. If you are interested in reducing the number of ALBs you have then it is recommended to put all ingress data in a single resource. There is an Ingress Group Feature under development that will allow you to share ALBs across different kubernetes_ingress resources but it seems to be stalled. You can follow the progress here: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/issues/914

Kubernetes Ingress

You will notice that this is set up to be an internet-facing ALB. We reaffirm the subnets that this applies to and then give it a certificate ARN in order to support HTTPS. Next we have some boilerplate for upgrading HTTP traffic to HTTPS using the ssl-redirect action built into the ALB ingress. In this example we add two hosts just to give an example of what that will look like. At the beginning of each host we have some boilerplate to provide the HTTP to HTTPS promotion and then typical Kubernetes path examples.
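A sketch of that ingress with two hosts. The backend service names and the var.public_subnet_ids variable are assumptions; the ssl-redirect path on each host is the boilerplate that promotes HTTP to HTTPS:

```
resource "kubernetes_ingress" "main" {
  metadata {
    name = "main-ingress"
    annotations = {
      "kubernetes.io/ingress.class"                    = "alb"
      "alb.ingress.kubernetes.io/scheme"               = "internet-facing"
      "alb.ingress.kubernetes.io/subnets"              = join(",", var.public_subnet_ids)
      "alb.ingress.kubernetes.io/certificate-arn"      = aws_acm_certificate.cert.arn
      "alb.ingress.kubernetes.io/listen-ports"         = "[{\"HTTP\": 80}, {\"HTTPS\": 443}]"
      "alb.ingress.kubernetes.io/actions.ssl-redirect" = "{\"Type\": \"redirect\", \"RedirectConfig\": {\"Protocol\": \"HTTPS\", \"Port\": \"443\", \"StatusCode\": \"HTTP_301\"}}"
    }
  }

  spec {
    rule {
      host = "app.example.com"
      http {
        path {
          path = "/*"
          backend {
            service_name = "ssl-redirect"
            service_port = "use-annotation"
          }
        }
        path {
          path = "/*"
          backend {
            service_name = "app-service" # hypothetical service
            service_port = 80
          }
        }
      }
    }

    rule {
      host = "api.example.com"
      http {
        path {
          path = "/*"
          backend {
            service_name = "ssl-redirect"
            service_port = "use-annotation"
          }
        }
        path {
          path = "/*"
          backend {
            service_name = "api-service" # hypothetical service
            service_port = 80
          }
        }
      }
    }
  }
}
```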

This next little bit shows how to use DNS with your Ingress.

ACM certificate

You may already have an SSL certificate, but here is how to do it from scratch. We used app.example.com and api.example.com in our examples above, and I assume there will be an example.com at some point. You may also create three separate certificates instead of a multi-domain certificate.
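A sketch of a multi-domain certificate; example.com is of course a placeholder for your own domain:

```
resource "aws_acm_certificate" "cert" {
  domain_name               = "example.com"
  subject_alternative_names = ["app.example.com", "api.example.com"]
  validation_method         = "DNS"

  lifecycle {
    create_before_destroy = true
  }
}
```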

ACM Certificate Validation Records

Notice how we used DNS validation above? This is how to set up the validation records so that a human being does not have to be involved in certificate installation and/or rotation.
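A sketch of those records, assuming a Route 53 hosted zone id in var.route53_zone_id and one validation record per domain on the certificate (three here):

```
resource "aws_route53_record" "cert_validation" {
  count   = 3 # one per domain on the certificate
  zone_id = var.route53_zone_id

  name    = aws_acm_certificate.cert.domain_validation_options[count.index].resource_record_name
  type    = aws_acm_certificate.cert.domain_validation_options[count.index].resource_record_type
  records = [aws_acm_certificate.cert.domain_validation_options[count.index].resource_record_value]
  ttl     = 60
}
```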

ACM Certificate Validation

Once the validation records are created above, this actually runs the validation.
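A sketch of the validation resource, which waits until the DNS records have proven ownership:

```
resource "aws_acm_certificate_validation" "cert" {
  certificate_arn         = aws_acm_certificate.cert.arn
  validation_record_fqdns = aws_route53_record.cert_validation[*].fqdn
}
```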

A Note about Ingress

The Kubernetes Ingress (not the ALB Ingress) we set up will cause some errors in the Kubernetes logs if we run it before we have deployed the containers it references. This is fine, and Kubernetes will continue to retry the Ingress at regular intervals (it seemed to run them about every 10 minutes for me). Once you deploy the containers specified in the Ingress file the errors will go away, but after the first deployment of those containers you may have up to a 10 minute wait before you can access them. Subsequent deploys of these containers will not have this problem.

Deploying Pods

Now that you have a cluster set up and can manage Ingress, the question is how should you deploy pods? You can certainly deploy them through Terraform, but you are going to have a nightmare of a time managing the fast-changing versions of containers that you develop in house. This leads to a pretty good rule of thumb: if you didn’t write it (like an ELK stack you are deploying), it is probably worth managing through Terraform. On the other hand, if you did write it, you probably want to manage deployment through your CI/CD pipeline outside of Terraform.

Deploying pods you developed internally through CI/CD gives dev teams the ability to manage their deployment.yaml, service.yaml, etc. files independently without having to go into the central Terraform files. This also allows them to do variable substitution on the version number assigned during the CI/CD pipeline.

If you really would like to keep internal dev deployment in Terraform then I would suggest you give each team/service its own Terraform module.

Running kubectl

The main tool for managing your cluster is kubectl, which authenticates to the correct cluster through information in your ~/.kube/config file. EKS provides a utility for keeping that file up to date with the correct information: aws eks update-kubeconfig.

Conclusion

The EKS setup required to get a production ready cluster working is pretty complex, but compared to the power and ease you are going to enjoy with your new Kubernetes cluster it is really worth it. Terraform gives you a nice Infrastructure as Code setup that can be checked into your favorite source code manager and run in different environments to provide the exact same infrastructure.

Need help with your devops journey into Infrastructure as Code (IaC)? Schedule a consultation at https://sleeptight.io
