Simplifying EKS Deployments and Management

BY Bill Shetti
Mar 18 2019
15 Min

Deploying applications into the cloud is the norm. The majority of these applications land on AWS, GCP, or Azure. In addition, more and more of them use containers and run on Kubernetes.

Kubernetes is becoming mainstream and a “mainstay” in many organizations. Adoption is growing, as is the number of options for running Kubernetes.

There are many Kubernetes choices to deploy your containerized application:

  1. Custom deployment solutions - from VMware Essential PKS, Kubespray, VMware Enterprise PKS, Stackpoint, etc.
  2. Turnkey solutions - AWS EKS, Azure AKS, GKE on GCP or on-prem, VMware Cloud PKS, etc.

Depending on your company’s needs, either one or both of these options may be used. There is no right or wrong option. Turnkey options are, obviously, the easier of the two because they provide:

  1. Pre-built and managed clusters
  2. Semi or fully automated cluster scaling
  3. Well-integrated security
  4. General Kubernetes conformance, and hence an ecosystem-friendly platform
  5. Usable endpoints for common CI/CD tools like Jenkins, Spinnaker, etc.
  6. Simple and easy update/upgrade options

But custom deployments give you unfettered control and customization. In addition, there are lots of open source tools that let you build an environment similar to a turnkey solution (e.g. Hashicorp Vault for key management, Heptio’s Velero for backup, etc.).

Given the trend to public cloud and the bulk of users on AWS, I will focus on what it takes to deploy and manage an application on AWS EKS, a common endpoint for deploying Kubernetes applications on AWS.

Operating on AWS is not simple. What it takes away in deploying and managing infrastructure (although this is also not that simple, as I will describe below), it adds back in complexity surrounding policies, roles, application security, and resource security. Hence managing anything in the cloud is more about policy and security than about application configuration.

AWS EKS eliminates the need to deploy and manage Kubernetes as an operator. But it still requires a significant number of steps to create a baseline environment before you can instantiate the cluster (vs GKE or Cloud PKS, which are my personal preferences for ease and sanity, but I will compare and contrast these in another blog).

I’ll cover the following in this blog:

  1. Solutions for deploying AWS EKS clusters - the hard way and an easy way to deploy an EKS cluster
  2. Making sense of access and control to the cluster using AWS IAM policies etc.
  3. How to manage the cluster via better observability, security, cost and resource management.

In highlighting these aspects, I will deploy the following application:

  1. AcmeShop - where you can buy the latest and greatest in fitness items and have them delivered by the good old-fashioned, eco-friendly tractor delivery method.

It’s a polyglot application my team and I built to highlight different features and capabilities surrounding applications on Kubernetes.

Deploying an AWS EKS Cluster - The Easy Way or the Hard Way

When using AWS EKS, the goal is to get to the kubectl command-line prompt as fast as possible. The faster you can get access to this, the faster your application will instantiate. Spending time spinning on IAM policies and AWS resource configurations is frustrating. The easy way I show below minimizes this as much as possible.

The Hard Way

To use AWS EKS, you will generally follow the steps outlined in the AWS EKS documentation.

AWS documentation is not known for being precise (at times it is, and at others it’s a bit behind changes to AWS). Hence there is always a small probability that the configuration steps are wrong.

Here are the steps to deploy an AWS EKS cluster per the documentation:

  1. Step 1 - Roles and user configuration:

You need to set up a proper role and use it consistently throughout the process. The two policies that need to be part of any such role are predefined in AWS and just need to be added to the role you define.


These policies amount to creating, describing and deleting EC2 instances, ELBs, subnets, and network interfaces.

Pitfalls: What AWS doesn’t emphasize (it is a bit hidden in the documentation) is that you must always use the same user when creating resources with this role. Otherwise you will have authentication issues when kubectl accesses the cluster.

For example, if you create your VPCs, instances, and EKS cluster as user “Bob”, make sure “Bob” is the AWS CLI user when using kubectl.

Heptio’s aws-iam-authenticator helps solve this problem, but may need some configuration depending on how you are actually using it.
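One way to catch this identity mismatch early is to compare the ARN the AWS CLI is currently using with the ARN that created the cluster. The sketch below assumes the AWS CLI is installed and configured. `aws sts get-caller-identity` is a standard CLI call, but the helper names, the sample output, and the example ARN are made up for illustration.

```python
import json
import subprocess


def current_identity_json() -> str:
    """Return the raw JSON printed by `aws sts get-caller-identity`."""
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def caller_arn(identity_json: str) -> str:
    """Extract the caller ARN from the get-caller-identity JSON."""
    return json.loads(identity_json)["Arn"]


def is_cluster_creator(identity_json: str, creator_arn: str) -> bool:
    """True when the CLI identity matches the identity that built the cluster."""
    return caller_arn(identity_json) == creator_arn


# Hypothetical sample of what the CLI prints for user "Bob".
SAMPLE = (
    '{"UserId": "AIDAEXAMPLE", "Account": "123456789012", '
    '"Arn": "arn:aws:iam::123456789012:user/Bob"}'
)
print(is_cluster_creator(SAMPLE, "arn:aws:iam::123456789012:user/Bob"))
```

Against a real account you would pass `current_identity_json()` instead of `SAMPLE`, and run the check before any kubectl command.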

  1. Step 2 - Create your infra:

Manually set up your VPC, then create 2 subnets that reside in 2 different AZs in the region, and of course manually create the security groups.

Or use their predefined CloudFormation (CF) template for the infra.

Pitfalls: manual setup is absolutely not an option, and using the CF template is great. However, the deployment of the roles in step 1 and options for things like autoscaling of node groups need to be added manually.

  1. Step 3 - Set up the EKS cluster:

Easy to set up, and it comes up quickly. Not many issues here as long as you follow the steps on the EKS setup pages.

  1. Step 4 - Set up kubectl:

Setting up kubectl is easy if you follow the AWS instructions.

However, you also need to set up aws-iam-authenticator, per the AWS documentation.

  1. Step 5 - Finally, run the following:

    aws eks --region region update-kubeconfig --name cluster_name

This sets the kubectl config with a key and lets you communicate with the cluster.

Pitfall: If the kubeconfig is improperly configured, you will experience access issues to the cluster from kubectl.

This happens when there are issues in step 1. Hence you might even have to redeploy the infra and the cluster from the ground up if the roles are different.

A simple 5-step process, right?

The Easy Way

Download the EKSCTL tool from Weaveworks.

Follow the instructions on the GitHub page to install it, then simply run:

eksctl create cluster [--name=<name>] [--region=<region>] [--nodes=<number of nodes>] [--kubeconfig=<path to local kubeconfig>] --ssh-access

A few more options exist than what is listed above. This command will:

  1. Create roles with the proper policies
  2. Create an environment (using CloudFormation) for the cluster to run in (VPC, private and public subnets, NAT gateway, IGW, properly populated security groups, etc.)
  3. Create the EKS cluster and the subsequent instances
  4. Set your kubeconfig and context


Happy YAMLing

The value is that everything EKSCTL builds is secure and self-contained. You don’t have to keep track of anything, except potentially the name of the cluster. EKSCTL:

  1. tags all the resources, so they can be easily connected via tags
  2. can configure autoscaling with a simple flag on the command line
  3. preloads the proper rules in the security groups
  4. properly sets up networking
  5. preloads your kubeconfig into the .kube directory

With this tool you can now:

  1. get to the kubectl command prompt in a matter of minutes for a fresh new cluster, vs spending say 30+ minutes on steps 1-5 in the hard way

  2. embed this tool into any CI/CD pipeline, which allows you, as a cloud administrator, to easily deploy clusters and give users access to them.
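For the CI/CD case, a pipeline step can assemble the EKSCTL command from parameters instead of hard-coding it. Here is a minimal Python sketch (3.8+ for `shlex.join`), assuming eksctl is on the PATH of the build agent. The flags used (`--name`, `--region`, `--nodes`, `--nodes-min`, `--nodes-max`, `--ssh-access`, `--kubeconfig`) are eksctl flags; the function names and the `dry_run` switch are my own illustration, not part of the tool.

```python
import shlex
import subprocess


def eksctl_create_args(name, region, nodes=2, nodes_min=None, nodes_max=None,
                       ssh_access=False, kubeconfig=None):
    """Build the argument list for `eksctl create cluster`."""
    args = ["eksctl", "create", "cluster",
            f"--name={name}", f"--region={region}", f"--nodes={nodes}"]
    # Size bounds for the node group's autoscaling group.
    if nodes_min is not None and nodes_max is not None:
        args += [f"--nodes-min={nodes_min}", f"--nodes-max={nodes_max}"]
    if ssh_access:
        args.append("--ssh-access")
    if kubeconfig:
        args.append(f"--kubeconfig={kubeconfig}")
    return args


def create_cluster(dry_run=True, **kwargs):
    """Print the command (dry run) or execute it and fail the step on error."""
    args = eksctl_create_args(**kwargs)
    if dry_run:
        print(shlex.join(args))
    else:
        subprocess.run(args, check=True)
    return args


# Dry run: only prints the command a pipeline step would execute.
create_cluster(name="demo", region="us-west-2", nodes=3,
               nodes_min=2, nodes_max=5, ssh_access=True)
```

Keeping the dry run as the default makes it harder to create (and pay for) a cluster by accident while wiring up the pipeline.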

Managing access to clusters via IAM

Once the cluster is up, only the user that set up the cluster has access by default. However, you can add one or more users to the cluster. There are a couple of known methods (neither is easy or simple at scale).

First method

Regardless of how you set up the cluster (the hard way or the easy way), the following configmap (aws-auth) exists on the cluster:

$ kubectl get configmap --all-namespaces
NAMESPACE     NAME                                 DATA   AGE
default       catalog-initdb-config                1      1d
default       users-initdb-config                  1      1d
kube-system   aws-auth                             1      1d
kube-system   coredns                              1      1d
kube-system   extension-apiserver-authentication   5      1d
kube-system   kube-proxy                           1      1d

The configmap will look like this:

$ kubectl describe configmap -n kube-system aws-auth
Name:         aws-auth
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

- groups:
  - system:bootstrappers
  - system:nodes
  rolearn: arn:aws:iam::123456789:role/eksctl-test2-nodegroup-ng-4a8d755-NodeInstanceRole-19F13V8IU0MOK
  username: system:node:{{EC2PrivateDNSName}}

By managing the data portion of the configmap, you can manage the users that have access to the cluster. The mapUsers section below is what you can add to the configmap, using the edit command:

kubectl edit -n kube-system configmap/aws-auth

Next, modify the users:

apiVersion: v1
data:
  mapRoles: |
    - rolearn: <arn>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::555555555555:user/admin
      username: admin
      groups:
        - system:masters
    - userarn: arn:aws:iam::111122223333:user/ops-user
      username: ops-user
      groups:
        - system:masters

By adding the appropriate ARNs to the configmap in a new mapUsers section, you can appropriately give access to different users on AWS.

The username and groups correspond to those used in Kubernetes Roles and RoleBindings.
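If you have more than a couple of users, typing these entries by hand gets error-prone. The following Python sketch renders a mapUsers block from a list of users. It only produces text for you to paste into the configmap; the function name and the tuple layout are assumptions of mine, and the ARNs are the same example ARNs used above.

```python
def render_map_users(users):
    """Render a mapUsers block for the aws-auth configmap.

    `users` is a list of (user_arn, username, groups) tuples.
    """
    lines = ["mapUsers: |"]
    for user_arn, username, groups in users:
        lines.append(f"  - userarn: {user_arn}")
        lines.append(f"    username: {username}")
        lines.append("    groups:")
        lines.extend(f"      - {group}" for group in groups)
    return "\n".join(lines)


print(render_map_users([
    ("arn:aws:iam::555555555555:user/admin", "admin", ["system:masters"]),
    ("arn:aws:iam::111122223333:user/ops-user", "ops-user", ["system:masters"]),
]))
```

The output starts at column zero, so indent it to sit under `data:` when pasting it into the configmap.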

The second method

A second way of adding permissions is to give each user access to a role via the sts:AssumeRole action.

Add the account that can “assume” access to the Trust Relationship for the IAM cluster roles:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT ID HERE:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}


For a given user (say “Bob”) you need to add a policy into their user profile.

Say the Cluster’s role arn is:


You would add this as a new policy under “Bob’s” user profile:

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::123456789:role/eksctl-test2-nodegroup-ng-4a8d755-NodeInstanceRole-19F13V8IU0MOK"
  }
}

Advantages and Pitfalls of both

  1. Method 1 is precise, but it’s cumbersome and could mean building a secondary system to keep track of all the clusters and who is attached to each. This method doesn’t scale.
  2. Method 2 is not precise, but gets the job done. Essentially “Bob” gets access to the cluster, BUT he can also get access to any cluster that uses the role he has a “trust relationship” with. This method also doesn’t scale.
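If you do script the second method, the two policy documents can be generated rather than hand-edited, which at least avoids malformed JSON. Below is a minimal Python sketch, assuming you only need the JSON text to paste into the IAM console or hand to your own tooling. The function names, the account ID, and the role ARN are placeholders I made up for illustration.

```python
import json


def trust_policy(account_id: str) -> str:
    """Trust relationship letting principals in `account_id` assume the role."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(document, indent=2)


def assume_role_policy(role_arn: str) -> str:
    """User-side policy allowing sts:AssumeRole on one specific role."""
    document = {
        "Version": "2012-10-17",
        "Statement": {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": role_arn,
        },
    }
    return json.dumps(document, indent=2)


# Placeholder account ID and role ARN, for illustration only.
print(trust_policy("111122223333"))
print(assume_role_policy("arn:aws:iam::111122223333:role/example-cluster-role"))
```

Generating the documents does not fix the scaling problem described above; it only removes one source of typos.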

We’ll talk more about the better methods in further blogs, e.g. GCP’s GKE or VMware’s Cloud PKS.

Managing the EKS cluster

Now that you have an AWS EKS cluster up and running, how are you going to manage it? How are your operations going to change to accommodate it?

In the last section we discussed the unsolved issue surrounding IAM and policy management for clusters. This component is the most important and integral part of your operations, and we will discuss ideas and solutions in upcoming blogs.

What about the other parts:

  1. Observability - covering metrics collection, logging, and tracing
  2. Security (beyond RBAC access of the clusters)
  3. Cost and resource Management

These three make up the basic blocks of Day 2 operations. While you can use AWS services such as CloudWatch, billing statements, GuardDuty, etc., these services alone don’t give you sufficient information or analysis without custom code and integrations.

I’ll cover how several “off-the-shelf” tools can help make your Day 2 operations around the EKS cluster simpler, leaving you time to actually manage rather than create and maintain custom code.

EKS Cost and Resource management

Understanding what you are using and spending on the cluster is critical for proper visibility. We’ll use CloudHealth by VMware to highlight how an “off-the-shelf” tool can provide proper visibility of the cluster and all its components. In addition, it will show you how to manage that spend on AWS appropriately.

When managing clusters on Kubernetes there are 2 schools of thought around operational boundaries:

  1. Use the namespace as an operational boundary to keep different developers segregated in a massive cluster
  2. Use the cluster as the operational boundary and give developers individual small yet scalable clusters (like GKE and Cloud PKS by VMware)

I’ll detail the pros and cons of each of these operational options in a future blog (this is a religious debate going on in the K8S community right now ;-). It really depends on the organizational needs and where you are in the Kubernetes life cycle. A great blog on this is from Jessie Fraz (

Regardless of operational boundary, we still need to see the usage and cost of a cluster and its namespaces. (I am personally biased toward more efficient, smaller clusters.)

Viewing the cluster and ALL its associated resources and overall cost

CloudHealth has a concept called Perspectives. Perspectives are a fairly flexible way of creating a specific “view” into a set of resources in AWS. In this specific case, we created an EKS cluster via Weaveworks EKSCTL.

This essentially did a few things:

  1. Created roles and IAM policies
  2. Created the infrastructure (VPCs, subnets, etc.) that allows you to run the cluster
  3. Created the actual EKS cluster

Using perspectives on CloudHealth by VMware, we can actually get a concise view of ALL the resources being consumed on AWS to support this EKS cluster.

Here you can see that I have created a Perspective detailing out my deployment of AcmeShop.

As you can see, the Perspective we created has two groups. We created one group to house all the resources CloudFormation created, and another for the EKS cluster itself.

If we drill into the “AWS CloudFormation Stack” group, we see that it contains almost ALL the resources being used by the EKS cluster. Essentially it shows all the resources CloudFormation created on setup.

As you can see from the capture:

  1. Several IAM roles were recreated
  2. Several IAM policies were recreated
  3. VPC
  4. Internet Gateway (not shown)
  5. EKS Cluster control plane object
  6. Subnets (not shown)
  7. Routing (routes, route tables) (not shown)
  8. Security groups and the security group rules (not shown)
  9. NAT gateway (not shown)
  10. EIPs (not shown)
  11. Autoscaling group (not shown)

In addition, we can also see from the second group (AWSAssets Tagged) the instances and the volumes attached to these instances in AWS EC2.

What can I do with this perspective?

Here is a simple view of historical cost for my specific EKS cluster: (obtained by filtering the entire AWS spend against the perspective we created.)

As you can see in the image, this cluster was turned on recently (prior to writing this blog), and it still does not have any EC2 transfer cost associated with it.

In addition, I can see cluster usage (this monitors the # of instances in the cluster).

With this perspective you can build many more reports, and create a governance policy against it.

Here I have created a policy to evaluate whether or not the cluster’s average inbound traffic is less than 15% for at least 2 days. This helps me determine if we have a drop in usage on the EKS cluster because users are not hitting our site.
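To make the logic of that policy concrete, here is a small Python sketch of the rule itself. The 15% threshold and the 2-day window come from the policy above; everything else is an assumption for illustration, including the function name and the idea that traffic is expressed as one daily average in percent. It is not how CloudHealth evaluates policies internally.

```python
def usage_dropped(daily_inbound_pct, threshold_pct=15.0, min_days=2):
    """True if the most recent `min_days` daily averages are all below the threshold.

    `daily_inbound_pct` is a list of daily average inbound traffic values,
    oldest first, expressed in percent (hypothetical unit).
    """
    if len(daily_inbound_pct) < min_days:
        return False  # Not enough history to decide.
    recent = daily_inbound_pct[-min_days:]
    return all(value < threshold_pct for value in recent)


print(usage_dropped([40.0, 12.0, 9.0]))   # last two days are both under 15%
print(usage_dropped([40.0, 12.0, 20.0]))  # the most recent day recovered
```

Requiring consecutive days below the threshold keeps a single quiet day from raising a false alarm.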

Viewing usage and cost on a namespace boundary for clusters

While I can get the total cluster information as noted above, I might also be carving up the cluster for multiple projects and/or users. If I am doing this, then CloudHealth by VMware provides a mechanism to properly allocate and show individual namespace usage and cost.

In order to achieve this, I need to load the CloudHealth container module. The container module collects some metrics (see the Wavefront section below for details on metrics), but more interestingly it helps CloudHealth associate the overall cost of the cluster (from the perspective detailed above) with individual namespaces.

Hence you can now get costs per namespace, which in turn are associated with individuals, projects, etc. (however you have carved up the cluster with namespaces).

My cluster has only 2 namespaces (it’s a simple app):

  1. default
  2. kube-system

I loaded up the container module from CloudHealth with a simple set of commands:

export CHT_API_TOKEN=XXXX
export CHT_CLUSTER_NAME=dude
kubectl create secret generic --namespace default --from-literal=api-token=$CHT_API_TOKEN --from-literal=cluster-name=$CHT_CLUSTER_NAME cloudhealth-config
kubectl create -f kubernetes-collector-pod-template.yaml

XXXX is your individual token from your CloudHealth instance.

Once it’s installed, I can then simply allocate the cluster cost to namespaces (see the cost distribution configuration below).

Once this is configured, I start to see my cost being allocated to different namespaces.
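Conceptually, allocating cluster cost to namespaces means splitting one bill according to some measure of usage. The Python sketch below shows a simple proportional split. The weighting (for example CPU requests per namespace) and the numbers are assumptions of mine, chosen only to illustrate the idea; CloudHealth’s actual cost distribution is configured in the product and may work differently.

```python
def allocate_cost(cluster_cost, usage_by_namespace):
    """Split `cluster_cost` across namespaces in proportion to their usage weight."""
    total = sum(usage_by_namespace.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {
        namespace: cluster_cost * usage / total
        for namespace, usage in usage_by_namespace.items()
    }


# Hypothetical weights, e.g. CPU cores requested per namespace.
shares = allocate_cost(100.0, {"default": 3.0, "kube-system": 1.0})
for namespace, cost in shares.items():
    print(f"{namespace}: {cost:.2f}")
```

Note that with a proportional split, system namespaces such as kube-system still show up with their own cost rather than being spread across application namespaces.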

EKS’ Security for the deployed clusters

In this particular deployment we used the Weaveworks EKSCTL tool, which does a solid job of creating the proper infrastructure. But how secure is our deployment?

One way to analyze this is to review all of the resources, including the policies, that were created by the EKSCTL tool. In our instantiation there were multiple subnets, routes, policies, roles, and security groups created for our cluster.

Reviewing these resources for security violations is tedious. You can use tools like AWS GuardDuty and/or AWS Threat Detector to provide some insight into potential issues. However, these tools:

  1. only give you specific issues without any sort of correlation or proper visualization capabilities (Example of the added setup needed to visualize GuardDuty detection -
  2. must potentially be enabled per account and/or per region (vs the entire set of accounts)

One great “off-the-shelf” tool is VMware Secure State. This SaaS service allows you to detect all violations across ALL AWS accounts in a single location, and it also allows you to visualize these issues in a more cohesive manner.

We added the account where we created the EKS cluster into VMware Secure State (VSS), and we immediately found that a public route had been created for access to the cluster through the IGW that was set up by the EKSCTL tool.

As you can see, VMware Secure State shows not only the issue (Public Instance) but also the components interconnected with the instance: the instance, the route, the route table, etc. AWESOME RIGHT? Get early access to VMware Secure State

In the picture above, we can find the route associated with the gateway using the associated route table identifier, rtb-02fb0391605589f24.

Next we can easily find the route table in AWS, and change the route to limit access.

Hence, with VMware Secure State we were able to not only see the resources attached to the violation, but also find the object and fix it.

Sign up for VMware Secure State here VMware Secure State Signup

Observability for EKS cluster

Finally, we need to get proper visibility into the cluster metrics and application metrics, logs, and be able to trace flows through the application. There are multiple tools to help achieve this.

  1. Metrics collection and analysis - Wavefront by VMware, Datadog, Prometheus (open source), etc.
  2. Logging - Splunk, Elasticsearch on AWS, Stackdriver, etc.
  3. Tracing - Zipkin, Jaeger, Datadog, and Wavefront by VMware.

Using any of the tools above is fairly simple, but you might need to use multiple tools to achieve a total view. One tool, Wavefront by VMware, already has functionality for metrics and traceability, and it’s adding logging.

  1. Tracing and logging will be discussed in another blog, as we will show how to properly configure not only the cluster but also the application. Without application level tracing (i.e. OpenTracing) it’s fairly uninformative. So look for a blog on application level tracing and logging soon.
  2. Application level metrics - Here is a great blog I wrote a few months ago on how to add application level metrics and have them show up on Wavefront. Adding the right components into the application is the hardest part (not getting them to show up on Wavefront). Application level monitoring

In this section, I will show you how simple it is to see the cluster level metrics and how to analyze them with Wavefront by VMware.

Once I have the EKS cluster up and my application deployed, I simply add the Wavefront proxy and collector into the cluster. The Wavefront proxy pulls metrics from kube-state-metrics, set up by Wavefront, and forwards this information to Wavefront by VMware.

What is the difference between metrics-server and kube-state-metrics? (great description here )

kube-state-metrics vs. metrics-server

The metrics-server is a project that has been inspired by Heapster and is implemented to serve the goals of the Kubernetes Monitoring Pipeline. It is a cluster level component which periodically scrapes metrics from all Kubernetes nodes served by Kubelet through Summary API. The metrics are aggregated, stored in memory and served in Metrics API format. The metrics-server stores the latest values only and is not responsible for forwarding metrics to third-party destinations.

kube-state-metrics is focused on generating completely new metrics from Kubernetes’ object state (e.g. metrics based on deployments, replica sets, etc.). It holds an entire snapshot of Kubernetes state in memory and continuously generates new metrics based off of it. And just like the metrics-server it too is not responsible for exporting its metrics anywhere.

Having kube-state-metrics as a separate project also enables access to these metrics from monitoring systems such as Prometheus.
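To give a feel for what a monitoring system consumes, kube-state-metrics exposes its metrics as plain text in the Prometheus exposition format. The Python sketch below parses a few lines of the `kube_pod_status_phase` metric and counts running pods per namespace. The sample text is hand-written for illustration (using pod names from my cluster), and the parsing is deliberately simplified; a real collector handles the full format.

```python
import re
from collections import Counter

# Hand-written sample in the Prometheus text format, for illustration only.
SAMPLE = """\
# HELP kube_pod_status_phase The pods current phase.
# TYPE kube_pod_status_phase gauge
kube_pod_status_phase{namespace="default",pod="cart-7b675b97f6-cqqmz",phase="Running"} 1
kube_pod_status_phase{namespace="default",pod="cart-7b675b97f6-cqqmz",phase="Pending"} 0
kube_pod_status_phase{namespace="kube-system",pod="coredns-5c466f5779-2dt84",phase="Running"} 1
"""

METRIC_LINE = re.compile(r'^kube_pod_status_phase\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def running_pods_by_namespace(text):
    """Count pods whose Running phase gauge is set to 1, grouped by namespace."""
    counts = Counter()
    for line in text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if not match:
            continue  # Skips comments, blank lines, and other metrics.
        labels = dict(LABEL.findall(match.group("labels")))
        if labels.get("phase") == "Running" and float(match.group("value")) == 1.0:
            counts[labels.get("namespace", "")] += 1
    return dict(counts)


print(running_pods_by_namespace(SAMPLE))
```

This is the kind of object-state information (pod phases, deployments, replica sets) that shows up in the dashboards later in this section.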

Setting it up yields the following additional pods and namespaces in the cluster, and setup takes minutes.

ubuntu@ip-172-31-35-91:~$ kubectl get pods --all-namespaces
NAMESPACE             NAME                                   READY   STATUS    RESTARTS   AGE
default               cart-7b675b97f6-cqqmz                  1/1     Running   0          7d
default               cart-redis-6b47fffc96-bxvm7            1/1     Running   0          7d
default               catalog-84f6f9787f-hrcxg               1/1     Running   0          7d
default               catalog-mongo-67545ff846-xqfpn         1/1     Running   0          7d
default               frontend-c56b64cf7-scxt9               1/1     Running   0          7d
default               order-5c89fd8f56-gt6nh                 1/1     Running   0          7d
default               order-mongo-7d4448967-jfkp9            1/1     Running   0          7d
default               payment-6b9df9cf4c-9ltrl               1/1     Running   0          7d
default               users-6b485d77f-2n4cd                  1/1     Running   0          7d
default               users-mongo-564c49788c-86lp8           1/1     Running   0          7d
default               wavefront-proxy-85f5c9d7d-vnppt        1/1     Running   0          16h
kube-system           aws-node-847qh                         1/1     Running   0          7d
kube-system           aws-node-fw55d                         1/1     Running   0          7d
kube-system           coredns-5c466f5779-2dt84               1/1     Running   0          7d
kube-system           coredns-5c466f5779-6jw85               1/1     Running   0          7d
kube-system           kube-proxy-j7hlx                       1/1     Running   0          7d
kube-system           kube-proxy-jqvnk                       1/1     Running   0          7d
kube-system           kube-state-metrics-545cc95555-9t486    2/2     Running   0          16h
wavefront-collector   wavefront-collector-795c94d796-ssfhx   1/1     Running   0          16h

Once the proxy and kube-state-metrics are up, metrics immediately show up on Wavefront by VMware, and I simply need to go to a predefined Kubernetes dashboard in Wavefront by VMware and select my cluster.

I have called my cluster AcmeShopEKSCluster in the Wavefront Proxy setup.

The resulting dashboard (overview) is:

I can further drill down into the cluster:

There is only so much real estate I can capture, but you can see all the

  1. namespaces
  2. nodes
  3. pods
  4. containers in the pods
  5. services

running on the cluster.

With Wavefront you can also tie these cluster metrics in with EC2 metrics. Remember the instances EKSCTL deployed? Wavefront by VMware will also enable you to tie in the metrics from these instances. Here you see the metrics for one of my instances.

Here are the nodes (instances on AWS EC2):

ubuntu@ip-172-31-35-91:~$ kubectl get nodes
NAME   STATUS   ROLES    AGE   VERSION
       Ready    <none>   7d    v1.11.5
       Ready    <none>   7d    v1.11.5

Here are the corresponding instance metrics in Wavefront by VMware

The image above shows Storage in AWS EC2, and I only had a specific amount of real estate, so here is part 2:


We’ve shown how you can properly deploy and manage an EKS Cluster on AWS.

  1. Solutions for deploying AWS EKS clusters - the hard way and an easy way to deploy an EKS cluster
  2. Making sense of access and control to the cluster using AWS IAM policies etc.
  3. How to manage the cluster via better observability, security, cost and resource management.

In addition we broke out operations into

  1. Cluster cost and resource management - via CloudHealth
  2. Cluster security on AWS - via VMware Secure State
  3. Observability - via Wavefront by VMware

We’ll follow up in further blogs on IAM management with users, and more on application level tracing, in the next few Kubernetes based blogs.