Overview

This guide walks you through creating an Amazon EKS cluster optimized for running Smallest Self-Host with GPU acceleration.

Prerequisites

1. AWS CLI

Install and configure AWS CLI:
aws --version
aws configure
2. eksctl

Install eksctl (EKS cluster management tool):
brew install eksctl
Verify:
eksctl version
3. kubectl

Install kubectl:
brew install kubectl
4. IAM Permissions

Ensure your AWS user/role has permissions to:
  • Create EKS clusters
  • Manage EC2 instances
  • Create IAM roles
  • Manage VPC resources
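
You can sanity-check the configured identity and simulate a few of these actions before creating anything (replace the ARN with your own user or role; the action names below are a representative subset, not an exhaustive list):
aws sts get-caller-identity

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::YOUR_ACCOUNT_ID:user/YOUR_USER \
  --action-names eks:CreateCluster ec2:RunInstances iam:CreateRole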

Cluster Configuration

Option 1: Quick Start with eksctl

Create a cluster with GPU nodes using a single command:
eksctl create cluster \
  --name smallest-cluster \
  --region us-east-1 \
  --version 1.28 \
  --nodegroup-name cpu-nodes \
  --node-type t3.large \
  --nodes 2 \
  --nodes-min 1 \
  --nodes-max 3 \
  --managed
Then add a GPU node group:
eksctl create nodegroup \
  --cluster smallest-cluster \
  --region us-east-1 \
  --name gpu-nodes \
  --node-type g5.xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 5 \
  --managed \
  --node-labels "workload=gpu,nvidia.com/gpu=true" \
  --node-taints "nvidia.com/gpu=true:NoSchedule"
This creates a cluster with separate CPU and GPU node groups, allowing for cost-effective scaling.
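
Because the GPU nodes are tainted, workloads that need a GPU must tolerate the taint and select the workload=gpu label. A minimal pod spec sketch (the container image is a placeholder):
gpu-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  nodeSelector:
    workload: gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: your-registry/your-gpu-image:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1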

Option 2: Using Cluster Config File

Create a cluster configuration file for more control:
cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: smallest-cluster
  region: us-east-1
  version: "1.28"

iam:
  withOIDC: true

managedNodeGroups:
  - name: cpu-nodes
    instanceType: t3.large
    minSize: 1
    maxSize: 3
    desiredCapacity: 2
    volumeSize: 50
    ssh:
      allow: false
    labels:
      workload: cpu
    tags:
      Environment: production
      Application: smallest-self-host

  - name: gpu-nodes
    instanceType: g5.xlarge
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
    volumeSize: 100
    ssh:
      allow: false
    labels:
      workload: gpu
      nvidia.com/gpu: "true"
      node.kubernetes.io/instance-type: g5.xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    tags:
      Environment: production
      Application: smallest-self-host
      NodeType: gpu
    iam:
      withAddonPolicies:
        autoScaler: true
        ebs: true
        efs: true

addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy
  - name: aws-ebs-csi-driver
Create the cluster:
eksctl create cluster -f cluster-config.yaml
Cluster creation takes 15-20 minutes. Monitor progress in the AWS CloudFormation console.
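
You can also watch the underlying CloudFormation stack from the CLI (eksctl names it eksctl-<cluster-name>-cluster):
aws cloudformation describe-stacks \
  --stack-name eksctl-smallest-cluster-cluster \
  --region us-east-1 \
  --query "Stacks[0].StackStatus"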

GPU Instance Types

Choose the right GPU instance type for your workload:
| Instance Type | GPU | VRAM | vCPUs | RAM | $/hour* | Recommended For |
| --- | --- | --- | --- | --- | --- | --- |
| g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | $1.00 | Development, testing |
| g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | $1.21 | Small production |
| g5.4xlarge | 1x A10G | 24 GB | 16 | 64 GB | $1.63 | Medium production |
| g5.12xlarge | 4x A10G | 96 GB | 48 | 192 GB | $5.67 | High-volume production |
| p3.2xlarge | 1x V100 | 16 GB | 8 | 61 GB | $3.06 | Legacy workloads |

*Approximate on-demand pricing in us-east-1; rates vary by region.
Recommendation: Start with g5.xlarge for development and testing. Scale to g5.2xlarge or higher for production.
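
Not every GPU instance type is offered in every Availability Zone. You can confirm availability in your region before creating the node group:
aws ec2 describe-instance-type-offerings \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g5.xlarge \
  --region us-east-1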

Verify Cluster

Check Cluster Status

eksctl get cluster --name smallest-cluster --region us-east-1

Verify Node Groups

eksctl get nodegroup --cluster smallest-cluster --region us-east-1

Configure kubectl

aws eks update-kubeconfig --name smallest-cluster --region us-east-1
Verify access:
kubectl get nodes
Expected output:
NAME                         STATUS   ROLES    AGE   VERSION
ip-xxx-cpu-1                 Ready    <none>   5m    v1.28.x
ip-xxx-cpu-2                 Ready    <none>   5m    v1.28.x
ip-xxx-gpu-1                 Ready    <none>   5m    v1.28.x

Verify GPU Nodes

Check GPU availability:
kubectl get nodes -l workload=gpu -o json | \
  jq '.items[].status.capacity'
Look for nvidia.com/gpu in the output:
{
  "cpu": "4",
  "memory": "15944904Ki",
  "nvidia.com/gpu": "1",
  "pods": "29"
}

Install NVIDIA Device Plugin

The NVIDIA device plugin enables GPU scheduling in Kubernetes. The Smallest Self-Host chart includes the NVIDIA GPU Operator. Enable it in your values:
values.yaml
gpu-operator:
  enabled: true
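
If you manage the chart with Helm, re-applying your values picks up the change (the release name, chart reference, and namespace below are placeholders for however you installed Smallest Self-Host):
helm upgrade --install smallest-self-host <smallest-chart-ref> \
  --namespace smallest \
  -f values.yaml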

Manual Installation

If installing separately:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Verify:
kubectl get pods -n kube-system | grep nvidia
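
Once the plugin pods are running, a quick smoke test is a short-lived pod that runs nvidia-smi on a GPU node (a sketch; adjust the CUDA image tag if needed):
gpu-smoke-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    workload: gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
Apply it and check the logs:
kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test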

Install EBS CSI Driver

Required for persistent volumes:

Using eksctl

eksctl create addon \
  --name aws-ebs-csi-driver \
  --cluster smallest-cluster \
  --region us-east-1
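
The driver also needs IAM permissions to create and attach EBS volumes. If you did not enable the ebs add-on policy in your cluster config, one approach is an IAM role for its service account using the AWS-managed policy (a sketch; pass the resulting role to the add-on with --service-account-role-arn):
eksctl create iamserviceaccount \
  --cluster smallest-cluster \
  --region us-east-1 \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-only \
  --approve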

Using AWS Console

  1. Navigate to EKS → Clusters → smallest-cluster → Add-ons
  2. Click “Add new”
  3. Select “Amazon EBS CSI Driver”
  4. Click “Add”

Verify EBS CSI Driver

kubectl get pods -n kube-system -l app=ebs-csi-controller
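
With the controller running, you can define a StorageClass backed by the driver for persistent volumes (a typical gp3 sketch; adjust parameters as needed):
gp3-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
Apply it with kubectl apply -f gp3-storageclass.yaml.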

Install EFS CSI Driver (Optional)

Recommended for shared model storage across pods.

Create IAM Policy

curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-efs-csi-driver/master/docs/iam-policy-example.json

aws iam create-policy \
  --policy-name AmazonEKS_EFS_CSI_Driver_Policy \
  --policy-document file://iam-policy.json

Create IAM Service Account

eksctl create iamserviceaccount \
  --cluster smallest-cluster \
  --region us-east-1 \
  --namespace kube-system \
  --name efs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::YOUR_ACCOUNT_ID:policy/AmazonEKS_EFS_CSI_Driver_Policy \
  --approve
Replace YOUR_ACCOUNT_ID with your AWS account ID.
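
You can look up the account ID with:
aws sts get-caller-identity --query Account --output text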

Install EFS CSI Driver

kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.7"
Verify:
kubectl get pods -n kube-system -l app=efs-csi-controller
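
The driver alone does not provision storage; you still need an EFS filesystem with mount targets in the cluster's subnets, plus a StorageClass that points at it. A sketch with a placeholder filesystem ID:
aws efs create-file-system \
  --region us-east-1 \
  --performance-mode generalPurpose \
  --tags Key=Name,Value=smallest-models
efs-storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-0123456789abcdef0  # replace with your filesystem ID
  directoryPerms: "700"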

Enable Cluster Autoscaler

See the Cluster Autoscaler guide for detailed setup. Quick setup:
eksctl create iamserviceaccount \
  --cluster smallest-cluster \
  --region us-east-1 \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn arn:aws:iam::aws:policy/AutoScalingFullAccess \
  --approve \
  --override-existing-serviceaccounts

Cost Optimization

Use Spot Instances for GPU Nodes

Reduce costs by up to 70% with Spot instances:
cluster-config.yaml
managedNodeGroups:
  - name: gpu-nodes-spot
    instanceTypes: ["g5.xlarge", "g5.2xlarge"]
    spot: true
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
For managed node groups, eksctl takes a list of instanceTypes together with spot: true; the finer-grained instancesDistribution settings (max price, allocation strategy) apply only to self-managed nodeGroups.
Spot instances can be interrupted with a two-minute warning. Ensure your application handles graceful shutdowns.
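
In practice that means giving pods time to drain: a generous terminationGracePeriodSeconds and, if needed, a preStop hook so in-flight requests can finish (a sketch; the image is a placeholder):
apiVersion: v1
kind: Pod
metadata:
  name: spot-friendly-example
spec:
  terminationGracePeriodSeconds: 90
  containers:
    - name: app
      image: your-registry/your-image:latest  # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 20"]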

Right-Size Node Groups

Start small and scale based on metrics:
managedNodeGroups:
  - name: gpu-nodes
    minSize: 0
    maxSize: 10
    desiredCapacity: 1
Set minSize: 0 to scale down to zero during off-hours.
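
You can also scale the GPU group down manually, for example outside working hours:
eksctl scale nodegroup \
  --cluster smallest-cluster \
  --region us-east-1 \
  --name gpu-nodes \
  --nodes 0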

Enable Cluster Autoscaler

Automatically adjust node count based on demand:
values.yaml
cluster-autoscaler:
  enabled: true
  autoDiscovery:
    clusterName: smallest-cluster
  awsRegion: us-east-1

Security Best Practices

Enable Private Endpoint

eksctl utils update-cluster-endpoints \
  --cluster smallest-cluster \
  --region us-east-1 \
  --private-access=true \
  --public-access=false \
  --approve
With public access disabled, kubectl must run from inside the cluster VPC (for example from a bastion host or over VPN).

Enable Logging

eksctl utils update-cluster-logging \
  --cluster smallest-cluster \
  --region us-east-1 \
  --enable-types all \
  --approve

Update Security Groups

Restrict inbound access to the API server:
aws ec2 describe-security-groups \
  --filters "Name=tag:aws:eks:cluster-name,Values=smallest-cluster"
Update rules to allow only specific IPs.
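
For example, to replace an open HTTPS rule with one scoped to your own CIDR (the security group ID and CIDR below are placeholders):
aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 --cidr 203.0.113.0/24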

Troubleshooting

GPU Nodes Not Ready

Check NVIDIA device plugin:
kubectl get pods -n kube-system | grep nvidia
kubectl describe node <gpu-node-name>

Pods Stuck in Pending

Check node capacity:
kubectl describe pod <pod-name>
kubectl get nodes -o json | jq '.items[].status.allocatable'

EBS Volumes Not Mounting

Verify EBS CSI driver:
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl logs -n kube-system -l app=ebs-csi-controller

What’s Next?