Inference with Claudie

In this blog post, we’ll walk through how to connect an on-premise server equipped with custom hardware to the AWS cloud provider, forming a hybrid cluster, and then use this infrastructure to run an AI workload.

Since our on-premise hardware sits behind a NAT and lacks a public IP address (a requirement for being part of a Claudie-managed infrastructure), we’ll use Tailscale to securely connect our local hardware to the rest of the infrastructure.

Prerequisites

  • An active AWS account and access credentials in the form of an access key and a secret key. For more info, see the docs.
    You can use any provider listed here. If your preferred provider isn’t listed, let us know and we’ll add it.

  • An active Tailscale account (the free plan is all we need)

  • kind or minikube binaries to run the Claudie management cluster (it can be deployed on a local machine)

  • On-premise hardware running Ubuntu 24.04 LTS Server with root access

  • kubectl installed for managing Kubernetes clusters.

Setting up the environment

First, we need to deploy Claudie in our kind-based Kubernetes management cluster (referred to as the management cluster from here on).

$ kind create cluster --name mgmt-cluster

Installing Claudie is straightforward — just run the following commands:

$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
$ kubectl apply -f https://github.com/berops/claudie/releases/download/v0.9.14/claudie.yaml

After a few seconds, you should see the Claudie pods running in the management cluster:

$ kubectl get pod -n claudie
NAME                                READY   STATUS    RESTARTS   AGE
ansibler-66f6b88fb5-wdzz4           1/1     Running   0          1m
builder-796f74f85f-8k4np            1/1     Running   0          1m
claudie-operator-5db8f6bf65-vj6c4   1/1     Running   0          1m
dynamodb-6888c86497-sch9d           1/1     Running   0          1m
kube-eleven-56bd55576-6n8bl         1/1     Running   0          1m
kuber-cc7755f45-2rjp2               1/1     Running   0          1m
manager-bdcb4dc58-bn6hb             1/1     Running   0          1m
minio-0                             1/1     Running   0          1m
minio-1                             1/1     Running   0          1m
minio-2                             1/1     Running   0          1m
minio-3                             1/1     Running   0          1m
mongodb-6d59c69c99-t2w65            1/1     Running   0          1m
terraformer-6485b876d6-46znn        1/1     Running   0          1m

Since our GPU server isn’t publicly accessible, we need to establish network connectivity between the management cluster and the on-premise GPU server. To achieve this, we set up a Tailscale VPN between the two environments. This allows the management cluster to communicate with the GPU server securely over a private network. We followed this tutorial to deploy Tailscale on both the GPU server and the computer running the kind management cluster.

After installing Tailscale, we bring up the VPN by running tailscale up on each machine and completing the authentication process via the link provided in the terminal.
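
For reference, on Ubuntu the per-machine steps look roughly like this (a sketch following the Tailscale quickstart; the linked tutorial covers the details):

$ curl -fsSL https://tailscale.com/install.sh | sh
$ sudo tailscale up       # prints an authentication URL to open in a browser
$ tailscale ip -4         # shows the Tailscale IP assigned to this machine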

Once authenticated, you should see both machines, your management cluster node and the GPU server, listed in the Tailscale dashboard as connected devices.

Tailscale dashboard

Next, we need to generate an SSH key pair on the machine running the management cluster and copy the public key to the GPU server’s root account. This enables passwordless SSH access between the two machines.

Once the public key is added to the GPU server’s authorized_keys, you should be able to SSH into it from the management cluster without being prompted for a password.
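
A minimal sketch of this step, assuming the GPU server’s Tailscale IP is 100.66.196.4 (the address used later in the manifest) and that root SSH login is enabled on it:

$ ssh-keygen -t rsa -b 4096 -m PEM -f private.pem -N ""   # creates private.pem and private.pem.pub
$ ssh-copy-id -i private.pem.pub root@100.66.196.4        # appends the public key to authorized_keys
$ ssh -i private.pem root@100.66.196.4 hostname           # verify passwordless access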

At this stage, Claudie is deployed on the Kubernetes management cluster, and secure SSH access to the on-premise GPU server is established. We can now move on to building a hybrid cloud with AWS.

Management cluster GPU server connection

Creating a hybrid-cloud cluster with Claudie

To create a hybrid-cloud cluster, the first step is to define the credentials for accessing the AWS provider within the management cluster.

$ kubectl create secret generic aws-secret \
 --namespace=claudie \
 --from-literal=accesskey='<YOUR AWS ACCESS KEY>' \
 --from-literal=secretkey='<YOUR AWS SECRET KEY>'

We also need to store the previously generated private SSH key as a Kubernetes secret in the management cluster.

$ kubectl create secret generic static-node-key \
 --namespace=claudie \
 --from-file=privatekey=private.pem

Following that, we define our infrastructure using a YAML manifest. This manifest serves as a template that Claudie will use to provision and configure the entire infrastructure.

apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: aws-hybrid-cloud
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
  # Previously defined credentials to AWS
    - name: aws
      providerType: aws
      templates:
        repository: "https://github.com/berops/claudie-config"
        path: "templates/terraformer/aws"
        tag: "v0.9.15"
      secretRef:
        name: aws-secret
        namespace: claudie
  nodePools:
    dynamic:
      - name: loadbalancer
        providerSpec:
          # Name of the provider instance.
          name: aws
          # Region of the nodepool.
          region: eu-central-1
          # Zone of the nodepool
          zone: eu-central-1a
        # For testing purposes, we define only one controller node
        count: 1
        # Machine type name.
        serverType: t3.medium
        # OS image name.
        image: ami-07eef52105e8a2059

      - name: control-node
        providerSpec:
          # Name of the provider instance.
          name: aws
          # Region of the nodepool.
          region: eu-central-1
          # Zone of the nodepool
          zone: eu-central-1a
        # For testing purposes, we define only one controller node
        count: 1
        # Machine type name.
        serverType: t3.medium
        # OS image name.
        image: ami-07eef52105e8a2059

      - name: compute-node
        providerSpec:
          # Name of the provider instance.
          name: aws
          # Region of the nodepool.
          region: eu-central-1
          # Zone of the nodepool
          zone: eu-central-1a
        # Define GPU autoscaling
        autoscaler:
          min: 0
          max: 20
        # GPU machine type name.
        serverType: g4dn.xlarge
        # OS image name
        image: ami-07eef52105e8a2059
        # Define num of GPUs serverType has
        machineSpec:
          nvidiaGpu: 1
    static:
      # On-premises hardware
      - name: compute-static
        nodes:
          # The IP address assigned to our GPU server by Tailscale.
          - endpoint: "100.66.196.4"
            secretRef:
              name: static-node-key
              namespace: claudie
  kubernetes:
    clusters:
      - name: hybrid-cluster
        # Deployed Kubernetes version
        version: v1.34.0
        # Internal hybrid-cloud network
        network: 192.168.2.0/24
        pools:
          control:
            - control-node
          compute:
            - compute-node
            - compute-static
  # Loadbalancer definition
  loadBalancers:
   roles:
     # Kubernetes API
     - name: apiserver
       protocol: tcp
       port: 6443
       targetPort: 6443
       targetPools:
           - control-node
     # API endpoint for AI model
     - name: llm-api
       protocol: tcp
       # Frontend port on LB
       port: 80
       # NodePort where LB will forward traffic to backends
       targetPort: 30143 
       targetPools:
           - compute-node
           - compute-static
   clusters:
     - name: apiserver-lb
       roles:
         - apiserver
         - llm-api
       # DNS configuration for our cluster
       dns:
         dnsZone: aws.e2e.claudie.io
         provider: aws
         hostname: demo-llm
       targetedK8s: hybrid-cluster
       pools:
         - loadbalancer

In the provided manifest, we specify the deployment of one load balancer node, one Kubernetes control plane node (without a GPU), and a dynamic pool of 0 to 20 AWS GPU instances that scales based on workload demand. Additionally, our on-premise GPU server is integrated into the cluster, forming a hybrid-cloud setup.

Notice that our domain will be hosted in the Amazon Route 53 service, and the name for our AI model endpoint will be demo-llm.aws.e2e.claudie.io. For this test deployment, we omitted HTTPS.

We deploy this manifest and wait for the deployment to complete. We can monitor the cluster’s status by running a watch command:

$ kubectl -n claudie create -f manifest.yaml
$ watch "kubectl get inputmanifests.claudie.io aws-hybrid-cloud -o jsonpath='{.status}' | jq ."

or, for more detailed information, follow the logs of the builder pod:

$ kubectl logs -n claudie -f -l app.kubernetes.io/name=builder

To view the deployed hybrid cloud cluster, we export the kubeconfig and then run kubectl get nodes.

$ kubectl get secrets \
-n claudie -l claudie.io/output=kubeconfig \
-o jsonpath='{.items[0].data.kubeconfig}' | base64 -d > my-super-cluster-kubeconfig.yaml

$ kubectl --kubeconfig ./my-super-cluster-kubeconfig.yaml get nodes -o wide
NAME                      STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
compute-static-01         Ready    <none>          7m51s   v1.34.0   192.168.2.2   <none>             Ubuntu 24.04.3 LTS   6.14.0-1014       containerd://1.7.27
control-node-ledzbpa-01   Ready    control-plane   10m     v1.34.0   192.168.2.1   <none>             Ubuntu 24.04.1 LTS   6.8.0-1021-aws    containerd://1.7.27

As shown in the output above, we have one AWS control plane node and our on-premise GPU server successfully connected, with communication established over the 192.168.2.0/24 network.

We don’t see any AWS GPU nodes yet because the autoscaler is configured with a minimum of 0 nodes, and so far, no workloads have been deployed.

Claudie hybrid-cloud infrastructure

Deploying the GPU operator

Once the InputManifest has been successfully built by Claudie, we deploy the GPU Operator, which discovers nodes containing GPUs and installs the necessary drivers and packages to run GPU workloads.

First, we create a namespace for the gpu-operator and label it so that the Pod Security enforcement policy is set to privileged.

$ kubectl create ns gpu-operator
$ kubectl label --overwrite ns gpu-operator \
   pod-security.kubernetes.io/enforce=privileged

Next, we add the NVIDIA Helm repository and install the operator.

$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install --wait --generate-name \
   --version v25.3.4 \
   -n gpu-operator --create-namespace \
   nvidia/gpu-operator

We wait for the pods in the gpu-operator namespace to be ready, and after that, we can check if GPUs can be used and on which node.

$ kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'
{
  "name": "compute-static-01",
  "gpus": "1"
}
{
  "name": "control-node-ledzbpa-01",
  "gpus": null
}

From the output above, we can see that the GPU Operator correctly identifies our on-premises node that has one GPU installed.
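
As an optional sanity check, we can run a short-lived pod that requests one GPU and prints nvidia-smi. This is a minimal sketch; the CUDA image tag is an assumption, and any CUDA base image that ships nvidia-smi will do.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Assumed image tag; replace with any available CUDA base image
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

$ kubectl apply -f gpu-check.yaml
$ kubectl logs gpu-check       # once the pod has completed, shows the nvidia-smi output
$ kubectl delete pod gpu-check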

Installing KServe

For deploying and testing machine learning models on Kubernetes, we will use the KServe open-source project. Alongside KServe, we will install Envoy Gateway and the Kubernetes Gateway API to manage traffic into the cluster.

$ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.1/standard-install.yaml
$ helm install eg oci://docker.io/envoyproxy/gateway-helm --version v1.5.0 -n envoy-gateway-system --create-namespace

Then, create a GatewayClass and Gateway resource to expose the InferenceService.

apiVersion: v1
kind: Namespace
metadata:
  labels:
    kubernetes.io/metadata.name: kserve
  name: kserve
---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-ingress-gateway
  namespace: kserve
spec:
  gatewayClassName: envoy
  infrastructure:
    labels:
      serving.kserve.io/gateway: kserve-ingress-gateway
    # Reference to the EnvoyProxy custom resource that defines NodePorts and other data plane settings
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: envoy-proxy
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: kserve-certificate
            namespace: kserve
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: envoy-proxy
  namespace: kserve
  labels:
    envoy-proxy: claudie-node-port
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        externalTrafficPolicy: Cluster
        labels: 
          envoy-proxy: claudie-node-port
        type: NodePort
        patch:
          type: StrategicMerge
          value:
            spec:
              ports:
                - name: http-80
                  nodePort: 30143
                  port: 80
                  protocol: TCP

The above manifest deploys a Service of type NodePort in the envoy-gateway-system namespace, through which traffic will later be forwarded to the KServe-served pods on port 80.
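
To confirm that the NodePort Service exists and exposes port 30143, we can list Services matching the envoy-proxy: claudie-node-port label set in the EnvoyProxy resource:

$ kubectl get svc -n envoy-gateway-system -l envoy-proxy=claudie-node-port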

Now we continue by installing the KServe CRDs and the KServe controller:

$ helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd \
  --version v0.15.0 \
  --namespace kserve
$ helm install kserve oci://ghcr.io/kserve/charts/kserve \
  --version v0.15.0 \
  --namespace kserve \
  --set kserve.controller.deploymentMode=RawDeployment \
  --set kserve.controller.gateway.ingressGateway.enableGatewayApi=true \
  --set kserve.controller.gateway.ingressGateway.kserveGateway=kserve/kserve-ingress-gateway \
  --set kserve.controller.gateway.domain=demo-llm.aws.e2e.claudie.io

This will deploy the kserve-controller-manager, which manages HTTPRoutes and horizontal pod autoscaling based on CPU usage. Later, we will configure it to monitor GPU utilization using KEDA.
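
Before moving on, a quick sanity check that the controller and our Gateway are up:

$ kubectl get pods -n kserve
$ kubectl get gateway -n kserve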

For more details on the KServe architecture used in our setup, see the official doc.

Deploying and autoscaling LLM

To enable dynamic cluster scaling, we need to deploy GPU-based workloads. As a demonstration, we’ll deploy the Qwen2.5-0.5B-Instruct-AWQ language model, configure the cluster to monitor GPU utilization, and observe the autoscaler in action.

Create an InferenceService

Creating the InferenceService deploys the Qwen model, which is served using KServe’s Hugging Face runtime with the vLLM backend for optimized performance. We deploy the InferenceService in a separate namespace called kserve-qwen.

$ kubectl create ns kserve-qwen

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "qwen-llm"
  namespace: kserve-qwen
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=qwen
      storageUri: "hf://Qwen/Qwen2.5-0.5B-Instruct-AWQ"
      resources:
        limits:
          cpu: "2"
          memory: 14Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 4Gi
          nvidia.com/gpu: "1"
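
We save the manifest (here as qwen-isvc.yaml, an arbitrary filename) and apply it:

$ kubectl apply -f qwen-isvc.yaml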

After applying the InferenceService, we wait until it reports a Ready status. This might take a few minutes.

$ kubectl get isvc -n kserve-qwen
NAMESPACE     NAME       URL                                                       READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
kserve-qwen   qwen-llm   http://qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io   True                                                                  9m48s

From the output above, we can see that the InferenceService generated the URL http://qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io, which we will later use to perform inference. However, we first need to create a CNAME record in our DNS provider (Route 53, in our case) that points to the A record demo-llm.aws.e2e.claudie.io.
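
With Route 53 as the DNS provider, the CNAME record can be created in the console or with the AWS CLI, roughly as follows (a sketch; <HOSTED_ZONE_ID> is the ID of the aws.e2e.claudie.io hosted zone):

$ aws route53 change-resource-record-sets \
    --hosted-zone-id <HOSTED_ZONE_ID> \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io",
          "Type": "CNAME",
          "TTL": 300,
          "ResourceRecords": [{"Value": "demo-llm.aws.e2e.claudie.io"}]
        }
      }]
    }'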

Even with the InferenceService deployed, we still won’t be able to call the LLM, because the Claudie load balancer uses the PROXY protocol to preserve the caller’s origin IP and port. For that reason, we also need to enable the PROXY protocol in Envoy Gateway.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: client-traffic-config
  namespace: kserve
spec:
  enableProxyProtocol: true
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: kserve-ingress-gateway
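
We save this as client-traffic-policy.yaml (an arbitrary filename) and apply it:

$ kubectl apply -f client-traffic-policy.yaml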

Once the ClientTrafficPolicy has been applied (this may take a few seconds), we can verify the inference functionality using a curl command that calls the previously created CNAME record.

$ curl -v -H "Content-Type: application/json"   http://qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io/openai/v1/chat/completions -d @./chat-input.json

We use the following chat-input.json file as the input.

$ cat <<EOF > "./chat-input.json"
{
  "model": "qwen",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that provides long answers."
    },
    {
      "role": "user",
      "content": "What is the difference between Kubernetes Deployment and StatefulSet"
    }
  ],
  "max_tokens": 1500,
  "temperature": 0.5,
  "stream": false
}
EOF

Along with the InferenceService, a HorizontalPodAutoscaler was also deployed. However, by default it only monitors CPU utilization, which is not adequate for our workload.

$ kubectl get hpa -n kserve-qwen
NAME                 REFERENCE                       TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
qwen-llm-predictor   Deployment/qwen-llm-predictor   cpu: 1%/80%   1         1         1          13m

Installing KEDA for GPU autoscaling

To enable cluster autoscaling based on GPU utilization, we’ll export metrics from the already-installed GPU Operator to Prometheus. When the defined utilization threshold is exceeded, additional pods will be scheduled, which, in turn, will trigger the provisioning of new nodes.

Following the KEDA installation guide, we add the Helm repo and install the chart:

$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update
$ helm install keda kedacore/keda \
--version 2.17.2 \
--namespace keda --create-namespace

To install Prometheus, we use the kube-prometheus-stack helm repo:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm inspect values prometheus-community/kube-prometheus-stack > prometheus-values.yaml

The last command writes a prometheus-values.yaml file, in which we modify and add the following configuration for the GPU Operator:

serviceMonitorSelectorNilUsesHelmValues: false 

additionalScrapeConfigs: |
 - job_name: gpu-metrics
   scrape_interval: 1s
   metrics_path: /metrics
   scheme: http
   kubernetes_sd_configs:
   - role: endpoints
     namespaces:
       names:
       - gpu-operator
   relabel_configs:
   - source_labels: [__meta_kubernetes_endpoints_name]
     action: drop
     regex: .*-node-feature-discovery-master
   - source_labels: [__meta_kubernetes_pod_node_name]
     action: replace
     target_label: kubernetes_node
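
Note that in the kube-prometheus-stack chart both of these keys live under prometheus.prometheusSpec, so the edited part of prometheus-values.yaml ends up nested roughly like this (abbreviated):

prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    additionalScrapeConfigs: |
      - job_name: gpu-metrics
        scrape_interval: 1s
        # ... rest of the scrape config shown above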

After that, we install kube-prometheus-stack with our additional GPU scrape config.

$ helm install prometheus-community/kube-prometheus-stack \
   --version 77.11.1 \
   --create-namespace --namespace prometheus \
   --generate-name \
   -f prometheus-values.yaml

Next, we define a ScaledObject (see the KEDA documentation for more details) to specify the metric to monitor, its threshold, and the conditions for autoscaling. For a full list of metrics available from the NVIDIA DCGM exporter, visit the official NVIDIA docs.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen-llm-predictor-scaled-object
  namespace: kserve-qwen
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: qwen-llm-predictor
  scaleTargetRef:
    name: qwen-llm-predictor
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.prometheus:9090
      metricName: gpu-util
      threshold: "60"
      query: sum(avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod=~"qwen-llm-predictor.*"}[30s]))

Applying the ScaledObject resource transfers ownership of the existing HPA to our ScaledObject and its custom triggers.

$ kubectl get hpa -n kserve-qwen
NAME                 REFERENCE                       TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
qwen-llm-predictor   Deployment/qwen-llm-predictor   cpu: 1%/80%   1         1         1          3h16m


$ kubectl apply -f  scaledobject.yaml
$ kubectl get hpa -n kserve-qwen
NAME                               REFERENCE                       TARGETS      MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-gpu-dcgm-scaled-object   Deployment/qwen-llm-predictor   0/95 (avg)   1         20        1          11m

System Behavior Under High Demand

With Claudie and the InferenceService deployed, alongside a ScaledObject configured to monitor GPU utilization per node, we can now run a workload that exercises the GPU. This will trigger Claudie’s autoscaler to scale out by adding new nodes.

We will simulate a high workload by running the previously defined curl command in a loop from multiple clients, while monitoring events in the kserve-qwen namespace.

$ curl -v -H "Content-Type: application/json"   http://qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io/openai/v1/chat/completions -d @./chat-input.json
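
To keep the GPU busy, we can wrap that request in a simple loop and run it from several terminals at once (a sketch, reusing the chat-input.json defined earlier):

$ for i in $(seq 1 100); do
    curl -s -H "Content-Type: application/json" \
      http://qwen-llm-kserve-qwen.demo-llm.aws.e2e.claudie.io/openai/v1/chat/completions \
      -d @./chat-input.json > /dev/null
  done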

$ kubectl events  -n kserve-qwen -w 

0s (x4 over 2m28s)       Normal    ScaledObjectReady              ScaledObject/qwen-llm-predictor-scaled-object   ScaledObject is ready for scaling
0s (x2 over 34s)         Warning   FailedScheduling               Pod/qwen-llm-predictor-7ddb55c8f4-lt9gq        0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
0s                       Normal    TriggeredScaleUp               Pod/qwen-llm-predictor-7ddb55c8f4-lt9gq        pod triggered scale-up: [{compute-node-dwmov7w 0->1 (max: 20)}]

By inspecting the autoscaler pod in our kind management cluster, we can see that Claudie detects the demand and initiates the scheduling of new GPU nodes.

$ kubectl logs -f autoscaler-hybrid-cluster-qpk0iqu-6ff7fc6b88-69hpd

2025-09-26T08:28:51Z INF Got NodeGroupIncreaseSize request for nodepool by 1 cluster=hybrid-cluster-qpk0iqu module=autoscaler-adapter-hybrid-cluster nodepool=compute-node-dwmov7w

Using the hybrid cluster’s kubeconfig, we can verify that we indeed have a new AWS node, compute-node-dwmov7w-01.

$ kubectl --kubeconfig ./my-super-cluster-kubeconfig.yaml  get node
NAME                      STATUS   ROLES           AGE    VERSION
compute-node-dwmov7w-01   Ready    <none>          107s   v1.32.0
compute-static-01         Ready    <none>          39m    v1.32.0
control-node-4ar7hnw-01   Ready    control-plane   41m    v1.32.0

Note that adding a new autoscaled node may take several minutes. This process involves creating a virtual machine, installing the required packages, and deploying the necessary DaemonSets, such as the GPU Operator, CNI, and CSI drivers.

The total provisioning time can also vary based on several factors, primarily network bandwidth and the disk performance of the attached virtual machine.

Wrapping up

In this post, we demonstrated how Claudie can be used to connect multiple nodes, whether they’re on-premise hardware or cloud-based VMs, into a single unified Kubernetes cluster.

With the help of the cluster autoscaler, we also showed how to scale up the cluster automatically. Scale-down is triggered when nodes remain unneeded for more than 10 minutes. You can find more details about how autoscaling works in Claudie in this article.

Next

Our experience running an AI workload in Kubernetes - Part 4: The Scaling Challenges