Our experience running an AI workload in Kubernetes - Part 4 <em>The Scaling Challenges</em>
KUBERNETES, KubeRay, RayService, AI Workloads, MULTI-CLOUD Jakub Hlavacka


In the previous part of this series, we walked through our migration from the RayCluster CRD to the RayService CRD. To complete the picture, this post covers the challenges we faced and the improvements we made to our setup running in a cost-optimized multi-cloud Kubernetes cluster.

Read More

Our experience running an AI workload in Kubernetes – Part 3 <em>Migration to RayService</em>

Brief outages caused by Ray head node restarts were no longer acceptable. In this post, we dive into our migration from the RayCluster CRD to the RayService CRD, which enabled rolling updates, external GCS storage, and more. We share how we tackled challenges such as unpredictable deployments, slow Ray worker node start-up, and ensuring high availability with Dragonfly. If you want to understand how to make Ray workloads more resilient, predictable, and production-ready on Kubernetes, this post walks through our practical solutions and lessons learned.
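For readers unfamiliar with the RayService CRD, it pairs a Ray Serve config with a managed RayCluster in a single resource. A minimal illustrative manifest is sketched below; the names, image tag, and import path are placeholders for illustration, not our production configuration:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice
spec:
  # Ray Serve applications to deploy onto the managed cluster
  serveConfigV2: |
    applications:
      - name: example-app
        import_path: app:deployment
  # The RayCluster that KubeRay creates and rolls over on upgrades
  rayClusterConfig:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - groupName: workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
```

With this shape, the KubeRay operator can stand up a replacement RayCluster and switch traffic over, which is what makes the rolling updates described above possible.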

Read More

Our experience running an AI workload in Kubernetes – Part 2 <em>Limitations & Pitfalls of our solution with RayCluster CRD</em>

In this part of our series, we share the challenges we faced running Ray Serve Deployments in production using the RayCluster CRD. Along the way, we tackled issues like ephemeral head nodes, RayCluster’s autoscaling quirks, and the limitations of rolling updates. If you’re curious about bridging the gap between traditional Kubernetes workloads and the unique demands of AI applications on Ray, this post dives deep into using the RayCluster CRD in K8s.

Read More

Our experience running an AI workload in Kubernetes – Part 1 <em>Lift & Shift Ray applications to K8s</em>

In this post, we share our hands-on experience helping our client, Mixedbread, run their AI applications on Kubernetes using the KubeRay Operator. During the migration from a hyperscaler to a multi-cloud environment powered by claudie.io, we cut infrastructure costs by 70% while tackling challenges around RayCluster resilience and Ray Serve Deployments.

Read More