Our experience running an AI workload in Kubernetes – Part 4: Challenges and Improvements
In the previous part of this series, we walked through our migration from the RayCluster CRD to the RayService CRD. To complete the picture, this post covers the challenges we faced and the improvements we made to our setup, which runs in a cost-optimized multi-cloud Kubernetes cluster.
Our experience running an AI workload in Kubernetes – Part 3: Migration to RayService
Brief outages caused by Ray head node restarts were no longer acceptable. In this post, we dive into our migration from the RayCluster CRD to the RayService CRD, which enabled rolling updates, external GCS storage, and more. We share how we tackled challenges such as unpredictable deployments and slow Ray worker node start-up, and how we ensured high availability with Dragonfly. If you want to understand how to make Ray workloads more resilient, predictable, and production-ready on Kubernetes, this post walks through our practical solutions and lessons learned.
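For context on what a RayService actually manages: its serveConfigV2 section points at a Ray Serve application defined in Python, and it is that application which KubeRay rolls out and updates. The sketch below is a minimal, hypothetical Serve application of that kind; the EchoModel class and its behaviour are our own illustration, not code from the series.

```python
# Minimal Ray Serve application sketch. A RayService manifest's serveConfigV2
# would reference this via an import path such as "my_module:app".
# EchoModel is a hypothetical deployment, not taken from the original posts.
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class EchoModel:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # A real deployment would run model inference here instead of echoing.
        return {"echo": payload}


# KubeRay resolves this bound application object from the import path
# declared in serveConfigV2 and handles its rolling updates.
app = EchoModel.bind()
```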

