In-person + Virtual
November 6-9

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon North America 2023 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

Please note: This schedule is displayed in Central Standard Time (UTC -6). The schedule is subject to change.
Tuesday, November 7 • 5:25pm - 6:00pm
On-Demand Systems and Scaled Training Using the JobSet API - Abdullah Gharaibeh, Google & Vanessa Sochat, Lawrence Livermore National Laboratory

Orchestrating complex workflows with heterogeneous components presents challenges that are compounded in ephemeral environments. For example, training large ML models requires efficiently managing a significant number of expensive accelerators, and building on-demand HPC systems can mean composing applications and services. For both, efficient job orchestration is critical to ensure scalability and high resource utilization. This talk introduces the JobSet API (sigs.k8s.io/jobset), which lays the foundation for automating the setup of these designs. We will first demonstrate how JobSet is used to deploy training workloads using common frameworks like PyTorch, and present results from large-scale training experiments on thousands of TPU chips. We then show how JobSet can automate the arduous task of setting up HPC systems on demand and creating common environments for experimental comparison.

Speakers
Abdullah Gharaibeh

Staff Software Engineer, Google
Abdullah is a staff software engineer at Google and a co-chair of SIG Scheduling and the Batch Working Group. He works on Kubernetes and Google Kubernetes Engine, focusing on scheduling and batch workloads.
Vanessa Sochat

Computer Scientist, Lawrence Livermore National Laboratory
Vanessa is a Computer Scientist at Lawrence Livermore National Laboratory and has been a software engineer for fifteen years. She received her PhD from Stanford University and has done extensive work on container technologies, developer tools, and fostering open source communities. She founded...



Tuesday November 7, 2023 5:25pm - 6:00pm CST
W375CD (Level 3)
Emerging + Advanced