Talent.com
ML Cluster Operations Engineer
Tensorwave • Las Vegas, Nevada, United States
Job type: Full-time

ML Cluster Operations Engineer (Slurm/K8s)

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional Machine Learning Engineer who has made training and AI workload scheduling a specialty. This is a senior-level role for someone with significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.

As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.

Key Responsibilities:

Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.

Work closely with our engineering team to develop and maintain CI and automation for managed offerings.

Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.

Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.

Establish best practices for running jobs at scale, such as monitoring and checkpointing.

Mentor and upskill ML engineers in best practices.

Qualifications

Must-Have:

5+ years of experience in cloud infrastructure, HPC, or machine learning roles.

Significant hands-on experience with Slurm in production HPC/ML environments, including understanding of setup/configuration, enroot (pyxis), modules, and MPI.

Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, MPI, etc.

Understanding of node lifecycle, including health checks, prolog/epilog scripts, and draining.

Deep understanding of security, compliance, and resilience in containerized workloads.

Nice-to-Have:

3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.

Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.

Experience with DAGs using K8s native tools such as Argo Workflows.

Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.

Experience with low-level kernel libraries, such as CUDA and Composable Kernel.

Contributions to open-source projects or ML/AI tooling.

What Success Looks Like

A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.

Infrastructure is resilient, secure, resource-optimized, and compliant.

Best practices and tooling are well-documented, standardized, and continuously improved across the company.

Make GPUs go Brrrrrrr

What We Bring:

Stock Options

100% paid Medical, Dental, and Vision insurance

Life and Voluntary Supplemental Insurance

Short Term Disability Insurance

Flexible Spending Account

401(k)

Flexible PTO

Paid Holidays

Parental Leave

Mental Health Benefits through Spring Health
