Talent.com
ML Cluster Operations Engineer

TensorWave • Las Vegas, Nevada, United States
Job type: Full-time

ML Cluster Operations Engineer (Slurm / K8s)

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional Machine Learning Engineer who has made training and AI workload scheduling a specialty. This is a senior-level role for someone with significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.

As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.

Key Responsibilities:

Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.

Work closely with our engineering team to develop and maintain CI and automation for managed offerings.

Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.

Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.

Establish best practices for running jobs at scale, including monitoring, checkpointing, etc.

Mentor and upskill ML engineers in best practices.

Qualifications:

Must-Have:

5+ years of experience in cloud infrastructure, HPC, or machine learning roles.

Significant hands-on experience with Slurm in production HPC/ML environments, including understanding of setup/configuration, enroot (pyxis), modules, and MPI.

Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, MPI, etc.

Understanding of node lifecycle, including health checks, prolog/epilog scripts, and draining.

Deep understanding of security, compliance, and resilience in containerized workloads.

Nice-to-Have:

3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.

Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.

Experience with DAGs using K8s native tools such as Argo Workflows.

Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.

Experience with low-level kernel libraries, such as CUDA and Composable Kernel.

Contributions to open-source projects or ML/AI tooling.

What Success Looks Like

A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.

Infrastructure is resilient, secure, resource-optimized, and compliant.

Best practices and tooling are well-documented, standardized, and continuously improved across the company.

Make GPUs go Brrrrrrr

What We Bring:

Stock Options

100% paid Medical, Dental, and Vision insurance

Life and Voluntary Supplemental Insurance

Short Term Disability Insurance

Flexible Spending Account

401(k)

Flexible PTO

Paid Holidays

Parental Leave

Mental Health Benefits through Spring Health
