Talent.com
Senior AI and ML HPC Cluster Engineer
NVIDIA • Santa Clara, CA, United States
Job type: Full-time

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can tackle, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Make the choice to join us today!

As a member of the GPU AI/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a technical leader to identify architectural changes and/or completely new approaches for our GPU compute clusters. As an expert, you will help us with the strategic challenges we encounter, including: compute, networking, and storage design for large-scale, high-performance workloads; effective resource utilization in a heterogeneous compute environment; evolving our private/public cloud strategy; capacity modeling; and growth planning across our global computing environment.

What you'll be doing:

Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage.

Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions

Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud

Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs

Support our researchers in running their workloads, including performance analysis and optimization

Conduct root cause analysis and suggest corrective action

Proactively find and fix issues before they occur

What we need to see:

Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience

5+ years of experience designing and operating large-scale compute infrastructure

Experience with advanced AI/HPC job schedulers, such as Slurm, K8s, PBS, RTDA or LSF

Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions

Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt

In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud

Proficiency in Python programming and bash scripting

Applied experience with AI/HPC workflows that use MPI

Experience analyzing and tuning performance for a variety of AI/HPC workloads

Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields

Ways to stand out from the crowd:

Background with NVIDIA GPUs, CUDA programming, NCCL and MLPerf benchmarking

Experience with Machine Learning and Deep Learning concepts, algorithms and models

Familiarity with InfiniBand with IBOP and RDMA

Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads

Familiarity with deep learning frameworks like PyTorch and TensorFlow

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 136,000 USD - 212,750 USD for Level 3, and 168,000 USD - 264,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until October 22, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

