Candidates must be local to Indianapolis and available to work in the office 2-3 days per week.
Summary
We're seeking an experienced Infrastructure Architect to design, implement, and optimize NVIDIA DGX environments with a specialized focus on Run:ai orchestration. This role requires deep expertise in GPU-accelerated infrastructure and AI workload management to maximize resource efficiency and scalability.
Key Responsibilities
Architect DGX Solutions: Design and deploy NVIDIA DGX infrastructure. The role focuses primarily on solutions built around the DGX B300 platform, but strong experience with previous generations, such as the DGX H100 and H200, is highly relevant and valued. A key part of the role is integrating these DGX solutions with Run:ai for dynamic GPU orchestration.
Run:ai Implementation: Configure and manage Run:ai’s AI-native scheduling, resource pooling, and policy engine to optimize GPU utilization across hybrid environments (on-premises, cloud, edge).
Lifecycle Management: Oversee end-to-end AI workflows, from data preparation and model training through deployment, using Run:ai’s unified platform.
Access Control: Implement and maintain role-based access control (RBAC) using Run:ai’s predefined roles (e.g., System Admin, Department Admin) and scope-based permissions.
Performance Optimization: Monitor and tune cluster performance using Run:ai’s observability tools, ensuring maximal GPU throughput and minimal idle time.
Cross-functional Collaboration: Partner with data science and IT teams to align infrastructure capabilities with AI project requirements.
Required Qualifications
Technical Expertise:
10+ years of experience in Linux advanced compute environments.
Proficiency in NVIDIA DGX systems and Kubernetes-based orchestration.
Hands-on experience with Run:ai’s dynamic scheduling, policy engine, and KAI Scheduler.
Familiarity with hybrid/multi-cloud GPU resource management (AWS, GCP, Azure).
Operational Skills:
Ability to configure RBAC scopes (departments, projects) and workload prioritization in Run:ai.
Experience optimizing distributed AI training and inference workloads.
Proactive Outreach: Initiate and maintain contact with NVIDIA technical teams on an ongoing basis.
Clear Communication: Ensure clear and consistent communication channels for discussions related to bugs, technical updates, and other issues.
Certifications: NVIDIA DGX System or Run:ai certification preferred.
Preferred Experience
Deploying Run:ai in large-scale AI factories with 100+ GPUs.
Managing NVIDIA AI Enterprise software stacks.
Integrating Run:ai with MLOps pipelines for automated resource provisioning.
Familiarity with the NVIDIA Mission Control AI factory management platform (which includes NVIDIA Base Command Manager, Run:ai, and software features such as Autonomous Job Recovery, On-Demand Health Checks, and customizable dashboards).
Familiarity with SLURM for bare-metal or containerized access to the compute infrastructure.
Experience with Spectrum-X is a plus.
Infrastructure Architect • Indianapolis, IN