Talent.com
Distinguished Engineer, Observability, Monitoring, and Remediation
NVIDIA • Santa Clara, CA, United States
Job type: Full-time

Job description

NVIDIA is leading the industry in delivering accelerated computing in cloud and enterprise environments. We’re a team of innovative engineers dedicated to solving some of the world’s biggest challenges, constantly driving advancements, and impacting millions of lives worldwide!

As a technology leader at NVIDIA, you will own the development of the DGX Cloud strategy for observability, monitoring, and remediation across all layers of infrastructure, IaaS, platforms, and applications. You will define and drive the technical strategy for collecting, storing, and analyzing observability data and for developing models from it, including the technical strategy for issue detection and notification. Most importantly, you will develop auto-remediation strategies to detect, fix, validate, and restore components to service across all layers and in multiple systems. You will work cross-functionally with NVIDIA leadership to build accelerated computing infrastructure that gives customers the highest availability and operational standards without compromising on performance or developer experience.

What You’ll Be Doing:

Architectural Work: Define and drive the technical implementation of DGX Cloud offerings in the observability, monitoring, and remediation practice.

Collaborate Across Domains: Drive awareness of these technical capabilities and integrate their technical strategy into DGX Cloud engineering practices.

Accelerate Integration: Guide technical delivery into DGX Cloud systems across all delivery environments: enterprise, public cloud, and high-security or isolated deployments.

Engage with External Stakeholders: Collaborate with customers, infrastructure providers, and strategic partners to ensure NVIDIA’s solutions set industry standards for availability and operational excellence of accelerated computing infrastructure and platforms.

Enablement: Ensure that DGX Cloud, our customers, and partners achieve operational excellence across all environments.

Full Software and System Lifecycle: Lead all technical aspects of planning and continuous evolution across a large technical scope, from ideation, design, and development to continuous deployment, operations, and full lifecycle management.

What We Need to See:

18+ years overall in technical roles with a focus on observability and monitoring for cloud infrastructure, platforms, and applications.

5+ years of experience in a lead role.

BS, MS, or higher degree (or equivalent experience) in systems or software engineering or a related engineering field.

Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, higher-level abstractions (IaaS, Kubernetes, Slurm), and AI/ML platforms and applications.

Demonstrated success delivering high-impact, technically sophisticated solutions that achieve high levels of transparency into resource utilization, performance, and operational insights.

Technical Leadership: Ability to synthesize cross-functional needs into architecture and design while guiding internal execution across complementary teams.

Communication and Partnership: Strong collaboration and influence skills, with the ability to lead engineering engagements, present to peers and partners, and work with high-performance accelerated computing customers.

Ways to Stand Out from the Crowd:

Application of Artificial Intelligence: Real-world experience applying model development, RAG, MCP, and agentic AI solutions to observability data analytics, issue identification, and remediation.

Industry Expertise: Direct experience designing, developing, delivering, and operating highly available systems that scale up and out in enterprise and cloud environments.

Engineering Enablement: History of creating scalable processes and extensible systems that establish observability, monitoring, and remediation as foundational capabilities for engineers building infrastructure, IaaS, platforms, and applications.

Open Source Collaboration: Familiarity with open source ecosystems and projects in the observability and monitoring space. Ability to collaborate and exert influence in open source project governance, and to represent the interests of NVIDIA customers and partners in technical alignment and direction.

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, passionate and self-motivated, we want to hear from you!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 308,000 USD - 471,500 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 28, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
