Infrastructure engineer • San Mateo, CA
Software Engineer, Infrastructure • OpenAI • San Francisco; Seattle
Principal Software Engineer, Data Infrastructure • Roblox • San Mateo, CA, United States
Senior Staff Engineer, Core Infrastructure • Stripe • South San Francisco
ML Infrastructure Engineer (Staff / Principal) • Menlo Ventures • Burlingame, CA, United States
Machine Learning Engineer - Infrastructure • Fundamental Research Labs • Menlo Park, CA, United States
Software Engineer, Infrastructure • Datologyai • Redwood City, California, United States
Software Engineer, Frontend Infrastructure • Replit • Foster City, California, United States
Senior Infrastructure Engineer • Crunchbase • California, United States
Infrastructure Admin • CareDx, Inc. • Brisbane, CA, US
Principal Software Engineer, ML Infrastructure • Zoox • Foster City, CA
Senior Backend Engineer Infrastructure & DevOps • C3 Ai • Redwood City, California, United States
Senior Infrastructure Engineer | VMware Architect (Menlo Park, CA) #4263 • Grail • Menlo Park, California, United States
ML Infrastructure Engineer • Phizenix • Menlo Park, California, United States
Azure Infrastructure Engineer • MassGenics • San Mateo, California
Azure Infrastructure Engineer • Innova Solutions • San Mateo, California
Staff Software Engineer, Database Infrastructure • Box • Redwood City, California
Staff Software Engineer, AI / ML Infrastructure • Chan Zuckerberg Initiative • Redwood City, CA (Hybrid)
Data Infrastructure Engineer • zaimler • San Mateo, California, United States
AI Observability Infrastructure Engineer • Snowflake • Menlo Park, California, United States
Senior HPC Engineer, Infrastructure Specialist Team • NVIDIA • Remote, CA, US

Software Engineer, Infrastructure
OpenAI • San Francisco; Seattle • Full-time
About the Team
We’re hiring Software Engineers to join our broader Infrastructure organization, which supports multiple high-impact teams. Depending on your interests and experience, you could work on one of several focus areas—including Core Distributed Systems, Reliability Engineering, Observability, Developer Productivity or Cloud Infrastructure.
About the Role
All teams are deeply collaborative, work on mission-critical services, and are responsible for building distributed, scalable infrastructure to bring OpenAI’s technology to the world through products like ChatGPT and the OpenAI API. You’ll work closely with stakeholders to understand infrastructure, data and compute needs, setting the technical strategy that supports cutting-edge research and product development. This is a critical role for someone who is passionate about solving complex engineering problems at scale while ensuring performance, scalability and reliability.
Team Focus Areas
Distributed Systems: Own and build important, highly scalable, available, performant, and reliable distributed systems (and their building blocks) to power the entire stack at OpenAI.
Systems Engineering: Work across layers of the stack, debugging system bottlenecks, evolving core infrastructure, and solving novel problems in performance and scalability.
Reliability Engineering: Build scalable, fault-tolerant systems and lead efforts around service health, incident response, and resilience.
Observability: Design and maintain observability tooling (metrics, logs, tracing) to give teams visibility into production systems at scale.
Developer Productivity: Create tools, environments, and workflows that help engineers ship high-quality software faster and more safely.
Cloud Infrastructure: Own the cloud-native infrastructure (compute, networking, storage) that underpins all services and research workloads.
In this role, you will:
Design, build, and maintain reliable and performant systems used across engineering.
Work with your team to define technical strategy, architecture, and long-term goals.
Collaborate with other engineers, product managers, and researchers to build infrastructure that meets evolving needs.
Improve internal tooling, automation, and developer experience.
Contribute to incident response, postmortems, and the development of best practices around system reliability and scalability.
You might thrive in this role if you have:
Strong software engineering skills with experience in Python, Go, C++, Rust, or similar languages.
Experience designing, operating, or scaling distributed systems or developer infrastructure.
Comfort working in Linux environments, and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks.
Ability to navigate complex systems and a willingness to dig deep when debugging tricky issues.
Excellent communication and collaboration skills, especially in cross-functional settings.
Qualifications:
4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead.
A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement.
Excellent communication skills, with the ability to build consensus among stakeholders both internally and externally.