Talent.com
Product Infrastructure Engineer - Site Reliability
Zyphra • Palo Alto, California, United States
Posted 30 days ago
Job type: Full-time
Job description

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:

Building and improving observability systems (monitoring, logging, alerting)

Designing resilient build and deployment systems across research and production environments

Implementing secure release processes with strong auditability and rollback support

Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance

Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:

Experience in high-performance compute environments, such as ML clusters or GPU farms

Background in infrastructure as code (e.g., Ansible, Terraform)

Familiarity with software release engineering for ML / AI systems is a plus

Experience designing reliable environments for experimental workloads and reproducible runs

Knowledge of compliance and audit standards in deployment and system security

Experience with load testing, fault injection, and chaos engineering to harden systems under stress

Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)

Prior work supporting ML / AI infrastructure, including GPU management and workload optimization

Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Experience working with cloud platforms such as AWS, Azure, or GCP

Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

Our research methodology is to make grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued

We strongly value new and crazy ideas and are very willing to bet big on them

We move as quickly as we can; we aim to keep the bar to impact as low as possible

We all enjoy what we do and love discussing AI

Benefits and Perks:

Comprehensive medical, dental, vision, and FSA plans

Competitive compensation and 401(k)

Relocation and immigration support on a case-by-case basis

On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply today!
