Talent.com
Product Infrastructure Engineer - Site Reliability
Zyphra • Palo Alto, California, United States
Posted 30 days ago
Job type: Full-time
Job Description

Zyphra is an artificial intelligence company based in Palo Alto, California.

The Role:

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments.

You’ll work across:

Building and improving observability systems (monitoring, logging, alerting)

Designing resilient build and deployment systems across research and production environments

Implementing secure release processes with strong auditability and rollback support

Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance

Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Requirements:

Experience in high-performance compute environments, such as ML clusters or GPU farms

Background in infrastructure as code (e.g., Ansible, Terraform)

Familiarity with software release engineering for ML / AI systems is a plus

Experience designing reliable environments for experimental workloads and reproducible runs

Knowledge of compliance and audit standards in deployment and system security

Experience with load testing, fault injection, and chaos engineering to harden systems under stress

Passion for building tooling that makes infrastructure invisible and reliable for end users

Bonus Qualifications:

Experience with infrastructure as code (e.g., Ansible, Terraform)

Prior work supporting ML / AI infrastructure, including GPU management and workload optimization

Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)

Experience working with cloud platforms such as AWS, Azure, or GCP

Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

Why Work at Zyphra:

Our research methodology is to take grounded, methodical steps toward ambitious goals. Deep research and engineering excellence are equally valued

We strongly value new and crazy ideas and are very willing to bet big on them

We move as quickly as we can; we aim to keep the bar to impact as low as possible

We all enjoy what we do and love discussing AI

Benefits and Perks:

Comprehensive medical, dental, vision, and FSA plans

Competitive compensation and 401(k)

Relocation and immigration support on a case-by-case basis

On-site meals prepared by a dedicated culinary team; Thursday Happy Hours

In-person team in Palo Alto, CA, with a collaborative, high-energy environment

If you are excited to bring reliability best practices to the frontier of AI infrastructure, this job is for you. Apply Today!
