Senior Quality and Reliability Engineer, Trainium Servers and Systems

Amazon • Cupertino, CA, United States

Posted 30 days ago

Job type: Full-time

Job Description:

Join us at Amazon Web Services (AWS) as a Senior Quality and Reliability Engineer for Trainium Servers and Systems and become part of an innovative team engaged in cutting-edge technology!

The Trainium Manufacturing, Quality and Reliability (MQR) Team is an integral part of AWS Annapurna Labs, focused on designing exceptional Machine Learning products for the world's leading Cloud Services provider. In this role, you will collaborate with experienced professionals across various disciplines to conceive and develop robust infrastructure technologies. Your contributions will be vital to key aspects of product definition, execution, and testing in manufacturing.

Key Responsibilities:

Lead the validation of tests for future technologies, ensuring we maintain our high standards of quality.
Implement manufacturing process improvements to proactively address any reliability issues.
Qualify manufacturing lines for large-scale production, guaranteeing efficiency and effectiveness.
Apply your deep understanding of reliability statistics and tests to influence design decisions that enhance product reliability.
Identify and assess product and component risks, working with design teams to mitigate them and define comprehensive test methodologies.
Conduct in-depth technological analyses to align with the product roadmap.
Provide technical mentorship to fellow engineers, elevating team capabilities.
Forecast reliability for potential failure mechanisms, both for products in development and for those currently deployed.
Collaborate with multiple vendors and Original Design Manufacturers (ODMs) to establish standardized manufacturing and reliability expectations.

About the Team:

Our Annapurna Labs subsidiary specializes in creating custom silicon and servers, including the Nitro, Graviton, Inferentia, and Trainium families of processors. The Machine Learning Annapurna (MLA) team is vertically integrated, bringing software, firmware, hardware, and silicon design together within one organization. You will be part of the Manufacturing, Quality, and Reliability team, which supports Hardware Development, Software Development, and Fleet Operations Systems.

Basic Qualifications:

Bachelor's or Master's degree in Reliability Engineering, Physics, or a related field, or equivalent experience.

7+ years of Reliability Engineering experience with server compute platforms or high-tech hardware.

Preferred Qualifications:

Master's Degree or PhD in Reliability Engineering or similar discipline.

Proven ability to identify and resolve systemic issues prior to New Product Introduction (NPI).

Working knowledge of server components including CPU, memory, HDD, SSD, and motherboard.

Experience in analytical test planning and procedure development related to server compute platforms.

Demonstrated capability to achieve ambitious goals in a fast-paced environment.

Adept at driving root cause analysis activities for failure assessment.

Ability to effectively collaborate within a diverse team.

Experience in reliability modeling and materials characterization.

Able to influence development teams, procurement, and external partners.

Amazon is committed to fostering a diverse and inclusive workplace. We are an equal opportunity employer, encouraging applications from candidates of all backgrounds. We value diversity and provide equal employment opportunities for individuals regardless of their race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or any other legally protected status.

Applicants should apply through our internal or external career site.
