Talent.com
Senior Quality and Reliability Engineer, Trainium Servers and Systems
This job is no longer accepting applications.
Amazon • Cupertino, CA, United States
Posted: 30 days ago
Job type: Full-time
Job description

Join us at Amazon Web Services (AWS) as a Senior Quality and Reliability Engineer for Trainium Servers and Systems and become part of an innovative team engaged in cutting-edge technology!

The Trainium Manufacturing, Quality and Reliability (MQR) Team is an integral part of AWS Annapurna Labs, focused on designing exceptional Machine Learning products for the world's leading Cloud Services provider. In this role, you will collaborate with experienced professionals across various disciplines to conceive and develop robust infrastructure technologies. Your contributions will be vital to key aspects of product definition, execution, and testing in manufacturing.

Key Responsibilities:

  • Lead the validation of tests for future technologies, ensuring we maintain our high standards of quality.
  • Implement manufacturing process improvements to proactively address any reliability issues.
  • Qualify manufacturing lines for large-scale production, guaranteeing efficiency and effectiveness.
  • Apply your deep understanding of reliability statistics and tests to influence design decisions that enhance product reliability.
  • Identify and assess product and component risks, working with design teams to mitigate them and to define comprehensive test methodologies.
  • Conduct in-depth technological analyses to align with the product roadmap.
  • Provide technical mentorship to fellow engineers, elevating team capabilities.
  • Develop reliability predictions for potential failure mechanisms, covering both products in development and products already deployed.
  • Collaborate with multiple vendors and Original Design Manufacturers (ODMs) to establish standardized manufacturing and reliability expectations.

About the Team:

Our Annapurna Labs subsidiary specializes in creating custom silicon and servers, including the Nitro, Graviton, Inferentia, and Trainium families of processors. The Machine Learning Annapurna (MLA) team is vertically integrated, bringing software, firmware, hardware, and silicon design together within one organization. You will be part of the Manufacturing, Quality, and Reliability team dedicated to Hardware Development, Software Development, and Fleet Operations Systems.

Basic Qualifications:

  • Bachelor's or Master's degree in Reliability Engineering, Physics, or a related field, or equivalent experience.
  • 7+ years of Reliability Engineering experience with server compute platforms or high-tech hardware.

Preferred Qualifications:

  • Master's Degree or PhD in Reliability Engineering or similar discipline.
  • Proven ability to identify and resolve systemic issues prior to New Product Introduction (NPI).
  • Working knowledge of server components including CPU, memory, HDD, SSD, and motherboard.
  • Experience in analytical test planning and procedure development related to server compute platforms.
  • Demonstrated capability to achieve ambitious goals in a fast-paced environment.
  • Adept at driving root cause analysis activities for failure assessment.
  • Ability to effectively collaborate within a diverse team.
  • Experience in reliability modeling and materials characterization.
  • Able to influence development teams, procurement, and external partners.

Amazon is committed to fostering a diverse and inclusive workplace. We are an equal opportunity employer, encouraging applications from candidates of all backgrounds. We value diversity and provide equal employment opportunities for individuals regardless of their race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or any other legally protected status.

Applicants should apply through our internal or external career site.
