Join us at Amazon Web Services (AWS) as a Senior Quality and Reliability Engineer for Trainium Servers and Systems and become part of an innovative team engaged in cutting-edge technology!
The Trainium Manufacturing, Quality and Reliability (MQR) Team is an integral part of AWS Annapurna Labs, focused on designing exceptional Machine Learning products for the world's leading Cloud Services provider. In this role, you will collaborate with experienced professionals across various disciplines to conceive and develop robust infrastructure technologies. Your contributions will be vital to key aspects of product definition, execution, and testing in manufacturing.
Key Responsibilities :
- Lead the validation of tests for future technologies, ensuring we maintain our high standards of quality.
- Implement manufacturing process improvements to proactively address any reliability issues.
- Qualify manufacturing lines for large-scale production, guaranteeing efficiency and effectiveness.
- Apply your deep understanding of reliability statistics and tests to influence design decisions that enhance product reliability.
- Identify and assess product and component risks, working alongside design teams to mitigate them and define comprehensive test methodologies.
- Conduct in-depth technological analyses to align with the product roadmap.
- Provide technical mentorship to fellow engineers, elevating team capabilities.
- Forecast reliability for potential failure mechanisms, both for products in development and for those currently deployed.
- Collaborate with multiple vendors and Original Design Manufacturers (ODMs) to establish standardized manufacturing and reliability expectations.
About the Team :
Our Annapurna Labs subsidiary specializes in creating custom silicon and servers, including the Nitro, Graviton, Inferentia, and Trainium families of processors. The Machine Learning Annapurna (MLA) team encompasses a vertically integrated structure, incorporating software, firmware, hardware, and silicon design within one organization. You will be part of the Manufacturing, Quality, and Reliability team dedicated to Hardware Development, Software Development, and Fleet Operations Systems.
Basic Qualifications :
- Bachelor's or Master's degree in Reliability Engineering, Physics, or a related field, or equivalent experience.
- 7+ years of Reliability Engineering experience with server compute platforms or high-tech hardware.
Preferred Qualifications :
- Master's Degree or PhD in Reliability Engineering or similar discipline.
- Proven ability to identify and resolve systemic issues prior to New Product Introduction (NPI).
- Working knowledge of server components including CPU, memory, HDD, SSD, and motherboard.
- Experience in analytical test planning and procedure development related to server compute platforms.
- Demonstrated capability to achieve ambitious goals in a fast-paced environment.
- Adept at driving root cause analysis activities for failure assessment.
- Ability to effectively collaborate within a diverse team.
- Experience in reliability modeling and materials characterization.
- Able to influence development teams, procurement, and external partners.
Amazon is committed to fostering a diverse and inclusive workplace. We are an equal opportunity employer, encouraging applications from candidates of all backgrounds. We value diversity and provide equal employment opportunities for individuals regardless of their race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or any other legally protected status.
Applicants should apply through our internal or external career site.