Production Network Engineer
Meta's AI Training and Inference Infrastructure is growing exponentially to support the ever-increasing use cases of AI. We need to build, scale, and evolve the network infrastructure that connects myriad GPUs together. Simple, elegant, and scalable network design, automation, and data analytics are the keys to meeting our demands. In this role, you will be part of a team responsible for conceiving design solutions and for developing, testing, and deploying the network software, systems, and tools that keep the Data Center network operating at maximum reliability, scalability, and efficiency. Engineers in this role are hybrid software and network engineers who use their network engineering skills to research and design new generations of network architectures and related systems, and their software development skills to introduce them reliably at scale in production.
Production Network Engineer Responsibilities
- Partner with network hardware, software, and vendor teams on the design and development of network topologies and network platforms (switch and optics)
- Codify the network designs by partnering with the in-house Software Engineering, Tooling, Planning, Simulation, and Delivery teams
- Develop test automation frameworks, integrated into the Continuous Integration / Continuous Deployment pipeline, to qualify the network hardware and software stack for both the in-house Facebook Open Switching System (FBOSS) and vendor platforms before they are pushed to production
- Develop tests that qualify complex network migration procedures in lab / emulation before executing them in production
- Work closely with our hardware, software and sourcing teams to develop new networking solutions and influence the future of networking and its associated infrastructure
- Be on call to learn from real-world production challenges and apply the lessons to improve current and future generations of products
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of experience working on networks supporting large scale training workloads
- Experience in designing, deploying and operating datacenter networks at scale
- Experience coding in languages like Python, C++, Go
- Experience in network automation software leveraging software defined networking principles
- Experience configuring and troubleshooting routing and switching protocols (BGP, IS-IS, OSPF, MPLS, RSVP-TE)
- Working knowledge of network protocols (TCP / UDP, DHCP, DNS) and experience with IPv4 and IPv6
Preferred Qualifications
- Understanding of AI training workloads and the demands they exert on networks
- Understanding of RDMA congestion control mechanisms on RoCE networks
- Working knowledge of 40 / 100 / 400G Ethernet and CWDM, DWDM and optical transport network technologies
- Understanding of different optics and the internals of a switch ASIC