Talent.com
Software Engineer, SystemML - Scaling / Performance

META • Menlo Park, CA, United States
Job type
  • Full-time
Job description

In this role, you will be a member of the Network.AI Software team, part of the larger DC networking organization. The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives. NCCL is integrated into PyTorch and sits on the critical path of multi-GPU distributed training. In other words, nearly every distributed GPU-based ML workload in Meta production goes through the software stack the team owns. At a high level, the team aims to enable Meta-wide ML products and innovations to leverage our large-scale GPU training and inference fleet through an observable, reliable and high-performance distributed AI / GPU communication stack. Currently, one of the team's focuses is building customized features, software benchmarks, performance tuners and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance (e.g. large-scale GenAI / LLM training), from the trainer down to the inter-GPU and network communication layer. We are seeking engineers to work on GenAI / LLM scaling reliability and performance.

Responsibilities

Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI / LLM scaling

Minimum Qualifications

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following machine learning / deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)

Preferred Qualifications

  • Knowledge of GPU architectures and CUDA programming
  • Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
  • Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models
  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
  • Experience in HPC and parallel computing
  • Knowledge of ML, deep learning and LLM
  • Experience with NCCL and distributed GPU reliability / performance improvement on RoCE / InfiniBand

About Meta

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today, beyond the constraints of screens, the limits of distance, and even the rules of physics.

Equal Employment Opportunity

Meta is proud to be an Equal Employment Opportunity employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics. You may view our Equal Employment Opportunity notice here.
