Talent.com
LLM Training Frameworks and Optimization Engineer
Together AI • San Francisco, CA, United States
Posted 30 days ago
Job type: Full-time
Job description

Role

At Together.ai, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

  • Framework Development and Optimization:
      • Design, implement, and optimize distributed training frameworks tailored for large language models.
      • Develop custom modules, plugins, and features to enhance framework scalability and performance.
  • Algorithmic and Systems Optimization:
      • Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
      • Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.
  • Performance Tuning:
      • Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
      • Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.
  • Scalability and Resilience:
      • Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
      • Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.
  • Collaboration and Support:
      • Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
      • Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Requirements

Must-Have:

  • Experience:
      • 5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.
  • Technical Skills:
      • Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).
      • Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).
      • Familiarity with GPU/TPU hardware and deep learning performance optimizations.
  • Programming:
      • Proficient in Python and C++ or CUDA for high-performance computing.
  • Optimization Techniques:
      • Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).
      • Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.
  • Soft Skills:
      • Analytical problem-solving skills and a focus on performance improvement.
      • Strong collaboration and communication skills across teams.

Nice-to-Have:

  • Familiarity with graph optimization and compiler-level performance tuning.
  • Contributions to open-source deep learning or distributed training projects.
  • Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development

