LLM Training Frameworks and Optimization Engineer

Together AI • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

Role

At Together AI, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

Framework Development and Optimization:
Design, implement, and optimize distributed training frameworks tailored for large language models.
Develop custom modules, plugins, and features to enhance framework scalability and performance.
Algorithmic and Systems Optimization:
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.
Performance Tuning:
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.
Scalability and Resilience:
Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.
Collaboration and Support:
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Requirements

Must-Have:

Experience:

5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:

Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).

Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).

Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:

Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:

Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).

Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:

Analytical problem-solving skills and a focus on performance improvement.

Strong collaboration and communication skills across teams.

Nice-to-Have

Familiarity with graph optimization and compiler-level performance tuning.

Contributions to open-source deep learning or distributed training projects.

Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
