Talent.com
Cluster Infrastructure Engineer
Cartesia • San Francisco, California, United States
Job type: Full-time

About Cartesia

Our mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs wherever you are. Today, not even the best models can continuously process and reason over a year-long stream of audio, video and text (1B text tokens, 10B audio tokens and 1T video tokens), let alone do this on-device.

We're pioneering the model architectures that will make this possible. Our founding team met as PhDs at the Stanford AI Lab, where we invented State Space Models (SSMs), a new primitive for training efficient, large-scale foundation models. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering team to build and ship cutting-edge models and experiences.

We're funded by leading investors at Index Ventures and Lightspeed Venture Partners, along with Factory, Conviction, A Star, General Catalyst, SV Angel, Databricks and others. We're fortunate to have the support of many amazing advisors, and 90+ angels across many industries, including the world's foremost experts in AI.

About the Role

We’re looking for a Cluster Infrastructure Engineer to help build and scale the compute backbone that powers Cartesia’s research on real-time, multimodal intelligence. In this role, you’ll work at the intersection of distributed systems and infrastructure engineering, designing and operating the large-scale GPU clusters that train and serve Cartesia’s foundation models. You’ll own systems that need to be fast, reliable, and highly automated — ensuring our researchers and product teams can move at the speed of innovation. You’ll build the tooling, automation, and monitoring needed to keep clusters resilient under load, quickly diagnose and resolve issues, and continuously push the boundaries of scalability and efficiency.

Your Impact

Design and build large-scale GPU clusters for model training and low-latency inference

Develop automation for provisioning, scaling, and monitoring to ensure clusters are fast, resilient, and self-healing

Collaborate closely with research and product teams to enable distributed training at scale, optimizing for speed, reliability, and utilization

Implement robust observability and alerting systems to monitor GPU health, node stability, and job performance

Diagnose and triage hardware, networking, and distributed training issues across environments, coordinating with provider support as needed

Continuously improve cluster reliability, developer ergonomics, and overall system efficiency across Cartesia’s research and production workloads

What You Bring

Strong engineering fundamentals and experience building and operating large-scale distributed systems

Deep familiarity with HPC & GPU cluster management using Kubernetes and Slurm

A blend of developer empathy and raw performance engineering, designing systems and tools that are intuitive to use and fast

Ability to balance principled engineering with the urgency of keeping mission-critical systems alive

Proficiency with Infrastructure-as-Code tools (Terraform, Ansible, etc.) and observability tools (Prometheus, Grafana, etc.)

Strong debugging skills: comfortable diagnosing NCCL issues, CUDA errors, and network or driver-level faults

What Sets You Apart

Experience optimizing large-scale distributed training frameworks such as DeepSpeed, Megatron-LM, or similar

Familiarity with advanced parallelization techniques such as FSDP, context parallelism, or tensor parallelism

Our Culture

🏢 We’re an in-person team based out of San Francisco. We love being in the office, hanging out together and learning from each other every day.

🚢 We ship fast. All of our work is novel and cutting edge, and execution speed is paramount. We have a high bar, and we don’t sacrifice quality and design along the way.

🤝 We support each other. We have an open and inclusive culture that’s focused on giving everyone the resources they need to succeed.

