Talent.com
Site Reliability Engineer (SRE)
Baseten • San Francisco, CA, United States
Job type: Full-time

About Baseten

Baseten powers inference for the world's most dynamic AI companies, like OpenEvidence, Clay, Mirage, Gamma, Sourcegraph, Writer, Abridge, Bland, and Zed. By uniting applied AI research, flexible infrastructure, and seamless developer tooling, we enable companies operating at the frontier of AI to bring cutting‑edge models into production. With our recent $150M Series D funding, backed by investors including BOND, IVP, Spark Capital, Greylock, and Conviction, we’re scaling our team to meet accelerating customer demand.

The Role

As a Site Reliability Engineer, you'll envision and build robust systems and processes that ensure our infrastructure is scalable, reliable, and efficient. This can range from automating deployments and monitoring systems to optimizing performance and managing incidents.

We all work closely with our users, learning from their past struggles in operationalizing ML, onboarding them onto our platform, and turning our learnings into ideas for improving Baseten.

Example Initiatives

  • Multi‑cloud capacity management
  • Inference on B200 GPUs
  • Multi‑node inference
  • Fractional H100 GPUs for efficient model serving

Responsibilities

  • Build and maintain scalable infrastructure to support the deployment and operation of machine learning models.
  • Establish standards and best practices for reliability and performance across the infrastructure.
  • Automate processes when relevant, particularly for managing CI/CD pipelines.
  • Own products and projects end‑to‑end, functioning as both an engineer and a project manager, with a focus on user empathy, project specification, and end‑to‑end execution.
  • Collaborate with cross‑functional teams to understand project requirements and translate them into technical solutions.
  • Mentor junior team members and contribute to knowledge sharing within the organization.
  • Navigate ambiguity and exercise good judgment on tradeoffs and tools needed to solve problems, avoiding unnecessary complexity.
  • Demonstrate pride, ownership, and accountability for your work, expecting the same from your teammates.

Requirements

  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
  • 5+ years of professional work experience in a fast‑paced, high‑growth environment.
  • Extensive experience with Kubernetes.
  • Experience in building and maintaining scalable infrastructure.
  • Experience with infrastructure‑as‑code tools (e.g., Terraform, CloudFormation, Pulumi) and CI/CD tooling (e.g., GitHub Actions, GitLab CI, CircleCI, Jenkins).
  • Relevant OSS observability experience (Prometheus, ELK stack, Grafana stack, OpenTelemetry) is a plus.
  • Ability to own projects end‑to‑end, from project specification to execution.
  • No prior machine learning experience is required, but you should be open to learning about it.

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Generous PTO policy including company‑wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
  • Paid parental leave.
  • Company‑facilitated 401(k).
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Apply now to embark on a rewarding journey in shaping the future of AI! If you are a motivated individual with a passion for machine learning and a desire to be part of a collaborative and forward‑thinking team, we would love to hear from you.

At Baseten, we are committed to fostering a diverse and inclusive workplace. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, genetic information, disability, or veteran status.
