Talent.com
Staff Site Reliability Engineer
Topstep • United States
Job type: Full-time

Summary

Are you a systems-minded engineer who thrives on building resilient infrastructure, driving operational excellence, and enabling teams to move fast with confidence? As a Staff Site Reliability Engineer at Topstep, you'll play a foundational role in shaping how we approach reliability, observability, and infrastructure at scale. You'll be instrumental in building out our SRE practice, defining our incident response culture, closing observability gaps, and optimizing our AWS infrastructure for both performance and cost. This role is ideal for someone who brings both deep technical expertise and a builder's mindset: someone who's excited to establish best practices from the ground up, embed reliability into engineering culture, and create the foundations that let teams ship with speed and confidence. Join us and help define what operational excellence looks like at Topstep.

Key Responsibilities

  • Set technical direction for reliability and observability across the entire engineering organization, influencing architectural decisions
  • Build and mature our SRE practice by defining SLOs, incident response protocols, and on-call standards
  • Own the observability stack using DataDog (primary platform for metrics, APM, logging) and CloudWatch (AWS-native monitoring), instrumenting distributed tracing and closing gaps that currently prevent diagnosis of production issues
  • Partner with engineering teams to embed reliability principles early in the design process and improve system resilience
  • Lead incident response and blameless post-mortems, turning outages into opportunities for systematic improvement
  • Mentor engineers across the organization on reliability practices, operational thinking, and production ownership
  • Champion a culture of transparency, continuous improvement, and shared ownership of production systems

Required Qualifications and Key Competencies

  • 7+ years of professional experience in SRE, infrastructure, or platform engineering, with demonstrated impact building practices that scaled across multiple teams
  • Proven track record of either starting an SRE function from scratch or scaling an existing practice, with measurable improvements to MTTR, MTTD, change failure rate, or availability
  • Strong proficiency with DataDog for end-to-end observability (metrics, APM, logs, distributed tracing) and building alerting that catches real issues without causing fatigue
  • Deep expertise with AWS infrastructure (EKS, ECS, EC2, and RDS) running production services at scale, and hands-on experience optimizing for both reliability and cost
  • Solid foundation in distributed systems, networking, database performance, and debugging complex system failures across service boundaries
  • Comfortable reading code, writing automation scripts, and contributing to infrastructure tooling when needed
  • Proficiency with infrastructure as code (Terraform) and GitOps practices
  • Track record of influencing engineering culture through documentation, tooling, mentorship, and technical leadership
  • Excellent communication skills with the ability to explain complex system behavior and trade-offs to varied audiences
  • Comfortable making pragmatic trade-offs between long-term platform vision and immediate business needs

Company Culture & Perks

  • Topstep offers an engaging working environment that ranges from fully remote to hybrid. We foster a culture of collaboration, with cameras on during meetings and a robust Slack environment for communication.
  • 10 company-paid holidays and generous family leave. Paid time off is accrued monthly.
  • Competitive 401(k) matching and health, dental, and vision insurance are offered to full-time employees.
  • Vacations are encouraged, with a bonus for taking 5 consecutive days. Employee referrals earn a bonus. Topstep offers a food and groceries budget and contributes towards health and wellness.

New Hire Base Salary Range

  • $200,000-$250,000
  • Bonus: This position is eligible for a performance-based bonus as provided by the plan terms and governing documents.
  • The compensation offered will take into account internal compensation structure and may vary depending on the candidate's geographic region, job-related knowledge, skills, and experience among other factors.

Equal Opportunity Employer

Topstep is an Equal Opportunity Employer. We are committed to fostering an inclusive environment where all employees and applicants are valued. All qualified candidates will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, age, disability, or veteran status, in compliance with applicable federal, state, and local laws.

Interested in the role? Apply today with your resume and cover letter!

At this time, immigration sponsorship is not available for this position (including H-1B, STEM OPT training plans, etc.).

Related jobs

Principal Site Reliability Engineer
Expel • Remote, United States • Full-time
Your passion for uptime was forged from experience in production and refined through incident response. You’re an Expel Principal Site Reliability Engineer - a protector, champion, and leader of Exp...

Site Reliability Engineer
Real Time Technologies • Remote, United States • Full-time
Realtime technologies, LLC offers the most flexible cutting-edge Retail Management Solutions that encompass sales, inventory management, frontline employee management and engagement, payments, busi...

Staff Site Reliability Engineer
Palmetto Clean Technology • Remote, United States • Full-time
Palmetto is a leading clean tech company on a mission to accelerate the transition to a clean energy future. With a belief that consumers can. Our award-winning technology platform empowers homeowner...

Site Reliability Engineer (SRE)
Lightfeather.io • United States • Full-time
LightFeather is seeking a Site Reliability Engineer (SRE) with strong GitLab platform expertise to support and enhance enterprise DevSecOps and collaboration environments. The ideal candidate thrive...

Staff Site Reliability Engineer - Platform
IonQ • Remote, United States • Full-time
IonQ is developing the world's most powerful full-stack quantum computer based on trapped-ion technology. We are pushing past the limits of classical physics and current supercomputing technology to...

Site Reliability Engineer
Cutover • Remote, United States • Full-time
An inclusive work environment is an empowering one. At Cutover, we lead with empathy and enable others to succeed through curiosity, kindness, and self-expression. Location: Remote, United States. Shi...

Site Reliability Engineer
SafeRide Health • United States • Remote • Full-time
SafeRide Health is seeking a Site Reliability Engineer to develop and implement new processes that support software delivery excellence and operational discipline, to ensure that user-facing servic...

Senior / Principal Site Reliability Engineer
Datacrunch • Remote, United States • Full-time
Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI m...

Principal Site Reliability Engineer
Blue River Technology • Remote, United States • Full-time
We’re Blue River, a team of innovators driven to create intelligent machinery that solves monumental problems for our customers. We empower our customers – farmers, construction crews, and foresters...

Staff Site Reliability Engineer
SentinelOne • Remote, United States • Full-time
Please note that under Federal & FedRAMP regulations, hiring for this role is limited to US citizens only. FedRamp Staff may be subject to customer or third-party background checks up to and includi...

Senior Site Reliability Engineer - Growth
Kraken • United States • Remote • Full-time
Building the Future of Crypto. Our Krakenites are a world-class team with crypto conviction, united by our desire to discover and unlock the potential of crypto and blockchain technology. Kraken is ...

Staff Site Reliability Engineer - Spacetime
Aalyria • Remote, United States • Full-time
This isn't a "keep the lights on" SRE role. This is a strategic, high-impact opportunity to build the nervous system for a platform that transforms how networks of satellites, ground stations, and f...

Site Reliability Engineer
EngFlow Inc. • US • Remote • Full-time
Our cloud-based, distributed service optimizes developer workflows through remote execution and caching, improving efficiency, productivity, and product quality. Backed by top investors, EngFlow is ...

Senior Site Reliability Engineer, Arlington
Onebrief • Remote, United States • Full-time
Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. By transforming this work, Onebrief makes the staff as a whole superhuman - meaning faster, smar...

Staff / Principal Site Reliability Engineer
Veza Technologies • Remote, United States • Full-time
Staff / Principal Site Reliability Engineer. You'll architect scalable solutions, navigate complex technical challenges independently, and deliver results under tight deadlines in a fast-paced environ...

Senior Site Reliability Engineer
Patreon • Remote, United States • Full-time
Patreon is a media and community platform where over 300,000 creators give their biggest fans access to exclusive work and experiences. We offer creators a variety of ways to engage with their fans ...

Senior Site Reliability Engineer
ScienceLogic • Remote, United States • Full-time
ScienceLogic is redefining IT operations for the modern enterprise. Our AIOps platform empowers organizations to achieve Autonomic IT — where systems are self-healing, self-optimizing, and seamlessl...

Site Reliability Engineer - Spacetime
Aalyria • Remote, United States • Full-time
This isn't a "keep the lights on" SRE role. This is a strategic, high-impact opportunity to build the nervous system for a platform that transforms how networks of satellites, ground stations, and f...