NOTE: Candidates requiring sponsorship now or in the future (including CPT/OPT) cannot be considered for this job.
No C2C
Candidates will be required to work on-site 3 days per week in south Salt Lake County.
On-site interviews are required. Local candidates only.
SRE / Platform Engineering Manager
Overview
We are seeking a hands-on Platform Engineering / SRE Manager to lead a small, high-impact team responsible for maintaining and improving the reliability, performance, and scalability of our production systems. This role blends technical leadership and operational excellence, managing a group of Site Reliability and Platform Engineers who ensure our applications and infrastructure run smoothly in production.
The ideal candidate is a player-coach, comfortable leading incident response efforts, mentoring engineers, and still contributing technically through infrastructure automation, observability improvements, and system reliability enhancements.
Key Responsibilities
- Lead and mentor a team of SREs and Platform Engineers (currently five members) focused on production stability, system automation, and operational readiness.
- Own the reliability lifecycle, driving proactive monitoring, on-call response leadership, and post-incident reviews to minimize downtime and improve service quality.
- Develop and evolve infrastructure automation using Terraform, Helm, and related Infrastructure-as-Code practices to standardize deployments and reduce manual interventions.
- Partner with product, software, and operations teams to implement scalable cloud solutions that meet performance and resiliency targets.
- Oversee observability and telemetry using tools like Grafana, Azure Insights, Datadog, or Dynatrace, ensuring comprehensive visibility into system health.
- Drive the definition and tracking of SLOs, SLIs, and SLAs, helping teams measure and continuously improve reliability standards.
- Collaborate with engineering leads to enhance developer platform capabilities, such as automating workflows, managing CI/CD pipelines, and simplifying environment provisioning.
What You’ll Bring
- Bachelor’s degree in Computer Science, Information Technology, or equivalent practical experience.
- 7+ years in infrastructure, SRE, or platform engineering roles, including 3+ years in leadership or team management.
- Strong background in cloud infrastructure (AWS, Azure, or GCP) and hands-on experience with IaC tools such as Terraform.
- Familiarity with CI/CD pipelines, container orchestration, and deployment frameworks (e.g., Jenkins, GitHub Actions, Kubernetes, Docker).
- Experience improving system observability, developing dashboards, and managing alerting systems using Grafana or similar platforms.
- Competence in Python, Go, or C# for automation and troubleshooting.
- Solid understanding of relational databases (SQL) and the ability to guide teams in identifying and resolving performance bottlenecks.
- Demonstrated ability to lead incident management, communicate effectively across teams, and create a culture of continuous improvement.
Preferred Experience
- Experience with developer enablement or internal platform engineering initiatives (e.g., self-service infrastructure or environment provisioning).
- Familiarity with data-driven operational metrics and applying analytics to improve system reliability.
- Prior experience managing a hybrid or remote technical team across time zones.
Work Style
- Approximately 30% hands-on technical contribution and 70% team leadership, process improvement, and coordination.
- Availability to participate in daytime and occasional off-hours on-call support rotations.
- Commitment to building a proactive, reliability-first culture that values automation, transparency, and cross-functional collaboration.