Role Overview
We’re seeking a Site Reliability Engineer (SRE) with strong full-stack development expertise and hands-on experience in observability, automation, and reliability engineering. The ideal candidate will design monitoring solutions, optimize system performance, and drive reliability across distributed applications and infrastructure.
Must-Have Technical Skills (Level 3 – 5–7+ Years)
- Full Stack Development: Strong ability to navigate across front-end, back-end, and infrastructure layers for debugging and optimization.
- Observability: Deep understanding of logs, metrics, and traces for system monitoring and diagnostics.
- Monitoring & Analysis Tools: Dynatrace, BigPanda, Evolven, ThousandEyes
Nice to Have Skills
- Advanced experience with Grafana or Kibana for analytics and visualization.
- Familiarity with cloud platforms (AWS / Azure / GCP) and infrastructure-as-code tools.
Key Responsibilities
- Define and implement standardized methods to collect and analyze logs, traces, and metrics across systems and applications.
- Develop dashboards and monitoring frameworks to improve visibility into system health and performance.
- Collaborate with development teams to enhance service reliability, optimize deployments, and streamline release processes.
- Conduct root cause analysis, performance tuning, and fault detection using observability tools.
- Participate in system design reviews, platform management, and capacity planning.
- Build automation pipelines to reduce manual operations, improve efficiency, and ensure sustainable systems.
- Establish and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure uptime and performance standards are met.
Education
Bachelor’s degree in Computer Science, Engineering, or a related field preferred, but not required.