Role : Site Reliability Engineer (SRE)
Location : Vienna, VA (4 days onsite, 1 day remote)
Type : Full-time / Contract
Job Summary
We are looking for an experienced Site Reliability Engineer (SRE) to support and monitor critical production systems. The role focuses on 24x7 monitoring, reducing manual work through automation, managing Splunk, and supporting cloud-based Disaster Recovery and Business Continuity processes. You will work closely with Cloud, DevOps, and Application teams to ensure system reliability and availability.
Key Responsibilities
- Provide 24x7 production monitoring and support for critical systems.
- Meet SLAs and follow SRE best practices to reduce manual remediation (toil).
- Build automated remediation solutions to improve system stability.
- Administer and configure Splunk for monitoring and troubleshooting.
- Support gradual (incremental) change rollouts, application monitoring, and automation tasks.
- Participate in Business Continuity, Disaster Recovery (DR), and COOP activities.
- Perform system failover / switchover testing (Cold / Warm / Hot).
- Ensure high availability through fault tolerance, redundancy, and five-nines (99.999%) availability design.
- Monitor and resolve system data synchronization issues.
Required Skills & Experience
- Bachelor’s degree in Computer Science or a related field.
- 6+ years of SRE experience in production environments.
- Strong experience with Splunk administration and configuration.
- Hands-on experience with DR, COOP, and Business Continuity on cloud platforms.
- Good understanding of reliability engineering concepts (HA, redundancy, failover).
- Strong troubleshooting, problem-solving, and communication skills.
- Ability to work in a collaborative team environment.