Talent.com
Lead Site Reliability Engineer

Bridge Defense • Washington, DC, United States
Job type: Full-time

About Bridge Defense

Bridge Defense is redefining how modern defense technology is delivered. Based in Washington, D.C., we are built for the dynamic mission environment facing the Department of Defense, the Intelligence Community, and federal law enforcement agencies. We provide full-spectrum national security solutions that combine secure infrastructure, cleared talent, and mission-ready software to meet evolving defense challenges. Our services include secure software development in classified environments and the design and implementation of advanced IT and cybersecurity capabilities ranging from secure cloud architectures and enterprise infrastructure to data center operations, scientific analysis, and cutting‑edge cyber defense.

We are led by technologists and veterans with firsthand mission experience, which enables us to understand both the operational realities and the innovation needed to succeed. Our approach is agile and outcome‑based, delivering results in weeks rather than months whenever possible.

At Bridge Defense we value people, integrity, and excellence. We foster an environment where innovation thrives in support of traditional mission requirements. Our team members receive competitive compensation, robust benefits, professional development and certification opportunities, and clear paths for growth while working on the nation’s most critical projects.

Core Values:

  • Innovation & Responsiveness: We push beyond legacy models with efficient, tech‑led solutions built to scale and evolve.
  • Trusted Performance: Security, compliance, and deep experience in delivering to demanding environments guide all we do.
  • Mission Focused Expertise: From veteran leadership to cleared engineers, our people understand both the technology and the mission.

About the Role

As the Lead Site Reliability Engineer for our ComputeBridge Engagement, you’ll be responsible for the reliability, scalability, and performance of one of the largest hardware and AI infrastructure efforts in the U.S. defense sector. You will lead the deployment, management, and automation of a high‑performance computing mesh across multiple secure environments, ensuring operational excellence and mission continuity for a 9‑figure government program.

This is a hands‑on engineering leadership role that bridges physical infrastructure and modern DevOps automation, ideal for someone who thrives at the intersection of hardware systems, distributed computing, and AI/ML workflows.

What You’ll Do

  • Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments
  • Install and configure physical systems, including high‑density GPU servers, networking gear, and storage arrays
  • Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
  • Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
  • Operate and maintain distributed networking meshes across multiple classified and unclassified domains
  • Implement and manage out‑of‑band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
  • Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
  • Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
  • Provide on‑site technical leadership for deployments, troubleshooting, and continuous improvement
  • Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows

What You’ll Bring

  • 3+ years of experience in site reliability, systems engineering, or hardware operations roles
  • Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
  • Strong experience with Linux systems administration, imaging, and automated deployment
  • Hands‑on experience managing large‑scale clusters or distributed systems in OpenShift or Kubernetes environments
  • Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
  • Experience configuring and managing networking and mesh architectures
  • Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks
  • Proficiency with out‑of‑band management and IPMI/iDRAC tooling
  • Certifications: Linux+ and Security+ (required or in progress)
  • Excellent communication, documentation, and problem‑solving skills
  • Clearance: Active TS/SCI required, or the ability to obtain one

Bonus Points For

  • Experience operating in secure DoD or intelligence environments
  • Familiarity with Palantir platforms or other government data systems
  • Prior experience supporting AI/ML infrastructure in production or tactical settings
  • Experience with performance tuning and monitoring of HPC or GPU‑accelerated clusters

General Factors:

  • Depending on project requirements, you may be required to work a compressed schedule; expect overtime when schedules demand it.
  • Willing to travel, if needed.
  • No Relocation.

Why Bridge Defense

  • Shape how advanced computing supports national security missions at scale
  • Lead engineering for a major government program with direct mission impact
  • Competitive compensation, benefits, and growth opportunities in a mission‑driven environment

Bridge Defense is committed to building a collaborative and mission‑focused team. Bridge Defense reserves the right to modify job duties or requirements at any time. Employment with Bridge Defense is at‑will. Candidates must be eligible to work in the United States and complete any required background checks or security clearance processes as a condition of employment.
