Lead Site Reliability Engineer
Company: Bridge Defense
Location: Washington
Posted on: April 2, 2026
Job Description:

About the Role

As the Lead Site Reliability Engineer for our ComputeBridge
Engagement, you’ll be responsible for the reliability, scalability,
and performance of one of the largest hardware and AI infrastructure
efforts in the U.S. defense sector. You will lead the deployment,
management, and automation of a high-performance computing mesh
across multiple secure environments, ensuring operational excellence
and mission continuity for a 9-figure government program. This is a
hands-on engineering leadership role that bridges physical
infrastructure and modern DevOps automation, ideal for someone who
thrives at the intersection of hardware systems, distributed
computing, and AI/ML workflows.

What You’ll Do

- Lead infrastructure design, deployment, and operations for
  ComputeBridge hardware clusters across secure and distributed
  environments
- Install and configure physical systems, including high-density GPU
  servers, networking gear, and storage arrays
- Build and deploy secure Linux images and containerized workloads
  using OpenShift and other orchestration platforms
- Develop and manage automation pipelines for provisioning,
  configuration management, and monitoring using modern DevOps
  toolchains (Ansible, Terraform, etc.)
- Operate and maintain distributed networking meshes across multiple
  classified and unclassified domains
- Implement and manage out-of-band management tools (IPMI, iDRAC,
  BMC, etc.) for remote troubleshooting and control
- Integrate and optimize NVIDIA GPU infrastructure for AI/ML training
  and inference workloads
- Collaborate with mission engineers, software teams, and government
  operators to ensure system readiness and performance
- Provide on-site technical leadership for deployments,
  troubleshooting, and continuous improvement
- Mentor junior engineers and establish operational best practices
  across the ComputeBridge program as the contract grows

What You’ll Bring

- 3 years of experience in site reliability, systems engineering, or
  hardware operations roles
- Deep expertise with physical infrastructure: server racking,
  cabling, diagnostics, and troubleshooting
- Strong experience with Linux systems administration, imaging, and
  automated deployment
- Hands-on experience managing large-scale clusters or distributed
  systems in OpenShift or Kubernetes environments
- Familiarity with DevOps automation (Ansible, Terraform, CI/CD
  pipelines)
- Experience configuring and managing networking and mesh
  architectures
- Direct experience with NVIDIA GPUs, CUDA, and related AI/ML
  frameworks
- Proficiency with out-of-band management and IPMI/iDRAC tooling
- Certifications: Linux and Security (required or in-progress)
- Excellent communication, documentation, and problem-solving skills
- Clearance: Active TS/SCI required or ability to obtain

Bonus Points For

- Experience operating in secure DoD or intelligence environments
- Familiarity with Palantir platforms or other government data
  systems
- Prior experience supporting AI/ML infrastructure in production or
  tactical settings
- Experience with performance tuning and monitoring of HPC or
  GPU-accelerated clusters

General Factors:

- Depending on project requirements, may be required to work within
  a compressed schedule; overtime should be expected when schedules
  demand it.
- Willing to travel, if needed.
- No relocation.

Why Bridge Defense

- Shape how advanced computing supports national security missions
  at scale
- Lead engineering for a major government program with direct
  mission impact
- Competitive compensation, benefits, and growth opportunities in a
  mission-driven environment

Bridge Defense is committed to building a collaborative and
mission-focused team. Bridge Defense reserves the right to modify
job duties or requirements at any time. Employment with Bridge
Defense is at-will. Candidates must be eligible to work in the
United States and complete any required background checks or
security clearance processes as a condition of employment.
Keywords: Bridge Defense, Rockville, Lead Site Reliability Engineer, IT / Software / Systems, Washington, Maryland