Site Reliability Engineer
Company: Cirrus Group Consulting
Location: Reston
Posted on: February 13, 2026
Job Description:
Site Reliability Engineer
LOCATION: Reston, VA

SUMMARY OF POSITION
The Site Reliability Engineer (SRE) is responsible for the
optimization and reliability of core technical platforms and
platform services, and for exercising significant technical
leadership in the continuous improvement of service reliability for
platform stakeholders. The SRE will champion the overall health of
OF core technical platforms, lead the response to operational
incidents, determine root causes, and propose and implement
remediations that ensure overall platform viability.

OF IT platforms and infrastructure span
three on-premises locations: Office Headquarters (Reston, VA),
Primary Data Center Co-Location (Sterling, VA), and Disaster
Recovery Data Center Co-Location (Chicago, IL). A limited set of
infrastructure services is also provided by Microsoft Azure
(“Azure”). The core technical platform is Red Hat OpenShift, with a
variety of platform services, including but not limited to Red Hat
AMQ, HashiCorp Vault, and Keycloak, that are consumed by various
platform stakeholders. This role spans from the OpenShift platform
to the services provided by Azure.

We’re proud of the way our teammates have a positive impact
on everything we do. Our employees are committed to and exemplify
our Core Values:
- Integrity through accountability, consistency, transparency, and
  trust
- Agility through adaptability, continuous improvement, expertise,
  and flexibility
- Partnership through collaboration, communication, leadership, and
  teamwork
- Inclusivity through diversity, relationships, respect, and support

PRINCIPAL RESPONSIBILITIES
- Maintain overall health and reliability of core technical
  platforms and platform services to ensure business continuity and
  high availability.
- Maintain and improve the end-to-end observability of the platform,
  to ensure that platform state is understood at all times in
  context with supporting information and data that can be quickly
  marshalled into action.
- Lead incident response, root-cause analysis, and postmortems that
  advance the overall health of the system and prevent or diminish
  recurrence of platform issues.
- Partner with development teams to troubleshoot platform issues,
  including deployment, routing, and configuration challenges.
- Build and maintain automated deployment pipelines that support
  engineering, development, and data teams.
- Write, test, and deploy solutions that reduce unneeded human
  intervention and improve quality.
- Lead the delivery of new platform features, services, and
  capabilities.
- Prioritize, deliver, and operate new platform capabilities,
  products, and services.
- Develop and maintain accurate and up-to-date documentation,
  including but not limited to operational procedures, deployment
  plans, and incident response plans.
- Participate in on-call rotation.
- Assist with other job duties as assigned.

PRINCIPAL JOB REQUIREMENTS
- Bachelor's degree in computer science or related field, or
  equivalent experience.
- Minimum of 5-7 years of experience in a Site Reliability
  Engineering and/or Platform Engineering role, with progressively
  increasing scope of responsibility.
- Extensive hands-on experience and knowledge of the following
  technologies:
  - Red Hat OpenShift, including operators, routing/ingress, and
    cluster management
  - Azure cloud services and solutions
  - Messaging platforms like AMQ, Kafka, Redis
  - HashiCorp Vault
  - Scripting languages like Bash, Python, Go, PowerShell
  - Observability tools like Datadog, Grafana, Prometheus
- Strong scripting and automation skills in Bash and Python.
- Strong prior experience with observability tools and connecting
  trends, incidents, and alerts with actions.
- Prior experience troubleshooting complex production issues using
  logs, metrics, traces, packet captures, and Kubernetes debugging
  tools.
- Prior experience working in a heavily audited environment is
  preferred, with focus on mitigating risks and ensuring compliance
  with policies and procedures.
- Knowledge of enterprise-level technologies and concepts.
- Ability to multi-task in a dynamic environment while continuing to
  progress on longer term projects.
- Ability to communicate well, both orally and in writing, including
  producing thorough documentation of all work.
- Ability to conduct independent technical research and share
  results with management and/or peers.
- Ability to listen and integrate ideas from different views, build
  and maintain respectful relationships, collaborate with others,
  and resolve conflicts constructively.
- Proof of eligibility to work in the United States.