HavocAI

Cloud

Senior Site Reliability Engineer

$150–185k, Full Time, Remote Friendly

The Brief

“Senior Site Reliability Engineer at HavocAI. Skills: Site Reliability Engineering, distributed systems, cloud platform, observability. Design and evolve reliability architecture.”

What You'll Achieve.

Ensure availability; Ensure performance; Ensure resilience; Improve operational maturity; Build systems that scale; Raise reliability; Raise performance; Raise operational maturity; Ensure services are observable; Ensure services are resilient; Ensure services are scalable; Ensure services are recoverable; Structure incident response; Reduce operational toil; Improve deployment safety; Design systems for operational demands

Industry & Context.

Cloud

Problems you'll solve

Root cause analysis; Troubleshooting

Eligibility Requirements

On-call rotations, U.S. citizen, eligible for security clearance

What They're Looking For.

Must Have

7+ years of experience, Experience operating large-scale distributed production systems, Deep understanding of Linux systems, Deep understanding of networking, Deep understanding of cloud infrastructure, Deep understanding of distributed systems fundamentals, Hands-on experience with Kubernetes, Hands-on experience with container orchestration, Programming or scripting experience in Go, Programming or scripting experience in Python, Experience designing observability systems, Experience operating observability systems, Proven ability to lead incident response, Proven ability to drive reliability improvements, Ability to operate calmly under pressure, Must be a U.S. citizen

Nice to Have

Experience supporting autonomy, Experience supporting robotics, Experience supporting simulation, Experience supporting real-time systems, Experience supporting data-intensive platforms, Familiarity with AWS, Experience with large-scale cloud infrastructure, Experience with chaos engineering, Experience with fault injection, Experience with resilience testing, Knowledge of CI/CD systems, Knowledge of progressive delivery practices, Experience working in high-reliability environments, Experience working in safety-critical environments, Experience working in defense environments, Experience working in mission-critical environments, Experience with Infrastructure as Code tools, Experience with Terraform, Experience with Pulumi, Experience with Prometheus, Experience with Grafana, Experience with OpenTelemetry, Experience with Datadog, Experience with ELK/OpenSearch, Experience with similar observability tools

What You'll Do.

Design reliability architecture

Evolve reliability architecture

Define SRE best practices

Implement SRE best practices

Define capacity planning

Partner with teams to design systems

Identify systemic reliability risks

Mitigate systemic reliability risks

Establish reliability patterns

Lead incident response processes

Conduct post-incident reviews

Conduct root cause analysis

Drive long-term corrective actions

Improve operational readiness

Reduce operational toil

Build culture of ownership

Build culture of accountability

Build culture of continuous improvement

Design observability systems

Implement observability systems

Maintain observability systems

Ensure services are observable

Ensure data pipelines are observable

Ensure services are debuggable

Ensure data pipelines are debuggable

Ensure services are performant

Ensure data pipelines are performant

Drive performance analysis

Drive performance tuning

Improve alert quality

Ensure signals are actionable

Partner with engineering teams to define metrics

Build automation to improve system reliability

Build automation to improve deployment safety

Build automation to improve recovery processes

Partner with DevOps on CI/CD reliability

Partner with Cloud Platform on CI/CD reliability

Partner with DevOps on rollout strategies

Partner with Cloud Platform on rollout strategies

Partner with DevOps on safe deployment patterns

Partner with Cloud Platform on safe deployment patterns

Support Kubernetes-based environments

Improve Kubernetes-based environments

Support containerized workloads

Improve containerized workloads

Contribute to infrastructure-as-code practices

Contribute to platform automation

Define operational standards for cloud infrastructure

Define operational standards for deployment workflows

Define operational standards for production services

Collaborate with security teams

Ensure secure system design

Ensure resilient system design

Participate in disaster recovery planning

Participate in backup strategy

Participate in resilience testing

Maintain operational practices around access control

Maintain operational practices around secrets management

Maintain operational practices around change management

Maintain operational practices around production access

Support secure operations for systems

How You'll Work.

Team & Collaboration

Cloud Platform team; DevOps teams; Data Engineering teams; Autonomy teams; Platform teams; Application teams; Engineering teams; Security teams

Communication Scope

Communication skills

Process & Methodology

Capacity planning

Full Job Description

ABOUT US

Collaborative autonomy is how self-tasking teams of machines will solve hard human problems, and HavocAI is an unquestioned leader in collaborative autonomy. We set the standard for autonomous surface vessels for a wide range of defense and commercial maritime missions. Success requires us to grow quickly, and we’re looking for teammates who are passionate about solving hard problems, about pushing the envelope, and about preventing conflict and saving lives. Ambition is welcome to apply within.

ABOUT THE ROLE

HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will serve as a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads.

You will work closely with Cloud Platform, DevOps, Data Engineering, and Autonomy teams to establish reliability standards, improve operational maturity, and build systems that scale safely under real-world conditions. The ideal candidate is deeply technical, calm under pressure, and experienced in owning reliability outcomes end to end.

KEY RESPONSIBILITIES

RELIABILITY ENGINEERING & ARCHITECTURE

- Design and evolve reliability architecture for distributed and cloud-hosted systems
- Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
- Partner with platform and application teams to design systems for reliability, scalability, and operability
- Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
- Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads

OPERATIONS & INCIDENT MANAGEMENT

- Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
- Conduct

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
