HavocAI
Cloud
Senior Site Reliability Engineer
“Senior Site Reliability Engineer at HavocAI. Skills: Site Reliability Engineering, distributed systems, cloud platform, observability. Design and evolve reliability architecture.”
What You'll Achieve.
Ensure availability; Ensure performance; Ensure resilience; Improve operational maturity; Build systems that scale; Raise reliability; Raise performance; Raise operational maturity; Ensure services are observable; Ensure services are resilient; Ensure services are scalable; Ensure services are recoverable; Structure incident response; Reduce operational toil; Improve deployment safety; Design systems for operational demands
Industry & Context.
Root cause analysis; Troubleshooting
On-call rotations, U.S. Citizen, Eligible for security clearance
What They're Looking For.
Must Have
7+ years of experience, Experience operating large-scale distributed production systems, Deep understanding of Linux systems, Deep understanding of networking, Deep understanding of cloud infrastructure, Deep understanding of distributed systems fundamentals, Hands-on experience with Kubernetes, Hands-on experience with container orchestration, Programming or scripting experience in Go, Programming or scripting experience in Python, Experience designing observability systems, Experience operating observability systems, Proven ability to lead incident response, Proven ability to drive reliability improvements, Ability to operate calmly under pressure, Must be a U.S. Citizen
Nice to Have
Experience supporting autonomy, Experience supporting robotics, Experience supporting simulation, Experience supporting real-time systems, Experience supporting data-intensive platforms, Familiarity with AWS, Experience with large-scale cloud infrastructure, Experience with chaos engineering, Experience with fault injection, Experience with resilience testing, Knowledge of CI/CD systems, Knowledge of progressive delivery practices, Experience working in high-reliability environments, Experience working in safety-critical environments, Experience working in defense environments, Experience working in mission-critical environments, Experience with Infrastructure as Code tools, Experience with Terraform, Experience with Pulumi, Experience with Prometheus, Experience with Grafana, Experience with OpenTelemetry, Experience with Datadog, Experience with ELK/OpenSearch, Experience with similar observability tools
What You'll Do.
Design reliability architecture
Evolve reliability architecture
Define SRE best practices
Implement SRE best practices
Define capacity planning
Partner with teams to design systems
Identify systemic reliability risks
Mitigate systemic reliability risks
Establish reliability patterns
Lead incident response processes
Conduct post-incident reviews
Conduct root cause analysis
Drive long-term corrective actions
Improve operational readiness
Reduce operational toil
Build culture of ownership
Build culture of accountability
Build culture of continuous improvement
Design observability systems
Implement observability systems
Maintain observability systems
Ensure services are observable
Ensure data pipelines are observable
Ensure services are debuggable
Ensure data pipelines are debuggable
Ensure services are performant
Ensure data pipelines are performant
Drive performance analysis
Drive performance tuning
Improve alert quality
Ensure signals are actionable
Partner with engineering teams to define metrics
Build automation to improve system reliability
Build automation to improve deployment safety
Build automation to improve recovery processes
Partner with DevOps on CI/CD reliability
Partner with Cloud Platform on CI/CD reliability
Partner with DevOps on rollout strategies
Partner with Cloud Platform on rollout strategies
Partner with DevOps on safe deployment patterns
Partner with Cloud Platform on safe deployment patterns
Support Kubernetes-based environments
Improve Kubernetes-based environments
Support containerized workloads
Improve containerized workloads
Contribute to infrastructure-as-code practices
Contribute to platform automation
Define operational standards for cloud infrastructure
Define operational standards for deployment workflows
Define operational standards for production services
Collaborate with security teams
Ensure secure system design
Ensure resilient system design
Participate in disaster recovery planning
Participate in backup strategy
Participate in resilience testing
Maintain operational practices around access control
Maintain operational practices around secrets management
Maintain operational practices around change management
Maintain operational practices around production access
Support secure operations for systems
How You'll Work.
Team & Collaboration
Cloud Platform team; DevOps teams; Data Engineering teams; Autonomy teams; Platform teams; Application teams; Engineering teams; Security teams
Communication Scope
Communication skills
Process & Methodology
Capacity planning
Full Job Description
ABOUT US
Collaborative autonomy is how self-tasking teams of machines will solve hard human problems, and HavocAI is an unquestioned leader in collaborative autonomy. We set the standard for autonomous surface vessels for a wide range of defense and commercial maritime missions. Success requires us to grow quickly, and we’re looking for teammates who are passionate about solving hard problems, about pushing the envelope, and about preventing conflict and saving lives. Ambition is welcome to apply within.

ABOUT THE ROLE
HavocAI is seeking a Senior Site Reliability Engineer with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will serve as a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads. You will work closely with Cloud Platform, DevOps, Data Engineering, and Autonomy teams to establish reliability standards, improve operational maturity, and build systems that scale safely under real-world conditions. The ideal candidate is deeply technical, calm under pressure, and experienced in owning reliability outcomes end to end.

KEY RESPONSIBILITIES

RELIABILITY ENGINEERING & ARCHITECTURE
- Design and evolve reliability architecture for distributed and cloud-hosted systems
- Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning
- Partner with platform and application teams to design systems for reliability, scalability, and operability
- Identify and mitigate systemic reliability risks across infrastructure, applications, services, and data pipelines
- Establish reliability patterns that support autonomy, simulation, and mission-critical cloud workloads

OPERATIONS & INCIDENT MANAGEMENT
- Lead incident response processes, including on-call rotations, escalation paths, and post-incident reviews
- Conduct