RapidSOS

public safety AI

Site Reliability Engineering Manager

$185–215k · Boston, United States; New York, New York, United States · Remote Friendly

The Brief

“Site Reliability Engineering Manager at RapidSOS. Skills: site reliability engineering, cloud infrastructure management, team leadership, infrastructure-as-code, Kubernetes, AWS. The role owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE.”

Industry & Context.

public safety AI

Problems you'll solve

failure mode analysis; systemic improvements

Eligibility Requirements

An engineering leadership role, not simply an on-call manager

What They're Looking For.

Must Have

7+ years in SRE, platform engineering, or DevOps, including at least two years where you were responsible for a team and not just your own work

Direct responsibility for Kubernetes and AWS infrastructure in production environments where uptime and resilience are critical

Experience moving a team from reactive ops toward engineering-first reliability practices

Experience working collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production

Ability to write Python and to review production-quality scripts and tooling

Experience applying SLOs, error budgets, and blameless postmortems in practice to improve reliability and drive better engineering decisions

Hands-on familiarity with Terraform/Atlantis, Kubernetes/Helm/ArgoCD, Datadog, Concourse CI/GitHub Actions, RabbitMQ, and AWS (EKS, RDS/Aurora, ElastiCache, VPC networking, IAM, KMS, Route53)

Nice to Have

Kubernetes a plus

What You'll Do.

Keep RapidSOS's cloud infrastructure running reliably, and help product teams get to a place where they can run their own services without routing every operational issue through SRE

Ensure upgrades, capacity planning, node scaling, and testing that multi-region failover actually works

Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard

Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they build

Maintain proactive reliability work: capacity planning, failure mode analysis, and chaos engineering

Run reliability reviews before major launches and organize failure mode exercises with product teams

Drive blameless postmortem practice and ensure every significant incident produces systemic improvements with clear ownership and closure

Run the Tier 1 on-call rotation: scheduling for primary and secondary engineers, coordination with the 3rd-party NOC, and keeping incident escalation processes smooth and manageable

Lead incident command on Sev-1s and keep engineering leadership informed throughout

Lead and grow a high-impact team by mentoring engineers, owning headcount, and thinking ahead about what the team needs as the function grows

Shape the team’s long-term AI strategy for infrastructure and operations by identifying opportunities for AI-driven automation and insight generation, evaluating tooling and workflows, and operationalizing best practices for scalable team-wide usage

Own reserved instance strategy and the team's AWS cost footprint

Track error budgets and SLOs across production services and communicate that picture clearly to engineering and product leadership

Work alongside Platform SRE on bigger infrastructure projects: Gateway API adoption, cross-region architecture

How You'll Work.

Team & Collaboration

Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they build; work collaboratively with engineering teams to proactively improve reliability, scalability, and operational readiness before issues reach production; organize failure mode exercises with product teams; work alongside Platform SRE on bigger infrastructure projects

Communication Scope

Communicate the state of error budgets and SLOs across production services clearly to engineering and product leadership; keep engineering leadership informed throughout Sev-1 incidents

Process & Methodology

Own headcount and think ahead about what the team needs as the function grows

Full Job Description

In the time it takes you to read this job description, RapidSOS will have handled ~1,380 emergencies. At RapidSOS, we are committed to using technology to build a safer, stronger future and working together to save lives. We’re in an exciting phase of growth, welcoming new members from across the globe to our mission-driven, ambitious, and inclusive team. Our work is founded on our values of elevating purpose, inventing tomorrow, delivering with urgency, serving with integrity, and winning together, all of which support a company culture where people can innovate, collaborate, grow, and, above all, make an impact.

RapidSOS is the leading public safety AI company that unlocks mission-critical intelligence for first responders and security teams – enabling faster, smarter and more accurate emergency response. Real-time data from the world’s largest safety network of 700M+ devices, 200+ global enterprises, and 23,000+ federal, state and local agencies fuels the RapidSOS HARMONY AI engine that delivers this intelligence to those who need it most. Learn more at www.RapidSOS.com.

What this role is about:

This is an engineering leadership role, not simply an on-call manager. The SRE Manager owns two things: keeping RapidSOS's cloud infrastructure running reliably, and helping product teams get to a place where they can run their own services without routing every operational issue through SRE. RapidSOS powers real-time emergency response by connecting life-critical data to first responders, so reliability here directly impacts outcomes in moments that matter. You'll lead the SRE Operations team and report to the Director of SRE.

Ensure upgrades, capacity planning, node scaling, and testing that multi-region failover actually works

Drive the IaC foundation in Terraform/Atlantis and champion infrastructure-as-code as a core engineering standard

Partner with Engineering Managers to set SLOs for their services, establish error budgets, and help teams build the habits to operate what they build

How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
