Vapi

Technology

Member of Technical Staff, Site Reliability Engineer

$200–270k San Francisco, California, United States FULL TIME

The Brief

“Member of Technical Staff, Site Reliability Engineer at Vapi. Skills: Site Reliability Engineering, Incident Management, Observability, Kubernetes Ops. Join the oncall rotation. Walk stability-gap incidents.”

What You'll Achieve.

99.99% call completion; Improve p99 call completion; Reduce MTTR

Industry & Context.

Technology

Problems you'll solve

Root cause analysis

Eligibility Requirements

Oncall rotation

What They're Looking For.

Must Have

Run incident command, Own postmortem discipline, Operate SLOs and error budgets, Do capacity planning, Do load testing, Fluent in Kubernetes production ops, Know backpressure and autoscaling patterns

Nice to Have

Ship code in Go or TypeScript, Real-time / latency-sensitive product background

What You'll Do.

Walk stability-gap incidents

Turn patterns into backlog

Stand up error budgets

Stand up SLO-based alerting

Ship platform service

Own postmortem process

Drive measurable improvement

Full Job Description

Voice AI that resolves, not transfers. Most phone systems trap callers in menus and scripts. Vapi is the platform for deploying voice agents that know your business and can listen, adapt, and resolve in minutes.

- The numbers: 1 billion calls. 1 million developers. 10x enterprise ARR growth.
- The customers: Amazon Ring, ServiceTitan, New York Life, Intuit, Kavak, and thousands more, from YC startups to the Fortune 500.
- The news: a $50M Series B led by Peak XV Partners, with Bessemer Venture Partners, Kleiner Perkins, M12 (Microsoft's Venture Fund), Y Combinator, and our earlier backers. Total raised: $72M.

WHY WE’RE HIRING THIS ROLE:

- 99.99% call completion is the number this role drives. Vapi runs live phone calls, so a p99 spike means callers drop. We’ve had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
- This is not a bash-and-YAML role. You’ll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling. Capacity planning, load testing, and KEDA-based autoscaling for Vapi’s wscaler and workerpool-cron-scaler are on your plate.

WHAT YOU’LL DO:

- 30 Day: Join the oncall rotation. Walk the 15 stability-gap incidents and turn the patterns into a prioritized reliability backlog. Define the first set of SLOs for the call-completion path.
- 60 Day: Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services. Run the first proper load test against provider rate limits and per-org concurrency. Tune autoscaling for wscaler / workerpool-cron-scaler.
- 90 Day: Ship a real platform service (capacity forecaster, auto-remediation, or oncall tooling) in Go or TypeScript. Own the postmortem process. Drive a measurable improvement in p99 call completion or MTTR.

WHO YOU ARE:

Must-haves

- You’ve run incident command and postmortem disciplin

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
