Monstro

Site Reliability Engineer (SRE)

Hybrid Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for mid-level candidates.

The Brief

“Site Reliability Engineer (SRE) at Monstro. Skills: Site Reliability Engineering, GCP, Observability, Reliability, Automation, Incident Response, Kubernetes, IaC, Python, Go, Bash. Own the reliability and observability of the platform end-to-end. Write the dashboards”

Industry & Context.

Problems you'll solve

Fixing the system, not the symptom; enjoying the puzzle of understanding complex environments

Eligibility Requirements

On-call rotation; first responder for production alerts; incident response

What They're Looking For.

Must Have

  • Solid production experience on GCP (or comparable AWS/Azure depth with willingness to ramp on GCP fast)
  • Comfortable on-call: you’ve run incidents, written postmortems, and shipped the action items
  • Observability fundamentals: SLOs, log-based metrics, alert hygiene, dashboard discipline
  • Working knowledge of Kubernetes, API gateways, identity systems, and at least one IaC tool
  • Scripting / coding fluency (Python, Go, Bash) for automation and tooling
  • Good written communication: handoffs, postmortems, and runbooks are part of the job
  • Bias toward fixing the system, not the symptom

Nice to Have

  • Apigee or another enterprise API gateway in production
  • BigQuery for log analytics or audit
  • Experience standing up observability from scratch, not just maintaining inherited dashboards
  • SOC2 or similar compliance environments

What You'll Do.

Own the reliability and observability of the platform end-to-end

Build the automation that kills toil

Take your turn on the on-call rotation

Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability

Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics

Tune alert routing so every page is actionable — kill the rest

Instrument services for distributed tracing and structured logging; push back on services that ship without it

Own error budgets and use them to prioritize reliability work over feature work when burned

Reduce toil: automate the top recurring page from the previous quarter

Maintain runbooks so every page maps to one within a cycle of first occurrence

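First responder for production alerts across monitoring, API gateway, edge defense, and CI

Triage severity, run the incident bridge, and drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)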

Own internal and external incident comms during your shift

Drive postmortems to closure with action items tracked as audit evidence

Clean written handoffs at end of shift

How You'll Work.

Team & Collaboration

Partner with the Delivery Manager; work directly with clients

Communication Scope

  • Good written communication: handoffs, postmortems, and runbooks are part of the job
  • Internal and external incident comms during your shift
  • Clean written handoffs at end of shift

Process & Methodology

Use error budgets to prioritize reliability work over feature work when the budget is burned; track action items from postmortems as audit evidence
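
For candidates less familiar with the idea, here is a minimal Python sketch of how an error-budget check of this kind might work. It is an illustration only, not Monstro's tooling: the function names, the 99.9% SLO target, and the "burned at zero" policy are all assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the availability objective as a fraction, e.g. 0.999.
    All names and values here are hypothetical, chosen for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so no budget was consumed
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_prioritize_reliability(remaining: float) -> bool:
    """Assumed policy: reliability work wins once the budget is fully burned."""
    return remaining <= 0.0


if __name__ == "__main__":
    # Assumed example window: 1,000,000 requests against a 99.9% SLO.
    remaining = error_budget_remaining(0.999, 1_000_000, 1_200)
    print(f"budget remaining: {remaining:.2f}")
    print("prioritize reliability:", should_prioritize_reliability(remaining))
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated, so 1,200 failures overspends the budget and the assumed policy would favor reliability work over feature work.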

Full Job Description

About the Role

Monstro is building a secure, multi-tenant platform on Google Cloud, and we’re hiring a Site Reliability Engineer to own the reliability and observability of that platform end-to-end. This is a hands-on role for someone who wants to do real SRE work, not a rebrand of L1 support. You’ll write the dashboards, define the SLOs, build the automation that kills toil, and take your turn on the on-call rotation that proves it all works. When something breaks at 2 AM, you’re the person who keeps it running; when nothing’s breaking, you’re the person making sure the next break is smaller, shorter, or doesn’t happen at all.

What You’ll Do

Observability and reliability engineering

  • Define and maintain SLOs and SLIs for our tier-1 services: API gateway, application services, identity, and edge availability
  • Build canonical dashboards and alerts in Google Cloud Monitoring, backed by structured logs and BigQuery log analytics
  • Tune alert routing so every page is actionable; kill the rest
  • Instrument services for distributed tracing and structured logging; push back on services that ship without it
  • Own error budgets and use them to prioritize reliability work over feature work when burned
  • Reduce toil: automate the top recurring page from the previous quarter
  • Maintain runbooks so every page maps to one within a cycle of first occurrence

On-call rotation and incident response

  • First responder for production alerts across monitoring, API gateway, edge defense, and CI
  • Triage severity, run the incident bridge, drive mitigation (revision rollback, traffic shift, scaling, edge block, credential rotation)
  • Own internal and external incident comms during your shift
  • Drive postmortems to closure with action items tracked as audit evidence
  • Clean written handoffs at end of shift

Our stack

  • Google Cloud Platform across multiple environments
  • Apigee X for API management
  • Cloud Run, GKE Autopilot, Cloud SQL
  • Identity Platform for customer identity
  • Cloud Armor, Cloud IDS, Security Command Cen

How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
