Company
Technology
Senior Site Reliability
“Senior Site Reliability. Skills: Site Reliability Engineering, cloud infrastructure, incident management, automation. Own and improve production service reliability.”
What You'll Achieve.
Drive best-in-class service availability
Industry & Context.
Analytical skills; Problem-solving skills; Diagnose complex production issues
On-call rotations, Multiple time zones
What They're Looking For.
Must Have
Bachelor’s degree in Computer Science, experience with cloud-native concepts, Google Cloud Platform (GCP) experience, Kubernetes (GKE) experience, Site Reliability Engineering experience, production incident management experience, monitoring and observability tools experience, reliability testing exposure, resilience engineering exposure, cost optimisation initiatives exposure, excellent analytical skills, excellent problem-solving skills, software development experience, automation experience using Python, automation experience using shell scripts, production cloud infrastructure experience at scale, multi-region production systems experience, high-availability production systems experience, scalability focus, resilience focus, minimising service disruption focus
Nice to Have
ServiceNow experience preferred, Splunk Observability experience preferred, OpenTelemetry experience preferred
What You'll Do.
Own and improve production service reliability, availability, and performance
Participate in incident management
Use and improve incident workflows and tooling
Design, implement, and operate observability solutions
Reduce operational toil
Introduce and drive engineering-led solutions and SRE best practices
Support on-call rotations
Monitor error budgets
Drive and be accountable for best-in-class service availability
How You'll Work.
Team & Collaboration
Connect with people; Connect with teams
Full Job Description
## Responsibilities

- Own and improve the reliability, availability, and performance of production services in Google Cloud (GCP).
- Participate in incident management, including detection, triage, mitigation, escalation, and recovery.
- Use and improve incident workflows and tooling (e.g., ServiceNow) to ensure clear ownership and timely communication.
- Design, implement, and operate observability solutions including monitoring, logging, tracing, synthetics, and dashboards (e.g., Splunk Observability, OpenTelemetry).
- Reduce operational toil through automation and engineering-led solutions, proactively introducing and driving SRE best practices.
- Support on-call rotations across multiple time zones, contributing to a sustainable 24/7 support model.
- Define, monitor, and report SLIs, SLOs, and error budgets for critical services.
- Drive and be accountable for best-in-class service availability through SRE principles, automation, and proactive reliability engineering.

## Essential skills and/or Certifications

- Bachelor’s degree in Computer Science, Information Technology or related field
- Strong experience with cloud-native concepts and technologies, with a strong preference for Google Cloud Platform (GCP) and Kubernetes (GKE).
- Proven experience with Site Reliability Engineering and production incident management, ideally using platforms such as ServiceNow.
- Experience with monitoring and observability tools, including metrics, logs, traces, and synthetics (e.g., Splunk Observability, OpenTelemetry).
- Exposure to reliability testing, resilience engineering, or cost optimisation initiatives.
- Excellent analytical and problem-solving skills, with the ability to diagnose complex production issues quickly.
- Software development or automation experience using Python, shell scripts, or similar languages.
- Hands-on experience operating production cloud infrastructure at scale.
- Experience managing multi-region, high-availability production systems with a focus on scalability, resilience, and
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.