Devsu
Information Technology and Services
Senior Site Reliability Engineer (SRE) - (GCP)
“Senior Site Reliability Engineer (SRE) - (GCP) at Devsu. Skills: Site Reliability Engineering, Monitoring, Observability, Grafana, Kubernetes, GCP. Own and operate the monitoring and observability stack across on-prem and GCP environments. Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications”
What You'll Achieve.
Ensure high signal-to-noise ratio for alerts; Improve visibility into system health, performance, and reliability; Improve availability, performance, and resilience; Reduce repeat incidents
Industry & Context.
Troubleshoot application issues; Troubleshoot platform-level issues
Ability to work in PST time zone, Ability to participate in an on-call rotation that includes coverage for one weekend day
What They're Looking For.
Must Have
Experience as a Site Reliability Engineer or Reliability Engineer, Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting), Solid experience with monitoring and observability systems, Production experience operating Kubernetes environments, Experience supporting systems in GCP and on-prem environments, Linux systems and troubleshooting skills, Fluent English (written and spoken), Ability to work in PST time zone, Ability to participate in an on-call rotation that includes coverage for one weekend day
Nice to Have
Experience supporting application teams during SEV incidents, Knowledge of capacity planning and performance tuning, Scripting skills (Python, Bash, etc.), Experience with hybrid infrastructure environments
What You'll Do.
Own and operate the monitoring and observability stack across on-prem and GCP environments
Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
Define, tune, and maintain alerts to ensure high signal-to-noise ratio
Establish observability standards and best practices across teams
Improve visibility into system health, performance, and reliability
Apply SRE principles to improve availability, performance, and resilience
Define and track SLIs, SLOs, and error budgets
Participate in on-call rotations and SEV incident response
Lead or contribute to incident investigations and root cause analysis (RCA)
Drive preventative actions to reduce repeat incidents
Support and monitor Kubernetes environments (GKE and on-prem clusters)
Monitor cluster health, capacity, and resource utilization
Troubleshoot platform-level issues impacting application reliability
Collaborate with Platform and Engineering teams on reliability improvements
Provide L2/L3 application support coverage during support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations
Triage and troubleshoot application issues using existing runbooks and dashboards
Collaborate with Application Support and Engineering teams during incidents
and resolutions are documented in ServiceNow (SNOW)
How You'll Work.
Team & Collaboration
Collaborate with Platform and Engineering teams on reliability improvements; Collaborate with Application Support and Engineering teams during incidents
Communication Scope
Fluent English (written and spoken)
Full Job Description
We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP). This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

### Responsibilities

Monitoring & Observability (Core Focus)

* Own and operate the monitoring and observability stack across on-prem and GCP environments
* Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
* Define, tune, and maintain alerts to ensure high signal-to-noise ratio
* Establish observability standards and best practices across teams
* Improve visibility into system health, performance, and reliability

Site Reliability Engineering

* Apply SRE principles to improve availability, performance, and resilience
* Define and track SLIs, SLOs, and error budgets
* Participate in on-call rotations and SEV incident response
* Lead or contribute to incident investigations and root cause analysis (RCA)
* Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

* Support and monitor Kubernetes environments (GKE and on-prem clusters)
* Monitor cluster health, capacity, and resource utilization
* Troubleshoot platform-level issues impacting application reliability
* Collaborate with Platform and Engineering teams on reliability improvements

### Secondary Responsibilities (Backup Application Support)

* These responsibilities are activated as needed, not part of day-to-day operations.
* Provide L2/L3 application support coverage during:
  * Support team resource shortages
  * High-severity incidents