Devsu

Information Technology and Services

Senior Site Reliability Engineer (SRE) - (GCP)

Ontario, Canada FULL TIME Remote Friendly

The Brief

“Senior Site Reliability Engineer (SRE) - (GCP) at Devsu. Skills: Site Reliability Engineering, Monitoring, Observability, Grafana, Kubernetes, GCP. Own and operate the monitoring and observability stack across on-prem and GCP environments. Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications”

What You'll Achieve.

Ensure high signal-to-noise ratio for alerts; Improve visibility into system health, performance, and reliability; Improve availability, performance, and resilience; Reduce repeat incidents

Industry & Context.

Information Technology and Services

Problems you'll solve

Troubleshoot application issues; Troubleshoot platform-level issues

Eligibility Requirements

Ability to work in PST time zone, Ability to participate in an on-call rotation that includes coverage for one weekend day

What They're Looking For.

Must Have

Experience as a Site Reliability Engineer or Reliability Engineer, Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting), Solid experience with monitoring and observability systems, Production experience operating Kubernetes environments, Experience supporting systems in GCP and on-prem environments, Linux systems and troubleshooting skills, Fluent English (written and spoken), Ability to work in PST time zone, Ability to participate in an on-call rotation that includes coverage for one weekend day

Nice to Have

Experience supporting application teams during SEV incidents, Knowledge of capacity planning and performance tuning, Scripting skills (Python, Bash, etc.), Experience with hybrid infrastructure environments

What You'll Do.

Own and operate the monitoring and observability stack across on-prem and GCP environments

Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications

Define, tune, and maintain alerts to ensure high signal-to-noise ratio

Establish observability standards and best practices across teams

Improve visibility into system health, performance, and reliability

Apply SRE principles to improve availability, performance, and resilience

Define and track SLIs, SLOs, and error budgets

Participate in on-call rotations and SEV incident response

Lead or contribute to incident investigations and root cause analysis (RCA)

Drive preventative actions to reduce repeat incidents

Support and monitor Kubernetes environments (GKE and on-prem clusters)

Monitor cluster health, capacity, and resource utilization

Troubleshoot platform-level issues impacting application reliability

Collaborate with Platform and Engineering teams on reliability improvements

Provide L2/L3 application support coverage during support team resource shortages, high-severity incidents (SEVs), and peak support periods or escalations

Triage and troubleshoot application issues using existing runbooks and dashboards

Collaborate with Application Support and Engineering teams during incidents

and resolutions are documented in ServiceNow (SNOW)

How You'll Work.

Team & Collaboration

Collaborate with Platform and Engineering teams on reliability improvements; Collaborate with Application Support and Engineering teams during incidents

Communication Scope

Fluent English (written and spoken)

Full Job Description

We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP). This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

### Responsibilities

Monitoring & Observability (Core Focus)

* Own and operate the monitoring and observability stack across on-prem and GCP environments
* Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
* Define, tune, and maintain alerts to ensure high signal-to-noise ratio
* Establish observability standards and best practices across teams
* Improve visibility into system health, performance, and reliability

Site Reliability Engineering

* Apply SRE principles to improve availability, performance, and resilience
* Define and track SLIs, SLOs, and error budgets
* Participate in on-call rotations and SEV incident response
* Lead or contribute to incident investigations and root cause analysis (RCA)
* Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

* Support and monitor Kubernetes environments (GKE and on-prem clusters)
* Monitor cluster health, capacity, and resource utilization
* Troubleshoot platform-level issues impacting application reliability
* Collaborate with Platform and Engineering teams on reliability improvements

### Secondary Responsibilities (Backup Application Support)

* These responsibilities are activated as needed, not part of day-to-day operations.
* Provide L2/L3 application support coverage during:
  * Support team resource shortages
  * High-severity incidents
