The Boeing Company

Cloud Reliability Manager

$162–233k · FULL TIME
Locations (United States): Seattle, Washington; Dallas, Texas; North Charleston, South Carolina; Chicago, Illinois; El Segundo, California; Mesa, Arizona; San Diego, California; Berkeley, California; Hazelwood, Missouri
The Brief

“Cloud Reliability Manager at The Boeing Company. Skills: Cloud Reliability, SRE, Cloud Operations, Kubernetes, Observability, Incident Management, Automation. Lead the Cloud Reliability organization, owning Runtime Site Reliability Engineering (SRE) and Cloud Operations. Accountable for the reliability, scalability, and operational excellence of the enterprise runtime and shared cloud platform.”

What You'll Achieve.

Meet enterprise Service Level Objectives (SLOs) and operational Service-Level Agreements (SLAs); Ensure rapid detection and reduced Mean Time to Repair (MTTR); Ensure actionable alerts and proactive detection of platform issues; Implement durable fixes and prevent recurrence; Reduce manual toil; Ensure teams are trained and drills executed regularly; Integrate security, compliance, and change management controls into operational procedures and emergency responses

Industry & Context.

Problems you'll solve

Root Cause Analysis (RCA)

Eligibility Requirements

On-call responsibilities; U.S. Person as defined by 22 C.F.R. §120.62 is required; Shift 1

What They're Looking For.

Must Have

5+ years in cloud operations, SRE, and/or related roles; 3+ years managing technical teams with on-call responsibilities; 3+ years of experience with Kubernetes at scale and multi-cloud runtime platforms (Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE)); 3+ years of experience with observability tooling (Prometheus, Grafana, OpenTelemetry, the ELK stack (Elasticsearch, Logstash, Kibana) or EFK stack (Elasticsearch, Fluentd, Kibana), tracing) and alerting design; Experience owning incident response and improving reliability metrics in production environments; Experience with capacity planning, performance engineering, and disaster recovery at cloud scale; Experience with automation tooling (Terraform, Continuous Integration/Continuous Deployment (CI/CD), operators) and integrating reliability into Infrastructure as Code (IaC) pipelines

Nice to Have

Excellent communication skills, with the ability to run incident command, lead post-incident reviews, and present technical and business impacts to stakeholders; Experience managing both strategic SRE and operational Tier 0/1 teams in a single organization; Experience in chaos engineering, resilience testing, and failure injection frameworks; Experience with policy as code, security incident response, and regulatory compliance needs in cloud operations; Experience with coding/automation in Go, Python, and/or scripting languages and building operational runbooks as code

What You'll Do.

Lead the Cloud Reliability organization, owning Runtime Site Reliability Engineering (SRE) and Cloud Operations

Accountable for the reliability, scalability, and operational excellence of the enterprise runtime and shared cloud platform

Drive reliability automation and post-incident remediation across multi-cloud environments

Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations to meet enterprise Service Level Objectives (SLOs) and operational Service-Level Agreements (SLAs)

Establish and own incident management processes (detection, post-incident reviews, and remediation) to ensure rapid detection and reduced Mean Time to Repair (MTTR)

Drive observability and telemetry strategy (metrics, logs) to ensure actionable alerts and proactive detection of platform issues

Lead capacity planning and disaster recovery orchestration for platform services and multi-cluster fleets

Convert Root Cause Analysis (RCA) outcomes into prioritized engineering work: coordinate with Platform Acceleration, Foundations, Development Experience (DevEx), and Security to implement durable fixes and prevent recurrence

Define and measure operational Key Performance Indicators (KPIs) (Mean Time to Detect (MTTD), Mean Time to Acknowledge (MTTA), Mean Time to Recover (MTTR), alert noise) and implement automation to reduce manual toil

Own on-call and rotation policies; ensure teams are trained and drills are executed regularly

Ensure security, compliance, and change management controls are integrated into operational procedures and emergency responses

How You'll Work.

Team & Collaboration

Coordinate with Platform Acceleration, Foundations, Development Experience (DevEx), and Security to implement durable fixes and prevent recurrence; Represent Cloud Reliability in architecture reviews, executive incident briefings, and cross-team governance forums

Communication Scope

Excellent communication skills; Run incident command; Lead post-incident reviews; Present technical and business impacts to stakeholders; Executive incident briefings

Process & Methodology

Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
