The Boeing Company
Cloud Reliability Manager
“Cloud Reliability Manager at The Boeing Company. Skills: Cloud Reliability, SRE, Cloud Operations, Kubernetes, Observability, Incident Management, Automation. Lead the Cloud Reliability organization, owning Runtime Site Reliability Engineering (SRE) and Cloud Operations. Accountable for the reliability, scalability, and operational excellence of the enterprise runtime and shared cloud platform.”
What You'll Achieve.
Meet enterprise Service Level Objectives (SLOs) and operational Service-Level Agreements (SLAs); Ensure rapid detection and reduced Mean Time to Repair (MTTR); Ensure actionable alerts and proactive detection of platform issues; Implement durable fixes and prevent recurrence; Reduce manual toil; Ensure teams are trained and drills executed regularly; Integrate security, compliance, and change management controls into operational procedures and emergency responses
Industry & Context.
Root Cause Analysis (RCA)
On-call responsibilities, U.S. Person as defined by 22 C.F.R. § 120.62 is required, Shift 1
What They're Looking For.
Must Have
5+ years in cloud operations, SRE, and/or related roles; 3+ years managing technical teams with on-call responsibilities; 3+ years of experience with Kubernetes at scale and multi-cloud runtime platforms (Elastic Kubernetes Service (EKS)/Azure Kubernetes Service (AKS)/Google Kubernetes Engine (GKE)); 3+ years of experience with observability tooling (Prometheus, Grafana, OpenTelemetry, ELK (Elasticsearch, Logstash, Kibana) or EFK (Elasticsearch, Fluentd, Kibana), tracing) and alerting design; Experience owning incident response and improving reliability metrics in production environments; Experience with capacity planning, performance engineering, and disaster recovery at cloud scale; Experience with automation tooling (Terraform, Continuous Integration/Continuous Deployment (CI/CD), operators) and integrating reliability into Infrastructure as Code (IaC) pipelines
Nice to Have
Excellent communication skills, with the ability to run incident command, lead post-incident reviews, and present technical and business impacts to stakeholders; Experience managing both strategic SRE and operational Tier 0/1 teams in a single organization; Experience in chaos engineering, resilience testing, and failure injection frameworks; Experience with policy as code, security incident response, and regulatory compliance needs in cloud operations; Experience with coding/automation in Go, Python, and/or scripting languages and building operational runbooks as code
What You'll Do.
Lead the Cloud Reliability organization, owning Runtime Site Reliability Engineering (SRE) and Cloud Operations
Accountable for the reliability, scalability, and operational excellence of the enterprise runtime and shared cloud platform
Drive reliability automation and post-incident remediation across multi-cloud environments
Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations to meet enterprise Service Level Objectives (SLOs) and operational Service-Level Agreements (SLAs)
Establish and own incident management processes (detection, post-incident reviews, and remediation) to ensure rapid detection and reduced Mean Time to Repair (MTTR)
Drive observability and telemetry strategy (metrics, logs) to ensure actionable alerts and proactive detection of platform issues
Lead capacity planning and disaster recovery orchestration for platform services and multi-cluster fleets
Convert Root Cause Analysis (RCA) outcomes into prioritized engineering work: coordinate with Platform Acceleration, Foundations, Development Experience (DevEx), and Security to implement durable fixes and prevent recurrence
Define and measure operational Key Performance Indicators (KPIs), including mean time to detect (MTTD), mean time to acknowledge (MTTA), mean time to recover (MTTR), and alert noise, and implement automation to reduce manual toil
Own on-call and rotation policies, and ensure teams are trained and drills executed regularly
Ensure security, compliance, and change management controls are integrated into operational procedures and emergency responses
How You'll Work.
Team & Collaboration
Coordinate with Platform Acceleration, Foundations, Development Experience (DevEx), and Security to implement durable fixes and prevent recurrence; Represent Cloud Reliability in architecture reviews, executive incident briefings, and cross-team governance forums
Communication Scope
Excellent communication skills; Run incident command; Lead post-incident reviews; Present technical and business impacts to stakeholders; Executive incident briefings
Process & Methodology
Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.