Amazon Data Services, Inc.

Technology

Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations

$700–2010k Seattle, Washington, United States FULL TIME
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations at Amazon Data Services, Inc. Skills: AI/ML Hardware, Data Center Operations, Technical Program Management. Own hardware health and stability metrics. Establish KPIs and routines”

What You'll Achieve.

Ensure GenAI infrastructure delivers value; Improve operational health metrics; Accelerate capacity delivery; Maintain highest operational health standards; Optimize new hardware deployments; Remediate stability issues; Establish scalable best practices; Ensure quality gates are met; Ensure customer commitments are met

Industry & Context.

Technology
Problems you'll solve

Root cause analysis; Error correction; Deep-dive analyses; Troubleshooting; Problem identification; Problem remediation

What They're Looking For.

Must Have

5+ years technical program management, 7+ years working with engineering teams, Root cause analysis experience, Experience leading process improvements, Written and verbal communication skills

Nice to Have

Technical account management experience, Business relationship management experience, Consulting experience, Knowledge of Six Sigma tools, Knowledge of Lean techniques, PMP or similar standards knowledge, Server technologies experience, UltraServer infrastructure deployments experience, High-performance computing infrastructure deployments experience, AI/ML infrastructure deployments experience

What You'll Do.

Own hardware health and stability metrics

Establish KPIs and routines

Provide real-time visibility

Drive deep-dive analyses on failures

Drive systematic improvements

Lead cross-functional investigations

Lead post-mortem processes

Translate lessons learned

Develop hardware health scorecards

Inform leadership decisions

Manage complex infrastructure projects

Involve hardware engineering

Involve data center operations

Involve software teams

Establish program schedules

Maintain program schedules

Establish resource plans

Maintain resource plans

Facilitate technical deep dives

Troubleshoot diagnostic issues

Troubleshoot repair issues

Accelerate project delivery

Eliminate non-value-add activities

Optimize deployment velocity

Serve as operational point of contact

Summarize platform operational status

Build trusted advisor relationships

Understand operational needs

Understand technical challenges

Translate operational feedback

Translate customer requirements

Provide strategic technical guidance

Advocate for operational excellence

Integrate hardware health considerations

Influence design decisions

Collaborate on quality gates

Implement appropriate signals

Ensure customer commitments are met

Drive alignment across stakeholders

Present technical assessments

Present recommendations

Articulate trade-offs

Articulate business impact

How You'll Work.

Team & Collaboration

Cross-functional initiatives; Cross-functional teams; Hardware engineering teams; Data center operations teams; Service teams; Software teams; Monitoring teams; Automation teams; Diverse stakeholders

Communication Scope

Written communication; Verbal communication; Technical presentations; Leadership communication

Process & Methodology

Technical program management, Program schedules, Budget management, Resource planning, Risk mitigation, Process improvement, Project delivery acceleration

Full Job Description

The Central Operations team within Amazon Web Services (AWS) Infrastructure is seeking a Senior Technical Program Manager to drive the health, stability, and operational excellence of new hardware deployments across our global data center fleet. This role uniquely blends technical program management with strategic account management to ensure our GenAI and high-performance computing infrastructure delivers maximum value to customers.

As a Sr. TPM, you will be the technical advocate and strategic advisor for operational support of new AI/ML hardware platforms. You will serve as the central owner of operational health (failure rate, repair efficacy, repair dwell time, break/fix process improvement) while driving cross-functional initiatives to improve these key performance indicators. You will work at the intersection of hardware engineering, data center operations, and service teams like EC2, translating complex technical data into actionable insights and leading programs that accelerate capacity delivery while maintaining the highest standards of operational health.

This is not a sales role, but rather an opportunity to be the 'voice of the customer' and the 'voice of operations' for critical infrastructure that powers AWS's most demanding workloads. You will craft and execute strategies to optimize new hardware deployments, proactively identify and remediate stability issues, and establish best practices that scale across AWS's global infrastructure.

Key job responsibilities

Hardware Health & Stability Leadership

- Own the end-to-end health and stability metrics for new AI/ML hardware platforms, establishing KPIs and routines that provide real-time visibility into operational performance
- Drive deep-dive analyses on hardware failures to identify root causes and drive systematic improvements
- Lead cross-functional investigations, experiments, and post-mortem processes, ensuring lessons learned translate into preventive measures and design improvements
- Develop and
