Amazon Data Services, Inc.
Technology
Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations
“Sr. Technical Program Manager - AI/ML Hardware Health & Stability, Global Data Center Operations at Amazon Data Services, Inc. Skills: AI/ML Hardware, Data Center Operations, Technical Program Management. Own hardware health and stability metrics. Establish KPIs and routines”
What You'll Achieve.
Ensure GenAI infrastructure delivers value; Improve operational health metrics; Accelerate capacity delivery; Maintain highest operational health standards; Optimize new hardware deployments; Remediate stability issues; Establish scalable best practices; Ensure quality gates are met; Ensure customer commitments are met
Industry & Context.
Root cause analysis; Error correction; Deep-dive analyses; Troubleshooting; Problem identification; Problem remediation
What They're Looking For.
Must Have
5+ years technical program management; 7+ years working with engineering teams; Root cause analysis experience; Experience leading process improvements; Written and verbal communication skills
Nice to Have
Technical account management experience; Business relationship management experience; Consulting experience; Knowledge of Six Sigma tools; Knowledge of Lean techniques; PMP or similar standards knowledge; Server technologies experience; UltraServer infrastructure deployments experience; High-performance computing infrastructure deployments experience; AI/ML infrastructure deployments experience
What You'll Do.
Own hardware health and stability metrics
Establish KPIs and routines
Provide real-time visibility
Drive deep-dive analyses on failures
Drive systematic improvements
Lead cross-functional investigations
Lead post-mortem processes
Translate lessons learned
Develop hardware health scorecards
Inform leadership decisions
Manage complex infrastructure projects
Involve hardware engineering
Involve data center operations
Involve software teams
Establish program schedules
Maintain program schedules
Establish resource plans
Maintain resource plans
Facilitate technical deep dives
Troubleshoot diagnostic issues
Troubleshoot repair issues
Accelerate project delivery
Eliminate non-value-add activities
Optimize deployment velocity
Serve as operational point of contact
Summarize platform operational status
Build trusted advisor relationships
Understand operational needs
Understand technical challenges
Translate operational feedback
Translate customer requirements
Provide strategic technical guidance
Advocate for operational excellence
Integrate hardware health considerations
Influence design decisions
Collaborate on quality gates
Implement appropriate signals
Ensure customer commitments are met
Drive alignment across stakeholders
Present technical assessments
Present recommendations
Articulate trade-offs
Articulate business impact
How You'll Work.
Team & Collaboration
Cross-functional initiatives; Cross-functional teams; Hardware engineering teams; Data center operations teams; Service teams; Software teams; Monitoring teams; Automation teams; Diverse stakeholders
Communication Scope
Written communication; Verbal communication; Technical presentations; Leadership communication
Process & Methodology
Technical program management; Program schedules; Budget management; Resource planning; Risk mitigation; Process improvement; Project delivery acceleration
Full Job Description
The Central Operations team within Amazon Web Services (AWS) Infrastructure is seeking a Senior Technical Program Manager to drive the health, stability, and operational excellence of new hardware deployments across our global data center fleet. This role uniquely blends technical program management with strategic account management to ensure our GenAI and high-performance computing infrastructure delivers maximum value to customers.

As a Sr. TPM, you will be the technical advocate and strategic advisor for operational support of new AI/ML hardware platforms. You will serve as the central owner of operational health (failure rate, repair efficacy, repair dwell time, break/fix process improvement) while driving cross-functional initiatives to improve these key performance indicators. You will work at the intersection of hardware engineering, data center operations, and service teams like EC2, translating complex technical data into actionable insights and leading programs that accelerate capacity delivery while maintaining the highest standards of operational health.

This is not a sales role, but rather an opportunity to be the 'voice of the customer' and the 'voice of operations' for critical infrastructure that powers AWS's most demanding workloads. You will craft and execute strategies to optimize new hardware deployments, proactively identify and remediate stability issues, and establish best practices that scale across AWS's global infrastructure.

Key job responsibilities

Hardware Health & Stability Leadership
- Own the end-to-end health and stability metrics for new AI/ML hardware platforms, establishing KPIs and routines that provide real-time visibility into operational performance
- Drive deep-dive analyses on hardware failures to identify root causes and drive systematic improvements
- Lead cross-functional investigations, experiments, and post-mortem processes, ensuring lessons learned translate into preventive measures and design improvements
- Develop and