Nscale
Technology
Principal Technical Program Manager (TPM) - AI Infrastructure Operations
Neural analysis suggests this role is optimal for Senior candidates.
“Principal Technical Program Manager (TPM) - AI Infrastructure Operations at Nscale. Skills: AI Infrastructure Operations, Technical Program Management, GPU Cloud, Data Center Operations. Own planning, execution, delivery of programs. Drive infrastructure build-outs”
What You'll Achieve.
Achieve Availability Target 97.5%; Achieve Uptime Target 99%; Reduce toil; Improve Mean Time to Recovery
What They're Looking For.
Must Have
5+ years of technical program management experience; foundational understanding of data center infrastructure, distributed systems, Linux, and networking concepts; expertise in modern program management methodologies; experience defining and tracking operational metrics and improving system performance; ability to thrive in a fast-paced environment, manage multiple priorities, and adapt to evolving technical requirements
Nice to Have
PMP certification; experience with data center infrastructure build-outs and hardware commissioning processes; domain knowledge of AI/HPC infrastructure, NVIDIA GPUs, and InfiniBand/RDMA networks; experience in hyperscale or public cloud environments; familiarity with SRE principles, automation tooling, and CI/CD pipelines; Bachelor's or Master's degree in a technical field, or equivalent practical experience
What You'll Do.
Drive infrastructure build-outs
Drive fleet software rollouts
Implement operational tooling
Establish, track, and drive accountability against critical infrastructure KPIs
Develop clear dashboards
Provide leadership visibility
Analyze operational workflows
Optimize operational workflows
Drive standardization of incident management, change management, and postmortem processes
Improve Mean Time to Recovery
Serve as liaison between teams
Translate capacity models
Create infrastructure roadmaps
Integrate new hardware
Meet go-live criteria
Identify technical, schedule, and resource risks
Develop mitigation strategies
Communicate impacts clearly
How You'll Work.
Team & Collaboration
Cross-functional programs; Fleet Operations; Network Operations; SRE teams; Hardware engineering teams; Compute Platform teams; Network engineering teams; Data Center Operations; External vendors
Communication Scope
Presentation skills; Communication skills
Process & Methodology
Agile, Scrum, PMP
Full Job Description
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
Role Overview
As a Technical Program Manager (TPM) for AI Infrastructure Operations, you will be the operational backbone of our high-scale, high-performance AI and High-Performance Computing (HPC) environment. You will be responsible for driving complex, cross-functional programs that ensure the stability, availability, and growth of our cutting-edge GPU fleet and InfiniBand network fabrics. This role requires a blend of deep technical understanding, rigorous program management, and a relentless focus on delivering against key operational metrics (SLAs, Uptime, Availability). You will bridge the gap between engineering execution and strategic business goals, directly impacting our ability to serve customer workloads at scale.
Key Responsibilities
- Program Leadership: Own the planning, execution, and delivery of strategic operational programs, including new data center AI infrastructure build-outs, large-scale fleet software/firmware rollouts, and the implementation of new operational tooling (in partnership with SRE).
- Metrics and Reporting: Establish, track, and drive accountability against critical infrastructure KPIs, specifically focusing
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.