Nscale

Technology

Principal Technical Program Manager (TPM) - AI Infrastructure Operations

$220–330k ~AI est. United States Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Principal Technical Program Manager (TPM) - AI Infrastructure Operations at Nscale. Skills: AI Infrastructure Operations, Technical Program Management, GPU Cloud, Data Center Operations. Own planning, execution, delivery of programs. Drive infrastructure build-outs”

What You'll Achieve.

Achieve Availability Target 97.5%; Achieve Uptime Target 99%; Reduce toil; Improve Mean Time to Recovery

Industry & Context.

Technology

What They're Looking For.

Must Have

5+ years Technical Program Management, Foundational data center infrastructure understanding, Foundational distributed systems understanding, Foundational Linux understanding, Foundational networking concepts understanding, Modern program management methodologies expertise, Define operational metrics experience, Track operational metrics experience, Improve system performance experience, Thrive in fast-paced environment, Manage multiple priorities, Adapt to evolving technical requirements

Nice to Have

PMP certification preferred, Data center infrastructure build-outs experience, Hardware commissioning processes experience, AI/HPC infrastructure domain knowledge, NVIDIA GPUs knowledge, InfiniBand/RDMA networks knowledge, Hyperscale environment experience, Public cloud environment experience, SRE principles familiarity, Automation tooling familiarity, CI/CD pipelines familiarity, Bachelor's degree in technical field, Master's degree in technical field, Equivalent practical experience

What You'll Do.

Drive infrastructure build-outs

Drive fleet software rollouts

Implement operational tooling

Establish critical infrastructure KPIs

Track critical infrastructure KPIs

Drive accountability against KPIs

Develop clear dashboards

Provide leadership visibility

Analyze operational workflows

Optimize operational workflows

Drive standardization of incident management

Drive standardization of change management

Drive standardization of postmortem processes

Improve Mean Time to Recovery

Serve as liaison between teams

Translate capacity models

Create infrastructure roadmaps

Integrate new hardware

Meet go-live criteria

Identify technical risks

Identify schedule risks

Identify resource risks

Develop mitigation strategies

Communicate impacts clearly

How You'll Work.

Team & Collaboration

Cross-functional programs; Fleet Operations; Network Operations; SRE teams; Hardware engineering teams; Compute Platform teams; Network engineering teams; Data Center Operations; External vendors

Communication Scope

Presentation skills; Communication skills

Process & Methodology

Agile, Scrum, PMP

Full Job Description

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

Role Overview

As a Technical Program Manager (TPM) for AI Infrastructure Operations, you will be the operational backbone of our high-scale, high-performance AI and High-Performance Computing (HPC) environment. You will be responsible for driving complex, cross-functional programs that ensure the stability, availability, and growth of our cutting-edge GPU fleet and InfiniBand network fabrics. This role requires a blend of deep technical understanding, rigorous program management, and a relentless focus on delivering against key operational metrics (SLAs, Uptime, Availability). You will bridge the gap between engineering execution and strategic business goals, directly impacting our ability to serve customer workloads at scale.

Key Responsibilities

Program Leadership: Own the planning, execution, and delivery of strategic operational programs, including new data center AI infrastructure build-outs, large-scale fleet software/firmware rollouts, and the implementation of new operational tooling (in partnership with SRE).

Metrics and Reporting: Establish, track, and drive accountability against critical infrastructure KPIs, specifically focusing


How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
