NVIDIA

deep learning

Senior Technical Program Manager, Cloud Infrastructure, Observability and Systems Monitoring

$200–380k · Santa Clara, California, United States · Full Time · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Technical Program Manager, Cloud Infrastructure, Observability and Systems Monitoring at NVIDIA. Skills: Technical Program Management, Cloud Infrastructure, Observability, Systems Monitoring, Telemetry. Lead important programs focused on telemetry and data center fleet health & management. Develop core capabilities for DGX Cloud”

What You'll Achieve.

Provide outstanding value to our DGX Cloud customers; develop unified telemetry mentorship and confirm operational readiness worldwide; ensure requirements are committed, delivered, and ingested into a centralized telemetry platform to enable DGXC operations; ensure telemetry requirements for advanced tenants are coordinated; establish standard telemetry operations mentorship across NVIDIA; verify device integrity, authenticity, and trust across the accelerated computing ecosystem; drive process improvements; automate manual workflows; ensure cross-functional visibility on overall program progress

Industry & Context.

deep learning

Problems you'll solve

Troubleshooting; debugging; resolving misaligned requirements; strategic thinking; tactical thinking

What They're Looking For.

Must Have

  • 12+ years of technical program management experience, specifically driving the planning and execution of large-scale engineering, cloud infrastructure, and observability programs within a matrixed organization
  • Extensive practical experience in managing cloud infrastructure, ideally acquired from employment at a leading Cloud Service Provider (CSP)
  • Proven ability to act as the interface between cross-functional organizations, effectively managing complex feedback loops and resolving misaligned requirements
  • Expert-level proficiency with Jira, Smartsheet, or similar program management tools, with the ability to confidently guide engineering teams on their effective use within an Agile framework
  • Outstanding strategic and tactical thinking abilities, coupled with a capacity to build consensus and drive program success across diverse business units
  • Excellent communication and technical presentation skills, particularly for executive audiences
  • BS or MS in Electrical Engineering or Computer Science, or equivalent experience

Nice to Have

  • Comprehensive knowledge of NVIDIA architectures and interconnects, encompassing deployment, bring-up, and telemetry requirements for GPUs, NVLink, and InfiniBand
  • Experience with technologies like OpenTelemetry (OTel), Grafana, WarpStream, VictoriaMetrics, and Loki
  • Familiarity with cloud platform architecture, cloud-native services, and Kubernetes
  • A highly enthusiastic, upbeat, responsive, and passionate individual who actively identifies process improvement opportunities and guides teams through ambiguity

What You'll Do.

Lead important programs focused on telemetry and data center fleet health & management

Develop core capabilities for DGX Cloud

Ensure operations and advanced tenants receive the telemetry required to troubleshoot, debug, and manage AI infrastructure effectively

Establish a balanced feedback loop between DGX Cloud and other organizations at NVIDIA to align and unify telemetry requirements and mentorship for external partners

Drive the end-to-end telemetry lifecycle for upcoming NVIDIA Cloud Providers (NCPs)

Participate in the early product lifecycle (Day -1 / Day 0) to examine the Plan of Record (POR) for new silicon, systems, firmware, and software architectures

Establish standard telemetry operations mentorship across NVIDIA

Drive a program for NVIDIA’s attestation platform to verify device integrity, authenticity, and trust across the accelerated computing ecosystem

Provide leadership and mentorship to the DGXC TPM organization, driving process improvements

Actively improve day-to-day efficiency by demonstrating and building AI tools to automate manual workflows like Jira management

Develop and execute a robust communication strategy to ensure cross-functional visibility on overall program progress

Present regularly to NVIDIA's executive leadership team

How You'll Work.

Team & Collaboration

Collaborate with hardware/software supply teams, DGXC operations, and external Cloud Service Providers (CSPs and NCPs); Work closely with our Engineering, Infrastructure, and Software teams; Act as the interface between cross-functional organizations; Build consensus and drive program success across diverse business units

Communication Scope

excellent communication skills; technical presentation skills; presenting regularly to NVIDIA's executive leadership team; robust communication strategy

Process & Methodology

technical program management, planning, execution, driving the planning and execution of large-scale engineering, cloud infrastructure, and observability programs, managing cloud infrastructure, managing complex feedback loops, resolving misaligned requirements, program management tools, Agile framework, strategic thinking, tactical thinking, driving program success, communication strategy

Full Job Description

NVIDIA's deep learning platforms lead innovation, significantly influencing multiple fields, and are embraced by top academic institutions, startups, and major Internet companies worldwide. We're looking for a seasoned and highly skilled Principal Technical Program Manager (TPM) to join our NVIDIA DGX Cloud team. This is an exciting opportunity for a passionate, driven, and creative individual to provide outstanding value to our DGX Cloud customers. We are seeking a TPM who has deep knowledge of observability, systems telemetry, and cloud infrastructure operations. You will play a key role collaborating with hardware/software supply teams, DGXC operations, and external Cloud Service Providers (CSPs and NCPs). Together, you will develop unified telemetry mentorship and confirm operational readiness worldwide.

What you'll be doing:

As a DGX Cloud Principal Technical Program Manager, you will work closely with our Engineering, Infrastructure, and Software teams. You will lead important programs focused on telemetry and data center fleet health & management. Your role will be essential in developing core capabilities for DGX Cloud. You will make sure that operations and advanced tenants receive the telemetry required to troubleshoot, debug, and manage AI infrastructure effectively.

  • Establishing a balanced feedback loop between DGX Cloud and other organizations at NVIDIA to align and unify telemetry requirements and mentorship for external partners.
  • Driving the end-to-end telemetry lifecycle for upcoming NVIDIA Cloud Providers (NCPs). This includes ensuring requirements are committed, delivered, and ingested into a centralized telemetry platform to enable DGXC operations.
  • Participating in the early product lifecycle (Day -1 / Day 0) to examine the Plan of Record (POR) for new silicon, systems, firmware, and software architectures (example: VR). This ensures telemetry requirements for advanced tenants are coordinated.
  • Collaborating across technical domains (includ

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
