NVIDIA
deep learning
Senior Technical Program Manager, Cloud Infrastructure, Observability and Systems Monitoring
“Senior Technical Program Manager, Cloud Infrastructure, Observability and Systems Monitoring at NVIDIA. Skills: Technical Program Management, Cloud Infrastructure, Observability, Systems Monitoring, Telemetry. Lead important programs focused on telemetry and data center fleet health & management. Develop core capabilities for DGX Cloud”
What You'll Achieve.
provide outstanding value to our DGX Cloud customers; develop unified telemetry mentorship and confirm operational readiness worldwide; ensure requirements are committed, delivered, and ingested into a centralized telemetry platform to enable DGXC operations; ensure telemetry requirements for advanced tenants are coordinated; establish standard telemetry operations mentorship across NVIDIA; verify device integrity, authenticity, and trust across the accelerated computing ecosystem; drive process improvements; automate manual workflows; ensure cross-functional visibility on overall program progress
Industry & Context.
troubleshoot; debug; resolving misaligned requirements; strategic thinking; tactical thinking
What They're Looking For.
Must Have
- 12+ years of technical program management experience, specifically driving the planning and execution of large-scale engineering, cloud infrastructure, and observability programs within a matrixed organization
- Extensive practical experience in managing cloud infrastructure, ideally acquired from employment at a leading Cloud Service Provider (CSP)
- Proven ability to act as the interface between cross-functional organizations, effectively managing complex feedback loops and resolving misaligned requirements
- Expert-level proficiency with Jira, Smartsheet, or similar program management tools, with the ability to confidently guide engineering teams on their effective use within an Agile framework
- Outstanding strategic and tactical thinking abilities, coupled with a capacity to build consensus and drive program success across diverse business units
- Excellent communication and technical presentation skills, particularly for executive audiences
- BS or MS in Electrical Engineering or Computer Science, or equivalent experience
Nice to Have
- Comprehensive knowledge of NVIDIA architectures and interconnects, encompassing deployment, bring-up, and telemetry requirements for GPUs, NVLink, and InfiniBand
- Experience with technologies like OpenTelemetry (OTel), Grafana, WarpStream, VictoriaMetrics, and Loki
- Familiarity with cloud platform architecture, cloud-native services, and Kubernetes
- A highly enthusiastic, upbeat, responsive, and passionate individual who actively identifies process improvement opportunities and guides teams through ambiguity
What You'll Do.
Lead important programs focused on telemetry and data center fleet health & management
Develop core capabilities for DGX Cloud
Ensure operations and advanced tenants receive the required telemetry to troubleshoot, debug, and manage AI infrastructure effectively
Establish a balanced feedback loop between DGX Cloud and other organizations at NVIDIA to align and unify telemetry requirements and mentorship for external partners
Drive the end-to-end telemetry lifecycle for upcoming NVIDIA Cloud Providers (NCPs)
Participate in the early product lifecycle (Day -1 / Day 0) to examine the Plan of Record (POR) for new silicon, systems, firmware, and software architectures
Establish standard telemetry operations mentorship across NVIDIA
Drive a program for NVIDIA’s attestation platform to verify device integrity, authenticity, and trust across the accelerated computing ecosystem
Provide leadership and mentorship to the DGXC TPM organization, driving process improvements
Actively improve day-to-day efficiency by demonstrating and building AI tools to automate manual workflows like Jira management
Develop and accomplish a robust communication strategy to ensure cross-functional visibility on overall program progress
Present regularly to NVIDIA's executive leadership team
How You'll Work.
Team & Collaboration
Collaborate with hardware/software supply teams, DGXC operations, and external Cloud Service Providers (CSPs and NCPs); Work closely with our Engineering, Infrastructure, and Software teams; Act as the interface between cross-functional organizations; Build consensus and drive program success across diverse business units
Communication Scope
excellent communication skills; technical presentation skills; presenting regularly to NVIDIA's executive leadership team; robust communication strategy
Process & Methodology
technical program management, planning, execution, driving the planning and execution of large-scale engineering, cloud infrastructure, and observability programs, managing cloud infrastructure, managing complex feedback loops, resolving misaligned requirements, program management tools, Agile framework, strategic thinking, tactical thinking, driving program success, communication strategy
Full Job Description
NVIDIA's deep learning platforms lead innovation, significantly influencing multiple fields and embraced by top academic institutions, startups, and major Internet companies worldwide. We're looking for a seasoned and highly skilled Principal Technical Program Manager (TPM) to join our NVIDIA DGX Cloud team. This is an exciting opportunity for a passionate, driven, and creative individual to provide outstanding value to our DGX Cloud customers. We are seeking a TPM who has deep knowledge of observability, systems telemetry, and cloud infrastructure operations. You will play a key role collaborating with hardware/software supply teams, DGXC operations, and external Cloud Service Providers (CSPs and NCPs). Together, you will develop unified telemetry mentorship and confirm operational readiness worldwide.

What you'll be doing:

As a DGX Cloud Principal Technical Program Manager, you will work closely with our Engineering, Infrastructure, and Software teams. You will lead important programs focused on telemetry and data center fleet health & management. Your role will be essential in developing core capabilities for DGX Cloud. You will make sure that operations and advanced tenants receive the telemetry required to troubleshoot, debug, and manage AI infrastructure effectively.

- Establishing a balanced feedback loop between DGX Cloud and other organizations at NVIDIA to align and unify telemetry requirements and mentorship for external partners.
- Driving the end-to-end telemetry lifecycle for upcoming NVIDIA Cloud Providers (NCPs). This includes ensuring requirements are committed, delivered, and ingested into a centralized telemetry platform to enable DGXC operations.
- Participating in the early product lifecycle (Day -1 / Day 0) to examine the Plan of Record (POR) for new silicon, systems, firmware, and software architectures (example: VR). This ensures telemetry requirements for advanced tenants are coordinated.
- Collaborating across technical domains (includ
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.