NVIDIA

AI computing

Director, Site Reliability and Software Engineering - DGX Cloud

$320–575k Santa Clara, California, United States FULL TIME Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Director candidates.

The Brief

“Director, Site Reliability and Software Engineering - DGX Cloud at NVIDIA. Skills: Site Reliability and Software Engineering, multi-colo distributed NVIDIA GPU cloud clusters, cluster automation, operational excellence, scalable distributed systems, technical leadership, management, deep learning platform. Manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters. Contribute to product strategy.”

What You'll Achieve.

Deliver projects on tight deadlines; maximize decision making and executive visibility

Industry & Context.

AI computing

Problems you'll solve

Resolving blockages; solving reliability problems in both our internally critical and our externally visible systems

What They're Looking For.

Must Have

12+ overall years of experience in engineering management

5+ years of leadership

Bachelor / Master degree in Computer Science, or equivalent experience

Experience in designing and implementing large-scale distributed systems

Experience in Containers / Virtualization environments / Cluster solutions

Experience in managing Technical Support / DevOps teams

Knowledge in Unix/Linux

Experience implementing tools, process, internal instrumentation, methodologies and resolving blockages

Demonstrated people management and leadership skills

Ability to quickly learn and evaluate new technologies

Ability to influence and establish relationships with other software and IT functional groups such as development, server, storage and security teams

Nice to Have

proven track record of mentoring and coaching team members

What You'll Do.

Manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters

contribute to product strategy

leader for all aspects of cluster automation and operational excellence planning

Manage a team of Software and Site Reliability engineers, including program development, task planning and code reviews

Define team strategy and roadmap, and drive adoption of scalable SDLC practices and modern practices in NVIDIA’s DGX Cloud Computing environment

Drive technical projects and provide leadership in an innovative and fast-paced environment

Be responsible for the overall planning, tracking and success of technical projects

Contribute technically to the technical projects for DGX Cloud Computing Services

Interact with key internal stakeholders to provide operational and financial clarity on technical spend

Drive decision making, visibility and operational rigor across business analytic initiatives such as budget and project & portfolio reporting

Lead efforts related to executive reporting and operational CTO metrics, focusing on continuous improvement and evolution to maximize decision making and executive visibility

How You'll Work.

Team & Collaboration

work with customers across the entire company; work across multiple different levels of technical and organizational leadership; Work closely with project and product management teams; Interact with key internal stakeholders; establish relationships with other software and IT functional groups such as development, server, storage and security teams

Process & Methodology

program development, task planning, overall planning, tracking and success of technical projects, project & portfolio reporting

Full Job Description

NVIDIA's invention of the GPU ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company”. We are looking to grow our company, and grow our teams with the smartest people in the world. We are looking for you.

NVIDIA's GPU is a hit in the market for deep learning, which is used in the research community and in industry to help solve many big data problems such as computer vision, speech recognition & translation, life science, image recognition, and natural language processing. NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform that runs everywhere. Data scientists and researchers can now rapidly build, train, and deploy neural network models to address some of the most complicated AI challenges. In this environment, the NVIDIA GPU Cloud computing team is looking for leaders to work on a world-class deep learning platform.

**What you'll be doing:** As a Site Reliability and Software Engineering leader in the DGXC Cloud Reliability organization, you will manage the software, automation, and operations of the multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy. You will be the leader for all aspects of cluster automation and operational excellence planning and grow your team. You thrive in a fast-paced iterative engineering environment and have experience delivering scalable distributed systems. Most importantly, you will have a track record of having past teams and cross-functional partners respect you as both a technical leader and manager, and be able to work via influence and not direct authority when needed.

The NVIDIA GPU Cloud Computing team works with customers across the entire company, and the ability to work across multiple different levels of technical and organizational leadership is critical. Operating with scale and speed, our world-class software engineers are

SKILL SIGNAL 41 detected · ranked by frequency

engineering management ×3

leadership ×3

designing and implementing large-scale distributed systems ×3

managing Technical Support / DevOps teams ×3

technical excellent bars ×3

delivering projects in tight deadlines ×3

implementing tools, process, internal instrumentation, methodologies and resolving blockages ×3

people management ×3

Site Reliability and Software Engineering ×2

multi-colo distributed NVIDIA GPU cloud clusters ×2

cluster automation ×2

operational excellence ×2

scalable distributed systems ×2

technical leadership ×2

management ×2

Deep learning platform ×2

Unix/Linux ×2

GPU

Deep learning

Containers

Virtualization

Cluster solutions

SDLC practices

test infrastructure

program development

task planning

product strategy

operational excellence planning

scalable SDLC practices

technical projects

product development

BEHAVIOURAL

thrive in a fast-paced iterative engineering environment; experience delivering scalable distributed systems; track record of having past teams and cross-functional partners respect you as both a technical leader and manager; work via influence and not direct authority when needed; ability to work across multiple different levels of technical and organizational leadership; creative and autonomous engineer; real passion for technology; mentoring and coaching team members

Role Details

Seniority executive

Experience 12+ yrs

Level Director

Work Mode Remote Friendly

Type FULL TIME

Education Bachelor / Master degree in Computer Science, or equivalent

Salary Band 200k+

AI-Extracted Insights

Domain Areas

deep-learning, ai-computing, nvidia-gpu-cloud-ngc, gpu-accelerated-platform, computer-vision, speech-recognition-translation, life-science, image-recognition

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
