NVIDIA

Artificial Intelligence, High Performance Computing and Visualization

Senior Deep Learning Frameworks CUDA Software Engineer

$184–357k · Santa Clara, California, United States · FULL TIME · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Deep Learning Frameworks CUDA Software Engineer at NVIDIA. Skills: Deep Learning Frameworks, CUDA, Distributed Runtime, AI. Integrate new CUDA features and Runtime abstractions in AI frameworks. Perform deep analysis of AI workloads and frameworks”

What You'll Achieve.

Bring advanced CUDA features and Distributed Runtime technologies into AI stacks; Improve productivity and performance of AI applications; Accelerate enablement of CUDA features in AI toolkits for the community; Build speed-of-light multi-GPU multi-node solutions; Facilitate building next-gen DL frameworks; Enhance performance and programmability; Ensure exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products

Industry & Context.

Artificial Intelligence, High Performance Computing and Visualization

Problems you'll solve

Deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate; Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads

What They're Looking For.

Must Have

BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience); 8+ years of relevant industry experience or equivalent academic experience after completing the degree; Development experience with Deep Learning Frameworks such as PyTorch and JAX, and Inference Engines such as TRT-LLM, vLLM, and SGLang; Rapid prototyping and development with Python, C++, CUDA, or related DSLs; Solid grasp of AI models, parallelisms, and/or compiler technologies (e.g., torch.compile); Experience conducting performance benchmarking on AI clusters; Familiarity with at least one performance profiler toolchain (PyTorch profiler, NVIDIA Nsight Systems); Understanding of HPC/AI communication concepts; Good understanding of computer system architecture, HW-SW interactions, and operating systems principles (aka systems software fundamentals); Adaptability and passion to learn new frameworks and tools; Flexibility to work and communicate effectively across different teams and timezones

Nice to Have

Deep expertise in the performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText, etc.); Hands-on experience with CUDA, specific communication libraries (e.g., NCCL, MPI, UCX), and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism); Expertise in one or more of these areas: Training, Distributed inference, MoE, Reinforcement Learning, kernel authoring (on CUDA, Triton, CuTe, etc.); Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile); Experience with programming for compute & communication overlap in distributed runtime

What You'll Do.

Integrate new CUDA features and Runtime abstractions in AI frameworks

Perform deep analysis of AI workloads and frameworks

Identify requirements and opportunities to innovate in the lower layers of the stack

Own and drive improvements in the AI Compiler-Runtime interface

Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads

Influence the roadmap of core CUDA

Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning

and maintainable code

How You'll Work.

Team & Collaboration

Collaborate hands-on with teams working on the latest AI models; Collaborate with a very dynamic team across multiple time zones; Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors and CUDA driver experts

Communication Scope

Communicate effectively across different teams and timezones

Full Job Description

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

We are looking for a motivated Deep Learning engineer to bring advanced CUDA features and Distributed Runtime technologies into AI stacks, including PyTorch, TRT-LLM, vLLM, SGLang, JAX, etc. You will be working with the team that created core CUDA features and runtimes for scaling Deep Learning and HPC applications. Your customers will have diverse multi-GPU demands, ranging from training on scales up to 100K GPUs to inference down at microsecond latency. CUDA features improve both productivity and performance of AI applications. Your work in AI toolkits will accelerate enabling those for the community. This is an outstanding opportunity for someone with an AI background to advance the state of the art in this space. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Integrate new CUDA features and Runtime abstractions in AI frameworks: from PoC to performance analysis to production
  • Perform deep analysis of AI workloads and frameworks to identify requirements and opportunities to innovate in the lower layers of the stack. Collaborate hands-on with teams working on the latest AI models.
  • Own and drive improvements in the AI Compiler-Runtime interface to build speed-of-light multi-GPU multi-node solutions.
  • Design fault-tolerant and elastic solutions for large-scale or dynamic AI workloads.
  • Influence the roadmap of core CUDA to facilitate building next-gen DL frameworks.
  • Collaborate with a very dynamic team across multiple time zones.
  • Colla

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
