NVIDIA

AI

Senior Software Engineer, CUDA Deep Learning Systems

$184–357k · Santa Clara, California, United States · FULL TIME · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Software Engineer, CUDA Deep Learning Systems at NVIDIA. Skills: CUDA, Deep Learning, C++, Python, distributed computing, systems programming, computer architecture, kernel optimization. Explore, research, and prototype novel systems optimizations for advanced deep learning models at the intersection of high-level DL frameworks and low-level CUDA through modeling, simulation, and silicon prototyping. Architect and optimize distributed computing systems that scale seamlessly from a single node to massive, cluster-scale supercomputing environments.”

What You'll Achieve.

  • Unlock maximum hardware performance for emerging AI workloads.
  • Improve accelerator compute utilization, memory bandwidth, cross-node network communication efficiency, and programmability.
  • Accelerate new paradigms in deep learning.
  • Ensure exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products.

Industry & Context.

AI
Problems you'll solve

identify and resolve performance bottlenecks; analytical approach

What They're Looking For.

Must Have

  • BS, MS, or PhD degree in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
  • 8+ years of relevant industry experience or equivalent academic experience after degree achievement.
  • Proficiency in C++ and Python programming.
  • Solid background in the fundamentals of Deep Learning with a focus on transformers.
  • Understanding of distributed computing principles, multi-node scaling, and the unique performance challenges of cluster-scale execution.
  • Proven experience in systems programming, computer architecture, and low-level systems performance optimization.
  • Familiarity with deep learning accelerator architectures such as the GPU and hands-on experience with CUDA programming and kernel optimization.
  • An analytical approach with experience using profiling tools to deeply understand software performance on hardware.
  • Experience profiling and optimizing innovative vision models, generative AI architectures, or diffusion models.
  • Background in deep learning compilers, both graph-level and codegen (e.g., Triton, XLA, torch.compile).

Nice to Have

  • Deep expertise in the performance internals and execution graphs of major deep learning autograd, training, and inference frameworks (e.g., PyTorch, JAX, TensorRT, vLLM, SGLang, NeMo, Megatron, MaxText, etc.).
  • Hands-on experience with CUDA, communication libraries (e.g., NCCL, MPI, UCX), and distributed machine learning techniques (e.g., pipeline parallelism, tensor parallelism).
  • Knowledge of numerical methods, low-precision arithmetic (e.g., NVFP4, MXFP4, FP8, INT8), and their implications on deep learning model accuracy and performance.
  • Familiarity with systems requirements for Reinforcement Learning (RL) or highly parallel simulation environments and/or research background in machine learning systems or adjacent fields.
  • Experience with machine learning, especially agentic systems, applied to systems problems.

What You'll Do.

Explore, research, and prototype novel systems optimizations for advanced deep learning models at the intersection of high-level DL frameworks and low-level CUDA through modeling, simulation, and silicon prototyping

Architect and optimize distributed computing systems that scale seamlessly from a single node to massive, cluster-scale supercomputing environments

Design, implement, and optimize custom high-performance CUDA kernels tailored to emerging neural network architectures and workloads

Analyze complex hardware-software interactions to identify and resolve performance bottlenecks in both training and inference pipelines

Develop exploratory tools and runtime systems to profile and accelerate new paradigms in deep learning

… and maintainable code, ensuring exploratory prototypes can smoothly transition into open-source releases, upstream framework integrations, internal tools, or closed-source commercial products

How You'll Work.

Team & Collaboration

Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors and CUDA driver experts to co-design systems and algorithms that improve accelerator compute utilization, memory bandwidth, cross-node network communication efficiency and programmability

Full Job Description

We are looking for an experienced and highly motivated software professional to work on pioneering initiatives and projects at the intersection of CUDA and Deep Learning Systems. As the complexity and scale of artificial intelligence continue to grow, the intersection of advanced deep learning architectures, massive-scale distributed computing, and low-level hardware optimization has never been more critical. Our team is dedicated to exploring and prototyping next-generation ideas that bridge the gap between deep learning algorithms and CUDA, pushing the boundaries of what is possible on modern accelerator architectures.

Join our dynamic, research-oriented team to help unlock maximum hardware performance for emerging AI workloads. You will be a crucial member of a highly technical group exploring uncharted territories in model optimization, custom kernel development, and cluster-scale AI systems design. If you are passionate about the fundamentals of deep learning and thrive on squeezing every ounce of performance out of advanced computing systems from a single GPU to supercomputer clusters, we want you on our team!

What you will be doing:

  • Explore, research, and prototype novel systems optimizations for advanced deep learning models at the intersection of high-level DL frameworks and low-level CUDA through modeling, simulation, and silicon prototyping.
  • Architect and optimize distributed computing systems that scale seamlessly from a single node to massive, cluster-scale supercomputing environments.
  • Design, implement, and optimize custom high-performance CUDA kernels tailored to emerging neural network architectures and workloads.
  • Analyze complex hardware-software interactions to identify and resolve performance bottlenecks in both training and inference pipelines.
  • Collaborate closely with AI researchers, HW and SW architects, kernel and compiler authors and CUDA driver experts to co-design systems and algorithms that improve accelerator compute utilization, memory bandwidth, cross-node network communication efficiency and programmability.

How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
