NVIDIA
Technology
Senior Systems Software Engineer, AI Stack and Performance - DGX Station
“Senior Systems Software Engineer, AI Stack and Performance - DGX Station at NVIDIA. Skills: AI Stack Performance, GPU Optimization, Systems Software Engineering, Deep Learning Frameworks. Own AI application readiness. Define "ready to ship" criteria”
Industry & Context.
Root cause analysis; Troubleshooting
What They're Looking For.
Must Have
BS or MS or equivalent experience, 12+ years systems software engineering, Hands-on AI/ML workload optimization, GPU performance analysis experience, Deep learning infrastructure experience, Proficiency with PyTorch, TensorFlow, or JAX, Experience profiling GPU workloads, Experience optimizing GPU workloads, Ability to read GPU traces, Understanding of GPU architecture, Experience with inference optimization, Proficiency in C/C++, CUDA, and Python, Comfortable reading GPU kernels, Comfortable modifying GPU kernels
Nice to Have
Optimizing LLM training on multi-GPU NVIDIA systems, Optimizing LLM inference on multi-GPU NVIDIA systems, Contributions to open-source AI frameworks, Contributions to CUDA libraries, Contributions to inference engines, Multi-GPU communication optimization experience, NCCL tuning experience, NVLink utilization experience, Collective operations experience, Parallel training strategies experience, Collaborating with compiler teams, Collaborating with hardware architecture teams, Experience shipping AI-powered products
What You'll Do.
Own AI application readiness
Define "ready to ship" criteria
Close performance gaps
Profile LLM workloads
Optimize LLM workloads
Characterize performance
Identify performance regressions
Implement optimizations
Improve kernel fusion
Improve graph execution
Improve operator scheduling
Improve memory management
Translate platform constraints
Validate multi-user scenarios
Validate concurrent workload scenarios
Ensure reliable performance
Validate NVIDIA AI software stack
Ensure version compatibility
Ensure functional correctness
Ensure performance parity
Build benchmarking infrastructure
Maintain benchmarking infrastructure
Make performance data visible
Make performance data actionable
Understand target use cases
Ensure compelling performance
Support customer deployment readiness
Support field critical issues
How You'll Work.
Team & Collaboration
Cross-functional teams; Framework teams; Compiler teams; GPU architecture teams; Product management; OEM/OSV partners
Full Job Description
DGX Station (Galaxy) is NVIDIA’s workstation-class AI computer, built on GB300 Blackwell GPUs with NVLink interconnect and delivering data-center-grade AI compute in a deskside form factor. DGX Station is shipped to OEM and OSV partners as a complete SW/FW GA release including firmware bundles, DGX BaseOS, GPU drivers, CUDA toolkit, DCGM, and DOCA/OFED. For DGX Station to deliver on its promise, AI applications like NemoClaw, LLM inference via NIM, Hermes agents, and deep learning frameworks must run production-ready out of the box, optimized for the multi-GPU, high-bandwidth architecture of this platform.

We are looking for a deeply technical systems software engineer who will own AI stack readiness on DGX Station. You will profile workloads, identify bottlenecks across GPU compute, NVLink, memory, and host interconnects, drive optimizations across the full stack (from GPU kernels through frameworks to applications), and work hands-on with framework, compiler, and GPU architecture teams to ensure DGX Station delivers best-in-class performance for real AI workloads in multi-user and multi-GPU configurations.

**What you’ll be doing:**

* AI Application Readiness: Own production readiness of AI applications on DGX Station (NemoClaw, Hermes agents, NIM microservices, and key customer workloads). Define “ready to ship” criteria, run validation, and close every gap between “it runs” and “it runs well” across single-GPU and multi-GPU configurations.
* DL Framework Performance: Work cross-functionally with different orgs to profile and optimize LLM and deep learning workloads (PyTorch, TensorFlow, JAX) across training and inference on the GB300 Blackwell multi-GPU architecture. Characterize performance across model sizes, batch sizes, precision modes (FP16, INT8, FP8), and GPU scaling (single-GPU vs. multi-GPU with NVLink) to establish benchmarks and identify regressions.
* System-Level Optimization: Identify bottlenecks in GPU compute, NVLink bandwidth, host memory, PCIe, and CPU–GPU
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.