NVIDIA
Technology
Senior Systems Software Engineer, AI Stack and Performance - DGX Station
Neural analysis suggests this role is
optimal for Senior candidates.
“Senior Systems Software Engineer, AI Stack and Performance - DGX Station at NVIDIA. Skills: AI Stack Performance, Systems Software Engineering, GPU Optimization, Deep Learning Frameworks. Own AI application readiness. Define "ready to ship" criteria”
What You'll Achieve.
Deliver best-in-class performance; Increase throughput on DGX Station; Make performance data actionable
Industry & Context.
Root cause analysis; Troubleshooting
What They're Looking For.
Must Have
BS or MS or equivalent experience, 12+ years systems software engineering, Hands-on AI/ML workload optimization, GPU performance analysis experience, Deep learning infrastructure experience, Proficiency with PyTorch, TensorFlow, or JAX, Experience profiling GPU workloads, Experience optimizing GPU workloads, Ability to read GPU traces, Understanding of GPU architecture, Experience with inference optimization, Proficiency in C/C++, CUDA, and Python, Comfortable reading GPU kernels, Comfortable modifying GPU kernels
Nice to Have
Optimizing LLM training, Optimizing LLM inference, Multi-GPU NVIDIA systems experience, Contributions to open-source AI frameworks, Contributions to CUDA libraries, Contributions to inference engines, Multi-GPU communication optimization, NCCL tuning experience, NVLink utilization experience, Collective operations experience, Parallel training strategies experience, Collaborating with compiler teams, Collaborating with hardware architecture teams, Experience shipping AI-powered products
What You'll Do.
Own AI application readiness
Define "ready to ship" criteria
Close performance gaps
Profile LLM workloads
Optimize LLM workloads
Characterize performance
Identify performance regressions
Implement optimizations
Improve kernel fusion
Improve graph execution
Improve operator scheduling
Improve memory management
Translate platform constraints
Validate multi-user scenarios
Validate concurrent workload scenarios
Ensure reliable performance
Validate NVIDIA AI software stack
Ensure version compatibility
Ensure functional correctness
Ensure performance parity
Build benchmarking infrastructure
Maintain benchmarking infrastructure
Make performance data visible
Support customer deployment readiness
Support field critical issues
How You'll Work.
Team & Collaboration
Cross-functional teams; Framework teams; Compiler teams; GPU architecture teams; Product management; OEM/OSV partners
Communication Scope
Performance data visibility
Full Job Description
DGX Station (Galaxy) is NVIDIA’s workstation-class AI computer, built on GB300 Blackwell GPUs with NVLink interconnect and delivering data-center-grade AI compute in a deskside form factor. DGX Station is shipped to OEM and OSV partners as a complete SW/FW GA release including firmware bundles, DGX BaseOS, GPU drivers, CUDA toolkit, DCGM, and DOCA/OFED. For DGX Station to deliver on its promise, AI applications like NemoClaw, LLM inference via NIM, Hermes agents, and deep learning frameworks must run production-ready out of the box, optimized for the multi-GPU, high-bandwidth architecture of this platform.

We are looking for a deeply technical systems software engineer who will own AI stack readiness on DGX Station. You will profile workloads, identify bottlenecks across GPU compute, NVLink, memory, and host interconnects, drive optimizations across the full stack (from GPU kernels through frameworks to applications), and work hands-on with framework, compiler, and GPU architecture teams to ensure DGX Station delivers best-in-class performance for real AI workloads in multi-user and multi-GPU configurations.

**What you’ll be doing:**

* AI Application Readiness: Own production readiness of AI applications on DGX Station (NemoClaw, Hermes agents, NIM microservices, and key customer workloads). Define “ready to ship” criteria, run validation, and close every gap between “it runs” and “it runs well” across single-GPU and multi-GPU configurations.
* DL Framework Performance: Work cross-functionally with different orgs to profile and optimize LLM and deep learning workloads (PyTorch, TensorFlow, JAX) across training and inference on the GB300 Blackwell multi-GPU architecture. Characterize performance across model sizes, batch sizes, precision modes (FP16, INT8, FP8), and GPU scaling (single-GPU vs. multi-GPU with NVLink) to establish benchmarks and identify regressions.
* System-Level Optimization: Identify bottlenecks in GPU compute, NVLink bandwidth, host memory, PCIe, and CPU–GPU
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.