NVIDIA
Artificial Intelligence
Senior System Software Engineer, NCCL - Partner Enablement
“Senior System Software Engineer, NCCL - Partner Enablement at NVIDIA. Skills: NCCL, C/C++ programming, high performance networking, Linux, Python, parallel programming, communication runtime. Engage with our partners and customers to root cause functional and performance issues reported with NCCL. Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters”
Industry & Context.
root cause functional and performance issues; isolate issues
What They're Looking For.
Must Have
B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience
Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, test design
Experience working with engineering or academic research community supporting HPC or AI
Practical experience with high performance networking: Infiniband/RoCE/Ethernet networks, RDMA, topologies, congestion control
Expert in Linux fundamentals and a scripting language, preferably Python
Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
Adaptability and passion to learn new areas and tools
Flexibility to work and communicate effectively across different teams and timezones
Nice to Have
Experience conducting performance benchmarking and developing infrastructure on HPC clusters
Prior system administration experience, especially for large clusters
Experience debugging network configuration issues in large scale deployments
Familiarity with CUDA programming and/or GPUs
Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow
Deep understanding of technology
What You'll Do.
Engage with our partners and customers to root cause functional and performance issues reported with NCCL
Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
Document and conduct trainings/webinars for NCCL
Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support
How You'll Work.
Team & Collaboration
Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support; communicate effectively across different teams and timezones
Communication Scope
communicate effectively across different teams and timezones
Full Job Description
NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

Come work for the team that brought you NCCL, NVSHMEM & GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications! We are looking for a motivated Partner Enablement Engineer to guide our key partners and customers with NCCL. Most DL/HPC applications run on large clusters with high-speed networking (Infiniband, RoCE, Ethernet). This is an outstanding opportunity to get an end-to-end understanding of the AI networking stack. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

**What you will be doing:**

* Engage with our partners and customers to root cause functional and performance issues reported with NCCL
* Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
* Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
* Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
* Document and conduct trainings/webinars for NCCL
* Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support.

**What we need to see:**

* B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience. Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
* Excellent C/C++ programming skills, including debugging, profiling, code optimizati
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.