NVIDIA
Technology
DevOps Engineer, HPC and LSF
Neural analysis suggests this role is optimal for mid-level candidates.
“DevOps Engineer, HPC and LSF at NVIDIA. Skills: HPC, LSF, DevOps, Automation. Manage workload schedulers. Support resource schedulers.”
What You'll Achieve.
Improve engineer productivity; Improve time to market
Industry & Context.
Problem-solving; Troubleshooting; Performance optimization
What They're Looking For.
Must Have
3+ years of experience in a large, distributed Linux environment; BS in Computer Science or equivalent experience
Nice to Have
Experience analyzing and tuning performance for HPC or EDA workloads; Solid understanding of cluster configuration management tools; Proficiency in Perl for legacy automation scripts; Deep understanding of distributed system principles
What You'll Do.
Manage workload schedulers
Support resource schedulers
Automate configuration management
Automate operational monitoring
Develop solutions for computing resource management
Extract grid performance metrics
Leverage grid performance metrics
Troubleshoot complex issues
Ensure system reliability
Ensure system efficiency
Develop standard methodologies
Define standard methodologies
Document standard methodologies
Collaborate with domain experts
Improve chip development process utilization
How You'll Work.
Team & Collaboration
Work with diverse teams; Collaborate with domain experts
Full Job Description
NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing.

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance and drive foundational improvements and automation to improve engineers' productivity.

As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other. We use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success.

**What you’ll be doing:**

* Manage and support workload and resource schedulers in a large-scale HPC environment.
* Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.
* Develop solutions for complex computing resource management requirements.
* Extract and leverage grid performance metrics for troubleshooting and performance optimization.
* Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
* Develop, define a
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.