NVIDIA

Technology

DevOps Engineer, HPC and LSF

₹22–35L (AI estimate) · Bengaluru, India · Full Time · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for mid-level candidates.

The Brief

“DevOps Engineer, HPC and LSF at NVIDIA. Skills: HPC, LSF, DevOps, Automation. Manage workload schedulers. Support resource schedulers”

What You'll Achieve.

Improve engineer productivity; Improve time to market

Industry & Context.

Technology

Problems you'll solve

Problem-solving; Troubleshooting; Performance optimization

What They're Looking For.

Must Have

3+ years of experience in a large, distributed Linux environment; BS in Computer Science or equivalent experience

Nice to Have

Experience analyzing and tuning performance for HPC or EDA workloads; Solid understanding of cluster configuration management tools; Proficiency in Perl for legacy automation scripts; Deep understanding of distributed system principles

What You'll Do.

Manage workload schedulers

Support resource schedulers

Automate configuration management

Automate operational monitoring

Develop solutions for computing resource management

Extract grid performance metrics

Leverage grid performance metrics

Troubleshoot complex issues

Ensure system reliability

Ensure system efficiency

Develop standard methodologies

Define standard methodologies

Document standard methodologies

Collaborate with domain experts

Improve chip development process utilization

How You'll Work.

Team & Collaboration

Work with diverse teams; Collaborate with domain experts

Full Job Description

NVIDIA is the leader in AI, machine learning and datacenter acceleration. NVIDIA is expanding that leadership into datacenter networking with Ethernet switches, NICs and DPUs. NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing.

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters at high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity.

As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success.

**What you’ll be doing:**

* Manage and support workload and resource schedulers in a large-scale HPC environment.
* Automate Everything: Develop scripts to automate deployment, configuration management, and operational monitoring.
* Develop solutions for complex computing resource management requirements.
* Extract and leverage grid performance metrics for troubleshooting and performance optimization.
* Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
* Develop, define a

SKILL SIGNAL 23 detected · ranked by frequency

HPC ×3

LSF ×3

Job scheduler administration ×3

Deployment automation ×3

Configuration management ×3

Operational monitoring ×3

Performance metrics extraction ×3

Troubleshooting ×3

Standard methodologies development ×3

DevOps ×2

Automation ×2

Docker ×2

Linux

Python

Perl

Resource management

Performance optimization

System reliability

Chip development process

SLURM

CentOS

RHEL

Ansible

BEHAVIOURAL

Problem-solving

Teamwork

Role Details

Seniority mid

Experience 3–5 yrs

Level Mid

Work Mode Remote

Type FULL TIME

Education Bachelor's

Salary Band 200k+

AI-Extracted Insights

Domain Areas

hpc-environment

distributed-systems

container-technologies

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
