NVIDIA

Senior Service Reliability Engineer - EDA Infrastructure

Bengaluru, India FULL TIME
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Service Reliability Engineer - EDA Infrastructure at NVIDIA. Skills: Service Reliability Engineering, Linux System Administration, Automation, High-Availability Systems, Incident Response. Build and operate a global Service Reliability Operations Center supporting Hardware Infrastructure, ensuring scalability, resilience, and near 100% availability.”

What You'll Achieve.

Ensure scalability, resilience, and near 100% availability; improve reliability; reduce incident frequency and impact; drive rapid resolution when issues occur; proactively detect issues; enhance the customer experience; measure the effectiveness of production environments

Industry & Context.

Problems you'll solve

diagnose issues; identify root causes; implement effective resolutions

Eligibility Requirements

Operate in a 24/7 follow-the-sun support model; work a 4-day, 10-hour schedule, including either Saturday or Sunday, with flexible early or late shifts

What They're Looking For.

Must Have

5+ years of experience administering large-scale production systems; 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC); expert-level knowledge of Linux system administration; automation using Ansible and/or Python

Nice to Have

Advanced hands-on experience with Kubernetes, SLURM, and large-scale cluster management; familiarity with GPU hardware and high-performance computing environments; experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA); cloud experience (AWS, Azure, GCP), with on-prem expertise preferred

What You'll Do.

Build and operate a global Service Reliability Operations Center supporting Hardware Infrastructure

Ensure scalability, resilience, and near 100% availability

Reduce incident frequency and impact

Drive rapid resolution when issues occur

Implement monitoring, alerting, and observability solutions

Operate in a 24/7 follow-the-sun support model

Monitor and manage large-scale production compute and storage environments

Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents

Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions

How You'll Work.

Team & Collaboration

collaborate with SRE, Security, and DevOps; partner with development teams; work successfully with multi-functional teams; coordinating effectively across organizational boundaries and geographies

Communication Scope

communication skills

Full Job Description

NVIDIA is seeking System Administrator/DevOps Engineers to help build and operate a global Service Reliability Operations Center supporting our Hardware Infrastructure. This role focuses on ensuring scalability, resilience, and near 100% availability.

As part of this team, you will collaborate with SRE, Security, and DevOps to improve reliability, reduce incident frequency and impact, and drive rapid resolution when issues occur. You will partner with development teams to implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience. You will also help evaluate and select the tools and technologies used to monitor, operate, and measure the effectiveness of our production environments.

**What you will be doing:**

* Operate in a 24/7 follow-the-sun support model spanning multiple continents, with direct reporting to a U.S.-based manager
* Work a 4-day, 10-hour schedule, including either Saturday or Sunday, with flexible early or late shifts to ensure continuous global coverage across U.S. and India teams
* Monitor and manage large-scale production compute and storage environments to ensure high availability and performance
* Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents
* Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions

**What we need to see:**

* Highly motivated with strong communication skills, you have the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
* 5+ years of experience administering large-scale production systems.
* 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
* BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experi


How to Apply on Workday

  • Workday has a multi-step form — save your progress after every section.
  • "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
  • Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
  • Job requisition numbers are useful when following up with HR by email.
