NVIDIA
Senior Service Reliability Engineer - EDA Infrastructure
Neural analysis suggests this role is optimal for Senior candidates.
“Senior Service Reliability Engineer - EDA Infrastructure at NVIDIA. Skills: Service Reliability Engineering, Linux system administration, automation, Kubernetes, observability. Build and operate a global Service Reliability Operations Center supporting our Hardware Infrastructure, ensuring scalability, resilience, and near 100% availability.”
What You'll Achieve.
ensuring scalability, resilience, and near 100% availability; reduce incident frequency and impact; drive rapid resolution when issues occur; proactively detect issues and enhance the customer experience; ensure high availability and performance
Industry & Context.
diagnose issues; identify root causes; implement effective resolutions
Operate in a 24/7 follow-the-sun support model; work a 4-day, 10-hour schedule, including either Saturday or Sunday, with flexible early or late shifts
What They're Looking For.
Must Have
5+ years of experience administering large-scale production systems; 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC); expert-level knowledge of Linux system administration; automation using Ansible and/or Python
Nice to Have
Advanced hands-on experience with Kubernetes, SLURM, and large-scale cluster management; familiarity with GPU hardware and high-performance computing environments; experience with observability and incident management tools (Grafana, OpenTelemetry, PagerDuty, JIRA); cloud experience (AWS, Azure, GCP), with a preference for on-prem expertise
What You'll Do.
Build and operate a global Service Reliability Operations Center supporting our Hardware Infrastructure
Ensure scalability, resilience, and near 100% availability
Reduce incident frequency and impact
Drive rapid resolution when issues occur
Implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience
Evaluate and select the tools and technologies used to monitor, operate, and measure the effectiveness of our production environments
Operate in a 24/7 follow-the-sun support model
Monitor and manage large-scale production compute and storage environments to ensure high availability and performance
Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents
Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions
How You'll Work.
Team & Collaboration
collaborate with SRE, Security, and DevOps; partner with development teams; work successfully with multi-functional teams; coordinating effectively across organizational boundaries and geographies
Communication Scope
Strong communication skills
Full Job Description
NVIDIA is seeking System Administrator/DevOps Engineers to help build and operate a global Service Reliability Operations Center supporting our Hardware Infrastructure. This role focuses on ensuring scalability, resilience, and near 100% availability. As part of this team, you will collaborate with SRE, Security, and DevOps to improve reliability, reduce incident frequency and impact, and drive rapid resolution when issues occur. You will partner with development teams to implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience. You will also help evaluate and select the tools and technologies used to monitor, operate, and measure the effectiveness of our production environments.

**What you will be doing:**

* Operate in a 24/7 follow-the-sun support model spanning multiple continents, with direct reporting to a U.S.-based manager
* Work a 4-day, 10-hour schedule, including either Saturday or Sunday, with flexible early or late shifts to ensure continuous global coverage across U.S. and India teams
* Monitor and manage large-scale production compute and storage environments to ensure high availability and performance
* Utilize alerts, alarms, and observability tools to proactively detect, prevent, and respond to incidents
* Apply deep systems knowledge to analyze logs, metrics, and system behavior to diagnose issues, identify root causes, and implement effective resolutions

**What we need to see:**

* Highly motivated with strong communication skills, you have the ability to work successfully with multi-functional teams, principals, and architects, coordinating effectively across organizational boundaries and geographies.
* 5+ years of experience administering large-scale production systems.
* 3+ years of experience in high-availability Internet, Cloud, or Data Center environments (Systems Administration, SRE, or NOC).
* BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experi
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.