NVIDIA
Technology
Senior Site Reliability Engineering - Storage
Neural analysis suggests this role is optimal for Senior candidates.
“Senior Site Reliability Engineering - Storage at NVIDIA. Skills: Site Reliability Engineering, Storage Systems, Automation, SRE Concepts. Lead design, deployment, operations of storage platforms. Capture requirements from partner teams”
What You'll Achieve.
Ensure reliability, performance, and security of storage platforms; Continuously improve reliability and efficiency
Industry & Context.
Troubleshoot complex storage and performance issues; Drive root cause analysis; Debug complex, distributed systems; Debug storage performance issues; Data-driven analysis
Participate in on-call and incident response; collaborate in a follow-the-sun model
What They're Looking For.
Must Have
12+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering, with a significant focus on storage systems
Hands-on experience with design, deployment, and operations of enterprise-grade NAS, SAN, and/or Object Storage platforms
Solid understanding of SRE concepts (SLOs/SLIs, error budgets, incident management, observability, postmortems)
Proficiency with Infrastructure as Code and configuration management tools
Experience building and operating highly available, scalable infrastructure
Experience with container and virtualization platforms
Scripting or programming skills (e.g., Python, Go, Shell)
Excellent communication and collaboration skills
Bachelor’s degree in Computer Science, Computer Engineering, or a related technical field (or equivalent practical experience)
Nice to Have
Experience with storage for high-performance computing, AI/ML workloads, or large-scale data analytics
Proven ability to debug complex, distributed systems and storage performance issues
History of driving reliability improvements through data-driven analysis and automation
Experience leading technical initiatives, mentoring engineers, or acting as a technical lead on critical projects
What You'll Do.
Lead design, deployment, and operations of storage platforms
Capture requirements from partner teams
Architect storage solutions
Drive end-to-end implementation
Participate in on-call and incident response
Lead troubleshooting of complex issues
Drive root cause analysis
Define and track SLOs/SLIs
Build and maintain runbooks
Analyze capacity and usage trends
Recommend scaling or optimization strategies
Collaborate closely with SRE, infrastructure, networking, and application teams
Mentor junior engineers
Drive adoption of SRE principles
How You'll Work.
Team & Collaboration
Collaborate closely with SRE, infrastructure, networking, and application teams; Work effectively across distributed and cross-functional teams
Communication Scope
Excellent communication and collaboration skills
Process & Methodology
Lead technical initiatives; act as a technical lead on critical projects
Full Job Description
NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s an outstanding legacy of innovation that’s fueled by phenomenal technology – and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are seeking a Senior Site Reliability Engineer – Storage. You will own the reliability, performance, and scalability of our global NAS, SAN, and Object Storage platforms that power critical internal and external services. You will combine deep storage expertise with strong automation and SRE practices to design, build, and operate highly available storage systems at scale.

**What You Will Be Doing:**

* Lead design, deployment, and operations of production NAS, SAN, and Object Storage platforms, ensuring reliability, performance, and security.
* Capture requirements from partner teams, architect storage solutions, and drive end-to-end implementation for new and existing services.
* Develop, maintain, and improve automation for provisioning, configuration, monitoring, incident response, and lifecycle management of storage infrastructure.
* Participate in on-call and incident response, lead troubleshooting of complex storage and performance issues, and drive root cause analysis and preventive actions.
* Define and track SLOs/SLIs and error budgets for storage services, using observability and analytics to continuously improve reliability and efficiency.
* Build and maintain runbooks, standard operating procedures, and comprehensive documentation for storage services and automation.
* Analyze capacity and usage trends, perform forecasting, and recommend scaling or optimization strategies to
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.