Nscale
Technology
Principal Site Reliability Engineer - AI Infrastructure Operations
Neural analysis suggests this role is optimal for Senior candidates.
“Principal Site Reliability Engineer - AI Infrastructure Operations at Nscale. Skills: Site Reliability Engineering, AI Infrastructure, Automation, Systems Design. Own reliability strategy. Evolve reliability strategy”
What You'll Achieve.
Improve availability; Reduce MTTR; Improve cost efficiency; Improve operational scalability
Industry & Context.
Debugging; Troubleshooting; Root cause analysis
What They're Looking For.
Must Have
10+ years in SRE, systems, or software engineering; Expert software engineering skills; Deep expertise in Linux; Deep expertise in networking; Deep expertise in distributed systems; Extensive debugging experience; Ability to lead technical initiatives without formal authority
Nice to Have
Deep hands-on experience with AI/HPC platforms; Experience designing observability systems; Experience with Kubernetes at scale; Experience with hybrid cloud architectures; Experience with bare-metal cloud architectures; History of driving reliability improvements; History of driving scalability improvements; History of driving operational efficiency improvements
What You'll Do.
Own reliability strategy
Evolve reliability strategy
Design control-plane systems
Lead control-plane systems development
Design automation frameworks
Lead automation frameworks development
Design operational tooling
Lead operational tooling development
Define reliability standards
Define SLO frameworks
Define operational best practices
Act as technical escalation point
Guide incident resolution
Ensure systemic fixes
Identify reliability risks
Drive initiatives to address risks
Influence platform design
Influence operational maturity
Mentor senior engineers
Mentor mid-level engineers
How You'll Work.
Team & Collaboration
Partner with Engineering leadership; Partner with Network Operations leadership; Partner with Fleet Operations leadership; Cross-team improvements
Process & Methodology
Technical initiatives
Full Job Description
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
About The Role
At Nscale, our AI Infrastructure Operations team is responsible for the reliability and scalability of one of the most demanding AI platforms in the industry. We value engineers who think in systems, lead through influence, and raise the bar for operational excellence across the organisation.
We’re looking for a Principal Site Reliability Engineer (SRE) to provide technical leadership across our AI Infrastructure Operations domain. This is a senior, highly impactful role focused on setting reliability strategy, designing foundational systems, and driving cross-team improvements at scale. You will operate as a technical authority for reliability, automation, and operational architecture across Nscale’s GPU, network, and control-plane platforms.
What You'll Be Doing
- Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point dur
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.