Nscale

Technology

Principal Site Reliability Engineer - AI Infrastructure Operations

$150k–200k AMER Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Principal Site Reliability Engineer - AI Infrastructure Operations at Nscale. Skills: Site Reliability Engineering, AI Infrastructure, Automation, Systems Design. Own and evolve reliability strategy.”

What You'll Achieve.

Improve availability; Reduce MTTR; Improve cost efficiency; Improve operational scalability

Industry & Context.

Technology

Problems you'll solve

Debugging; Troubleshooting; Root cause analysis

What They're Looking For.

Must Have

10+ years in SRE/Systems/Software Engineering; Expert software engineering skills; Deep expertise in Linux, networking, and distributed systems; Extensive debugging experience; Ability to lead technical initiatives without authority

Nice to Have

Deep hands-on experience with AI/HPC platforms; Experience designing observability systems; Kubernetes at scale experience; Hybrid cloud and bare-metal cloud architectures experience; History of driving reliability, scalability, and operational efficiency improvements

What You'll Do.

Own and evolve reliability strategy

Design and lead development of control-plane systems, automation frameworks, and operational tooling

Define reliability standards, SLO frameworks, and operational best practices

Act as technical escalation point

Guide incident resolution and ensure systemic fixes

Identify reliability risks and drive initiatives to address them

Influence platform design and operational maturity

Mentor senior and mid-level engineers

How You'll Work.

Team & Collaboration

Partner with Engineering, Network Operations, and Fleet Operations leadership; Drive cross-team improvements

Process & Methodology

Technical initiatives

Full Job Description

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.

About The Role

At Nscale, our AI Infrastructure Operations team is responsible for the reliability and scalability of one of the most demanding AI platforms in the industry. We value engineers who think in systems, lead through influence, and raise the bar for operational excellence across the organisation.

We’re looking for a Principal Site Reliability Engineer (SRE) to provide technical leadership across our AI Infrastructure Operations domain. This is a senior, highly impactful role focused on setting reliability strategy, designing foundational systems, and driving cross-team improvements at scale. You will operate as a technical authority for reliability, automation, and operational architecture across Nscale’s GPU, network, and control-plane platforms.

What You'll Be Doing

Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure

Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling

Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams

Acting as a senior technical escalation point dur

SKILL SIGNAL 24 detected · ranked by frequency

Automation ×6

Site Reliability Engineering ×5

Systems Engineering ×3

Software Engineering ×3

Debugging ×3

Troubleshooting ×3

AI Infrastructure ×2

Systems Design ×2

HPC

GPU

Kubernetes

Linux

Networking

Distributed systems

Systems thinking

Operational excellence

Reliability strategy

Operational tooling

Reliability standards

SLO frameworks

Platform design

Operational maturity

SLURM

BEHAVIOURAL

Leadership, Mentoring

Role Details

Experience 5–10 yrs

Level Senior

Work Mode Remote

Category ai-infrastructure-operations

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

ai-infrastructure, gpu-cloud, high-speed-interconnects, infiniband, rdma, workload-schedulers, high-cardinality-environments, high-throughput-environments

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
