Nscale

Technology

Principal Systems Engineer

$175–225k · Seattle, Washington, United States

The Brief

“Principal Systems Engineer at Nscale. Skills: GPU Supercluster Bringup, Network Fabric Architecture, Performance Optimization, Deployment Strategy. Define technical standards. Lead GPU cluster deployments.”

What You'll Achieve.

Superclusters brought online quickly, predictably, and at peak performance; deployment processes that scale; infrastructure that becomes a competitive advantage; a defined technical blueprint

Industry & Context.

Technology

Problems you'll solve

Performance debugging; Root cause analysis

What They're Looking For.

Must Have

10+ years of experience, large-scale infrastructure experience, HPC environment experience, experience bringing up large GPU clusters, high-speed networking expertise, understanding of server architecture, debugging performance issues, automation experience, systems-level thinking

Nice to Have

Experience scaling AI training clusters, liquid cooling experience, ultra-high-density deployment experience, experience defining infrastructure standards

What You'll Do.

Define technical standards

Lead GPU cluster deployments

Architect network fabrics

Establish acceptance criteria and validation frameworks

Tune and validate NCCL and collective operations

Identify and eliminate performance bottlenecks

Drive congestion control and fabric optimization

Define performance benchmarking

Design repeatable deployment models

Build automation frameworks

Establish deployment SLAs, quality gates, and operational readiness standards

Reduce time-to-capacity

Serve as an escalation point

Mentor senior engineers

Shape infrastructure best practices

Influence hardware selection, rack topology, and data center design

Partner on infrastructure strategy

How You'll Work.

Team & Collaboration

Cross-functional leadership; Partner with executive leadership

Full Job Description

Principal Systems Engineer – GPU Supercluster Bringup

About Us

We are building AI infrastructure for frontier-scale workloads. Our platform is designed for high-density, high-performance GPU clusters that push the limits of power, networking, and distributed compute. As a startup, we move fast, operate with ownership, and expect technical leaders to define standards, not just follow them.

The Role

We are hiring a Principal Deployment Engineer to architect and lead the bringup of large-scale GPU clusters (hundreds to thousands of GPUs). This is a technical leadership role responsible for defining how we deploy, validate, and scale AI superclusters across sites.

You will own the full lifecycle of deployment, from rack design and fabric architecture to cluster validation frameworks and production readiness standards. You will set the bar for performance, reliability, and operational excellence. This role combines deep hands-on expertise with system-level thinking and cross-functional leadership.

What You’ll Do

End-to-End Supercluster Bringup Ownership

Define the technical standards for node, rack, and full-cluster bringup.
Lead large-scale GPU cluster deployments (multi-rack, multi-pod environments).
Architect high-performance network fabrics (IB, RoCE, Ethernet) optimized for AI workloads.
Establish cluster-level acceptance criteria and validation frameworks.

Performance & Fabric Architecture

Tune and validate NCCL, RDMA, GPUDirect, and collective operations at scale.
Identify and eliminate performance bottlenecks across hardware, topology, and firmware layers.
Drive congestion control and fabric optimization strategies.
Define performance benchmarking methodology for AI training workloads.

Deployment Strategy & Scalability

Design repeatable deployment models for multi-site expansion.
Build automation frameworks for provisioning and cluster validation.
Establish deployment SLAs, quality gates, and operational readiness standards.
Reduce time-to-capacity while increasing

SKILL SIGNAL 23 detected · ranked by frequency

Network Fabric Architecture ×5

Deployment Strategy ×3

GPU cluster bringup ×3

Cluster validation ×3

Performance benchmarking ×3

Deployment automation ×3

Systems debugging ×3

GPU Supercluster Bringup ×2

Performance Optimization ×2

InfiniBand

RoCE

Ethernet

PCIe

NUMA

NCCL

RDMA

GPUDirect

System design

Performance tuning

Congestion control

Benchmarking methodology

Scalability design

Infrastructure scaling

BEHAVIOURAL

Leadership, Mentoring

Role Details

Experience 10+ yrs

Level Principal

Work Mode Onsite

Category ai-infrastructure-operations

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

ai-infrastructure, frontier-scale-workloads, gpu-clusters, distributed-compute, high-density-deployments, high-performance-networking, ai-workloads, ai-training-workloads

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.