Nscale
AI Infrastructure
Principal Back-End Network Engineer - AI Infrastructure Operations
“Principal Back-End Network Engineer - AI Infrastructure Operations at Nscale. Skills: InfiniBand, RoCE fabrics, network operations, AI infrastructure. Own the technical direction and operational strategy for AI interconnect networks.”
What You'll Achieve.
Improve fabric reliability; Improve performance predictability; Improve operational maturity; Reduce incident count
Industry & Context.
Troubleshoot network incidents; Resolve cross-layer issues
What They're Looking For.
Must Have
10+ years of network engineering experience; Deep focus on HPC, AI, or hyperscale data centre networking; Expert-level operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics; Deep understanding of RDMA internals; Expertise in modern data centre routing and control planes; Proven ability to lead complex technical initiatives across teams without direct authority; Systems-level mindset
Nice to Have
Extensive experience with NVIDIA/Mellanox networking platforms; Deep familiarity with distributed training frameworks; Deep familiarity with GPU communication patterns; Experience designing network observability systems; Prior experience influencing platform or infrastructure strategy at scale
What You'll Do.
Own technical direction for AI interconnect networks
Own operational strategy for AI interconnect networks
Design InfiniBand fabric architectures
Review InfiniBand fabric architectures
Evolve InfiniBand fabric architectures
Act as senior escalation point for network incidents
Guide deep technical investigations
Drive initiatives to improve fabric reliability
Drive initiatives to improve fabric performance predictability
Drive initiatives to improve operational maturity
Define standards for hardware configuration
Define standards for congestion control
Define standards for routing
Define standards for firmware lifecycle management
Define standards for change safety
Partner with SRE teams
Partner with Compute Platform teams
Partner with Network Architecture teams
Influence end-to-end system design
Mentor senior network engineers
Mentor mid-level network engineers
Raise bar for operational rigor
Raise bar for technical excellence
How You'll Work.
Team & Collaboration
Cross-team initiatives; Partner with SRE; Partner with Compute Platform; Partner with Network Architecture
Full Job Description
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale enables AI-focused companies to achieve superior results by reducing the complexity of AI development. Our GPU cloud bolsters technical capabilities and directly supports strategic business outcomes, including cost management, rapid innovation, and environmental responsibility.
We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency, where everyone is inspired to do their best work. If you join our team, you’ll be contributing to building the technology that powers the future.
About The Role
The Network Operations and Engineering teams at Nscale operate some of the most demanding networking environments in the industry, supporting tightly coupled GPU clusters where network performance directly impacts customer outcomes.
We’re looking for a Principal Network Engineer – AI Infrastructure to provide technical leadership across Nscale’s high-speed networking domain. This role is focused on owning the reliability, scalability, and long-term evolution of our InfiniBand and RDMA-based network fabrics. You will operate as a technical authority, influencing architecture, standards, and operational practices across teams while tackling the most complex network challenges in the platform.
What You'll Be Doing
Owning the technical direction and operational strategy for Nscale’s AI interconnect networks
Designing, reviewing, and evolving large-scale InfiniBand and RoCE fabric architectures to support future growth and workload demands
Acting as the senior escalation point for the most complex network incidents, guiding deep technical investigations and systemic fixes
Driving cross-team initiatives to improve fabric reliability, p
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.