Nscale
Technology
Principal Observability Platform Engineer
“Principal Observability Platform Engineer at Nscale. Skills: Observability platform, Platform engineering, Infrastructure engineering, AI/ML infrastructure. Own technical strategy for observability. Own architecture for observability”
What You'll Achieve.
Ensure platform scales ahead of business
Industry & Context.
Root cause analysis; Troubleshooting; Diagnose failures
What They're Looking For.
Must Have
8+ years in SRE, 8+ years in infrastructure engineering, 8+ years in platform engineering, 8+ years in observability-focused roles, Operated observability infrastructure at scale, Proficient in Python, Proficient in Go, Comfortable owning complex systems end to end, Infrastructure-as-Code is default, Influence without authority
Nice to Have
Familiarity with GPU infrastructure, Familiarity with HPC environments, Slurm familiarity, Experience with high-volume streaming pipelines, Background in AI/ML infrastructure observability, Prior experience defining observability strategy
What You'll Do.
Own technical strategy for observability
Own architecture for observability
Drive platform decisions
Identify systemic gaps
Design platforms that make failure visible
Design platforms that make failure fast to diagnose
Partner with SRE teams
Partner with infrastructure teams
Partner with AI/ML teams
Embed observability natively
Define standards for engineers
Define patterns for engineers
Mentor observability team
Technically grow observability team
Lead incident postmortems
Use postmortems for platform improvements
How You'll Work.
Team & Collaboration
Partner with SRE; Partner with infrastructure; Partner with AI/ML teams
Communication Scope
Explain tradeoffs clearly
Process & Methodology
Technical strategy, Architecture roadmap
Full Job Description
Principal Observability Platform Engineer – Nscale

About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you’ll build trust through openness and transparency while contributing to the technology that powers the future.

About the Role
As a Principal/Staff Observability Platform Engineer, you'll own the technical direction of Nscale's observability platform: the systems that give us deep visibility into GPU clusters, AI workloads, and the infrastructure running them. You treat observability as a product and a discipline, not a tooling exercise. You'll set the architectural roadmap, raise the engineering bar across teams, and ensure our platform scales ahead of the business, not behind it.

You understand that complexity is a cost. Solutions that require constant babysitting don't scale, and neither does operational burden. The platforms you build should be simple to operate, easy to understand, and self-evidently correct when something goes wrong.

This isn't a "maintain and operate" role. It's a "define, build, and lead" role.

What You'll Do
- Own the technical strategy and architecture for observability across metrics, logs, traces, and alerting at scale.
- Drive platform decisions that have multi-year impact: tooling, data models, ingestion patterns, retention, cardinality management.
- Identify systemic gaps before they become incidents; design platforms that make failure visible and fast to diagnose.
- Partner with SRE, infrastructure, and AI/ML teams to embed observability natively into how Nscale bui
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.