Nscale

Technology

Staff Observability Platform Engineer

$215–305k (AI est.) · United States

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Staff candidates.

The Brief

“Staff Observability Platform Engineer at Nscale. Skills: Observability platforms, GPU infrastructure, AI infrastructure. Design and build observability platforms.”

Industry & Context.

Technology

Problems you'll solve

Solving difficult infrastructure problems

What They're Looking For.

Must Have

6+ years experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines

Experience building and operating observability platforms in cloud-native, distributed environments

Deep hands-on experience with Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms

Software engineering skills with proficiency in Go, Python, or equivalent languages

Experience operating and troubleshooting Kubernetes-based platforms at scale

Understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices

Experience designing systems with scalability, reliability, performance, and operational simplicity in mind

Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent

Ability to lead technical initiatives and influence engineering decisions across multiple teams

Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions

Nice to Have

Operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments

Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms

Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit

Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems

Experience mentoring engineers and helping grow technical capability across teams

What You'll Do.

Design observability platforms

Build observability platforms

Evolve observability platforms

Lead implementation of scalable observability solutions

Embed observability throughout lifecycle

Drive improvements in monitoring coverage

Drive improvements in alert quality

Drive improvements in service health visibility

Drive improvements in incident response effectiveness

Develop standards for observability adoption

Develop frameworks for observability adoption

Develop reusable patterns for observability adoption

Identify reliability risks

Identify operational blind spots

Address risks proactively

Contribute to architectural decisions

Lead technical initiatives

Lead projects to improve platform scalability

Lead projects to improve platform reliability

Lead projects to improve platform operational efficiency

Provide technical guidance

Participate in incident investigations

Participate in postmortems

Translate operational learnings into platform improvements

Evaluate new observability technologies

Evaluate new observability practices

How You'll Work.

Team & Collaboration

Partnering across teams; Cross-functional teams; Engineering teams

Communication Scope

Explain technical tradeoffs; Align stakeholders

Process & Methodology

Technical initiatives

Full Job Description

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency while contributing to the technology that powers the future.

About the Role

As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them. You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence.

You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization.

This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.

What You'll Do

Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.

Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.

Par

SKILL SIGNAL 32 detected · ranked by frequency

Telemetry pipelines ×3

Cardinality management ×3

Performance optimization ×3

Incident response ×3

Observability platforms ×2

GPU infrastructure ×2

AI infrastructure ×2

Prometheus ×2

Thanos ×2

VictoriaMetrics ×2

Grafana ×2

Loki ×2

Tempo ×2

OpenTelemetry ×2

ClickHouse ×2

Elastic ×2

Terraform ×2

Ansible ×2

Kafka ×2

Vector ×2

Fluent Bit ×2

Python

Kubernetes

Platform engineering

Infrastructure engineering

Observability engineering

System design

Architectural decisions

Engineering best practices

Operational efficiency

Developer experience

BEHAVIOURAL

Leadership

Mentoring

Role Details

Experience 8–15 yrs

Level Staff

Category ai-infrastructure-operations

Salary Band 200k+

AI-Extracted Insights

Domain Areas

cloud-native-environments; distributed-systems; gpu-clusters; ai-workloads; ai-ml; hpc; large-scale-compute; model-training

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
