Nscale
Technology
Staff Observability Platform Engineer
Industry & Context.
Solving difficult infrastructure problems
What They're Looking For.
Must Have
6+ years experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines
Experience building and operating observability platforms in cloud-native, distributed environments
Deep hands-on experience with Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms
Software engineering skills with proficiency in Go, Python, or equivalent languages
Experience operating and troubleshooting Kubernetes-based platforms at scale
Understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices
Experience designing systems with scalability, reliability, performance, and operational simplicity in mind
Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent
Ability to lead technical initiatives and influence engineering decisions across multiple teams
Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions
Nice to Have
Operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments
Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms
Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit
Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems
Experience mentoring engineers and helping grow technical capability across teams
What You'll Do.
Design observability platforms
Build observability platforms
Evolve observability platforms
Lead implementation of scalable observability solutions
Embed observability throughout lifecycle
Drive improvements in monitoring coverage
Drive improvements in alert quality
Drive improvements in service health visibility
Drive improvements in incident response effectiveness
Develop standards for observability adoption
Develop frameworks for observability adoption
Develop reusable patterns for observability adoption
Identify reliability risks
Identify operational blind spots
Address risks proactively
Contribute to architectural decisions
Lead technical initiatives
Lead projects to improve platform scalability
Lead projects to improve platform reliability
Lead projects to improve platform operational efficiency
Provide technical guidance
Participate in incident investigations
Participate in postmortems
Translate operational learnings into platform improvements
Evaluate new observability technologies
Evaluate new observability practices
How You'll Work.
Team & Collaboration
Partnering across teams; Cross-functional teams; Engineering teams
Communication Scope
Explain technical tradeoffs; Align stakeholders
Process & Methodology
Technical initiatives
Full Job Description
About Nscale
Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency while contributing to the technology that powers the future.
About the Role
As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them. You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence. You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization. This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.
What You'll Do
Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.
Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.
Par
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.