Infrastructure Tooling & Observability Engineer
Full Job Description
ABOUT US

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

ROLE OVERVIEW

We are seeking an Infrastructure Tooling & Observability Engineer to act as a key engineering force within our global Infrastructure Operations organisation. Working closely with our SRE teams, you will translate high-level reliability objectives into scalable, production-ready systems that directly improve the resilience, efficiency, and performance of our global infrastructure.

This role goes beyond traditional monitoring. You will help design and build the internal control plane that enables operations at scale across a rapidly growing GPU fleet. Your work will focus on transforming complex, high-volume telemetry (logs, metrics, and events across HPC, networking, and platform layers) into actionable insight that drives operational excellence and proactive reliability.

A core part of your responsibility will be developing intelligent observability and automation systems, including advanced alerting strategies, anomaly detection, and AI-driven tooling that reduces L1/L2 escalations and removes operational toil. You will also contribute to Continual Service Improvement (CSI) initiatives by building frameworks for reliability measurement, automated remediation, and system health evaluation.

In addition, you will play a central role in turning SRE reliability initiatives into scalable engineering solutions. This includes designing and delivering capabilities such as inventory management systems, performance testing frameworks, and automated performance result collection. You will also help eliminate manual workflows involved in onboarding new regions, facilities, and clusters, embedding automation and standardisation into every stage of infrastructure