OpenAI

AI Research and Deployment

Software Engineer, Hardware Health

$250–445k · San Francisco, California, United States · Full Time · Remote Friendly

The Brief

Software Engineer, Hardware Health at OpenAI. Skills: Hardware Health, Observability, Distributed Systems, Infrastructure Engineering, Python, Shell Scripting, SQL, PromQL. Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure. Build and evolve health checks that detect, remediate, and verify failures at scale.

Industry & Context.

AI Research and Deployment

Problems you'll solve

Systems debugging: investigate hardware failures and system-level issues.

What They're Looking For.

Must Have

  • 7+ years of industry experience in software or infrastructure engineering
  • Proficiency with Python and shell scripting
  • Experience building large-scale distributed systems or infrastructure platforms
  • Comfort digging into noisy operational data using SQL, PromQL, or similar tooling
  • Experience building reproducible analyses and operational tooling
  • Systems debugging and operational instincts with an ownership mindset

Nice to Have

  • Experience with low-level hardware systems and Linux tooling (e.g., PCIe, InfiniBand, RoCE, networking, power management, kernel performance tuning, FW/SW debugging)
  • Experience operating or debugging large-scale GPU or accelerator clusters
  • Expertise in network operations, observability, or systems telemetry
  • Experience with automated remediation systems or fleet lifecycle management
  • Experience improving reliability, utilization, or workload uptime in distributed compute environments

What You'll Do.

Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure

Build and evolve health checks that detect, remediate, and verify failures at scale

Ensure critical health checks execute with minimal latency to maximize workload uptime

Investigate hardware failures and system-level issues across large-scale compute environments

Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes

Build automation and tooling that enables global cluster management with minimal manual intervention

Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems

How You'll Work.

Team & Collaboration

Partner with workload, reliability, and provider teams to integrate health signals into training and inference systems

Full Job Description

ABOUT THE TEAM

The Hardware Health and Observability team owns the end-to-end health lifecycle of OpenAI’s global compute fleet. Our mission is to maximize healthy, usable compute across accelerator vendors, generations, cloud providers, and regions through reliable health signals, automated remediation, and scalable operational tooling.

We build the systems that observe, detect, remediate, and verify hardware issues across GPUs, CPUs, networking, and platform infrastructure, enabling frontier model training and inference workloads to run reliably at hyperscale. We are the last line of defense for the success of OAI’s production and research workloads.

ABOUT THE ROLE

On the Hardware Health and Observability team, you’ll build critical infrastructure that keeps OpenAI’s largest compute clusters healthy and operational at scale. Even small numbers of unhealthy systems can impact large-scale training and inference workloads. This team focuses on minimizing downtime, improving fleet efficiency, and ensuring compute resources remain continuously available to researchers and product teams.

Engineers on this team own problems end-to-end, from defining health signals and debugging failures to building automated remediation systems that operate across millions of GPUs globally.

In this role, you will:

- Define and maintain health signals across GPUs, CPUs, networking, and platform infrastructure.
- Build and evolve health checks that detect, remediate, and verify failures at scale.
- Ensure critical health checks execute with minimal latency to maximize workload uptime.
- Investigate hardware failures and system-level issues across large-scale compute environments.
- Own node lifecycle workflows including drain, quarantine, repair, RMA, and return-to-service processes.
- Build automation and tooling that enables global cluster management with minimal manual intervention.
- Partner with workload, reliability, and provider teams to integrate health signals into training and in

How to Apply on Ashby

  • Ashby is a fast, modern ATS; most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically, so answer them completely.
  • Location field affects geo-based screening; use your actual metro area.
