Company
Technology
HPCInfrastructureSiteReliabilityEngineer
Neural analysis suggests this role is
optimal for Senior candidates.
“HPC Infrastructure Site Reliability Engineer. Skills: HPC infrastructure, AI infrastructure, Site Reliability Engineering, Distributed systems. Operate AI/HPC infrastructure. Improve AI/HPC infrastructure”
What You'll Achieve.
Ensuring reliability; Ensuring performance; Continuous improvement; Meet reliability expectations; Meet scalability expectations; Meet performance expectations; Strengthening observability; Improving reliability; Improving consistency; Improving operational efficiency
Industry & Context.
Advanced operational troubleshooting; Performance evaluation; Troubleshooting complex issues; Root cause analysis
24/7/365 on-call environment, 24x7 production environment
What They're Looking For.
Must Have
Senior Infrastructure SRE experience, Large-scale distributed systems experience, HPC and AI infrastructure expertise, Linux expertise, Distributed systems expertise, Bare metal experience, Networking experience, Storage experience, Virtualization experience, Orchestration experience, NVIDIA GPU ecosystems experience, RDMA networking experience, Performance validation experience, Benchmarking experience, TCP/IP, DNS, DHCP, VLANs, routing, switching knowledge, NVIDIA GPU ecosystems knowledge, Infrastructure automation and scripting experience, Observability principles understanding, Workload schedulers understanding, Parallel storage platform experience, ITIL-aligned environments experience, Troubleshooting skills in high-pressure environments, Incident ownership and resolution track record, Ability to collaborate with Platform SRE teams
Nice to Have
Deep experience with HPC workloads, GPU-accelerated infrastructure at scale experience, InfiniBand, RoCE, or HPC-grade networking fabrics experience, HPC benchmarking, validation, or performance testing experience, Kubernetes-based orchestration environments exposure, LPIC Certifications, ITIL Foundation level qualification
What You'll Do.
Operate AI/HPC infrastructure
Improve AI/HPC infrastructure
Participate in on-call rotation
Support mission-critical systems
Troubleshoot complex issues
Investigate cross-layer issues
Evaluate HPC infrastructure performance
Test HPC infrastructure
Accept HPC infrastructure operationally
Validate GPU infrastructure readiness
Support infrastructure deployment
Reduce operational toil
Improve operational efficiency
Strengthen observability
Refine operational workflows
Eliminate repetitive processes
Shape future infrastructure design
Influence infrastructure engineering decisions
Influence HPC platform evolution
How You'll Work.
Team & Collaboration
Cross-functional organisation; Network engineering; Infrastructure SRE; Platform SRE; Infrastructure tooling engineers; Data centre operations; Across engineering teams; Interface with non-technical stakeholders; Collaborate with Platform SRE teams
Communication Scope
Interface with stakeholders
Full Job Description
About Us We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable. Role Overview We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure. This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, Platform SRE, infrastructure tooling engineers (software) and data centre operations. The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems. This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale. You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential. Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads. A key responsibility is performance eval
Applying for this HPC Infrastructure Site Reliability Engineer role?
Most applicants get filtered before a human reads their resume. See if yours makes the cut.
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.
ANONYMOUS · UNFILTERED
What do employees actually say about this company?
Real rants from real employees. Read before you apply.