Company

Technology

HPC Infrastructure Site Reliability Engineer

£75–110k (AI-estimated) · United Kingdom · Full time · Remote friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“HPC Infrastructure Site Reliability Engineer. Skills: HPC infrastructure, AI infrastructure, Site Reliability Engineering, Distributed systems. Operate AI/HPC infrastructure. Improve AI/HPC infrastructure.”

What You'll Achieve.

Ensuring reliability; Ensuring performance; Continuous improvement; Meet reliability expectations; Meet scalability expectations; Meet performance expectations; Strengthening observability; Improving reliability; Improving consistency; Improving operational efficiency

Industry & Context.

Technology

Problems you'll solve

Advanced operational troubleshooting; Performance evaluation; Troubleshooting complex issues; Root cause analysis

Eligibility Requirements

24/7/365 on-call environment, 24x7 production environment

What They're Looking For.

Must Have

Senior Infrastructure SRE experience; Large-scale distributed systems experience; HPC and AI infrastructure expertise; Linux expertise; Distributed systems expertise; Bare metal experience; Networking experience; Storage experience; Virtualisation experience; Orchestration experience; NVIDIA GPU ecosystems experience and knowledge; RDMA networking experience; Performance validation experience; Benchmarking experience; Knowledge of TCP/IP, DNS, DHCP, VLANs, routing, and switching; Infrastructure automation and scripting experience; Understanding of observability principles; Understanding of workload schedulers; Parallel storage platform experience; Experience in ITIL-aligned environments; Troubleshooting skills in high-pressure environments; Track record of incident ownership and resolution; Ability to collaborate with Platform SRE teams

Nice to Have

Deep experience with HPC workloads; Experience with GPU-accelerated infrastructure at scale; Experience with InfiniBand, RoCE, or HPC-grade networking fabrics; Experience in HPC benchmarking, validation, or performance testing; Exposure to Kubernetes-based orchestration environments; LPIC certifications; ITIL Foundation-level qualification

What You'll Do.

Operate AI/HPC infrastructure

Improve AI/HPC infrastructure

Participate in on-call rotation

Support mission-critical systems

Troubleshoot complex issues

Investigate cross-layer issues

Evaluate HPC infrastructure performance

Test HPC infrastructure

Accept HPC infrastructure operationally

Validate GPU infrastructure readiness

Support infrastructure deployment

Reduce operational toil

Improve operational efficiency

Strengthen observability

Refine operational workflows

Eliminate repetitive processes

Shape future infrastructure design

Influence infrastructure engineering decisions

Influence HPC platform evolution

How You'll Work.

Team & Collaboration

Cross-functional organisation; Network engineering; Infrastructure SRE; Platform SRE; Infrastructure tooling engineers; Data centre operations; Across engineering teams; Interface with non-technical stakeholders; Collaborate with Platform SRE teams

Communication Scope

Interface with stakeholders

Full Job Description

About Us

We’re a fast-growing GPU-as-a-Service provider, delivering scalable, high-performance compute infrastructure purpose-built for AI and HPC workloads. Operating across global data centres, we run mission-critical environments where uptime, throughput, and ultra-low latency are non-negotiable.

Role Overview

We are looking for a senior Infrastructure Site Reliability Engineer with deep experience operating large-scale distributed systems and recent hands-on expertise in high-performance computing (HPC) and AI infrastructure. This is an operations-first SRE role, working in a 24/7/365 on-call environment, responsible for ensuring reliability, performance, and continuous improvement of mission-critical infrastructure.

This role sits within a cross-functional organisation spanning network engineering, infrastructure SRE, Platform SRE, infrastructure tooling engineers (software) and data centre operations. The ideal candidate has progressed through large-scale, globally distributed or multi-site infrastructure environments and has more recently specialised in GPU-accelerated HPC systems. This role provides exposure to the latest high-density AI compute platforms, including next-generation GPU infrastructure at significant scale.

You will bring strong breadth across bare metal, networking, storage, virtualisation, and orchestration, alongside deep HPC experience including NVIDIA GPU ecosystems, RDMA networking (RoCE and InfiniBand), and performance validation and benchmarking. Strong Linux and distributed systems expertise is essential.

Alongside operational ownership, this is a deeply technical Infrastructure SRE role centred on advanced operational troubleshooting and performance evaluation across large-scale HPC systems. You will investigate complex, cross-layer issues spanning GPU compute, networking, storage, and orchestration, building a clear understanding of system behaviour under real production AI and HPC workloads. A key responsibility is performance eval

How to Apply on Ashby

  • Ashby is a fast, modern ATS; most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
