Company
Site Reliability Engineer III
Neural analysis suggests this role is optimal for Senior candidates.
“Site Reliability Engineer III. Skills: HPC Enablement, systems design, reliability, performance. Own compute reliability roadmap. Define onboarding playbooks.”
What You'll Achieve.
HPC job success rate improvement; Reduction in MTTR; Performance improvements; Time-to-onboard new scientific workloads; Improvement in cost-per-compute-hour efficiency; Reduction in operational toil
Industry & Context.
reduce recurring failures
What They're Looking For.
Must Have
BS+8 / MS+6 / PhD in CS/Engineering/Data disciplines; Demonstrated production delivery experience in HPC at scale; Demonstrated literacy in a relevant scientific domain
Nice to Have
Depth in HPC Enablement (HPC), Kubernetes, continuous integration/continuous delivery (CI/CD), observability, performance tuning, security-by-design; Evidence of standard‑setting and cross‑team mentoring experience
What You'll Do.
Own compute reliability roadmap
Define onboarding playbooks
Establish containerization standards
Optimize scheduler configuration
Conduct workload profiling
Lead incident response
Partner with scientific teams
How You'll Work.
Team & Collaboration
Collaborates with GCF6 Group Lead; partners with platform, data, ML, and research teams; Interfaces with governance; Interfaces with vendor/partner teams
Communication Scope
crisp written/verbal communication
Process & Methodology
road mapping, prioritization, outcome-oriented delivery; prioritize pillar backlog
Full Job Description
## **Career Category**

Engineering

## **Job Description**

# Position Overview

The GCF5 Site Reliability Engineer is the senior technical leader for the HPC Enablement pillar. They define and socialize operational standards and patterns, lead multi-team delivery, mentor GCF4 engineers, and translate researcher needs into scalable compute enablement designs. They own pillar-level reliability, performance, cost efficiency, and SLA/SLO outcomes, and influence cross-team engineering quality. This role reports to the GCF7 leader and partners closely with peer GCF5 domain leads across SCIP to ensure cohesive, scalable platform evolution.

# Core Responsibilities

* Own the compute reliability and enablement roadmap within SCIP.
* Define onboarding playbooks and golden paths for HPC workloads.
* Establish containerization and reproducible runtime standards.
* Optimize scheduler configuration and resource allocation policies.
* Conduct workload profiling and performance tuning.
* Define and manage SLOs, reliability standards, and operational guardrails.
* Lead incident response and reduce recurring failures.
* Mentor engineers and elevate reliability practices.
* Partner with scientific teams to translate compute requirements into scalable infrastructure patterns.

# Core Competencies

* Deep expertise in HPC Enablement (HPC) with evidence of standard‑setting and reuse.
* Systems design at scale (HPC); performance, security, and observability fundamentals.
* Product/engineering thinking: road mapping, prioritization, and outcome‑oriented delivery.
* Stakeholder influence across science, engineering, and governance forums; crisp written/verbal communication.

# Core Success Measures

* HPC job success rate improvement.
* Reduction in MTTR for compute incidents.
* Performance improvements relative to baseline.
* Time-to-onboard new scientific workloads.
* Improvement in cost-per-compute-hour efficiency.
* Reduction in operational toil via automation.

# Key Relationships

* Collabo
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.