Company

Site Reliability Engineer III

India FULL TIME

The Brief

“Site Reliability Engineer III. Skills: HPC enablement, systems design, reliability, performance. Own the compute reliability roadmap. Define onboarding playbooks.”

What You'll Achieve.

HPC job success rate improvement; Reduction in MTTR; Performance improvements; Time-to-onboard new scientific workloads; Improvement in cost-per-compute-hour efficiency; Reduction in operational toil

Industry & Context.

Problems you'll solve

Reduce recurring failures

What They're Looking For.

Must Have

BS+8 / MS+6 / PhD in CS/Engineering/Data disciplines; Demonstrated production delivery experience in HPC at scale; Demonstrated literacy in a relevant scientific domain

Nice to Have

Depth in HPC enablement, Kubernetes, continuous integration/continuous delivery (CI/CD), observability, performance tuning, and security-by-design; Evidence of standard‑setting and cross‑team mentoring experience

What You'll Do.

Own compute reliability roadmap

Define onboarding playbooks

Establish containerization standards

Optimize scheduler configuration

Conduct workload profiling

Lead incident response

Partner with scientific teams

How You'll Work.

Team & Collaboration

Collaborates with GCF6 Group Lead; partners with platform, data, ML, and research teams; Interfaces with governance; Interfaces with vendor/partner teams

Communication Scope

Crisp written and verbal communication

Process & Methodology

Roadmapping, prioritization, outcome-oriented delivery, pillar backlog prioritization

Full Job Description

## **Career Category**

Engineering

## **Job Description**

# Position Overview

The GCF5 Site Reliability Engineer is the senior technical leader for the HPC Enablement pillar. They define and socialize operational standards and patterns, lead multi-team delivery, mentor GCF4 engineers, and translate researcher needs into scalable compute enablement designs. They own pillar-level reliability, performance, cost efficiency, and SLA/SLO outcomes, and influence cross-team engineering quality. This role reports to the GCF7 leader and partners closely with peer GCF5 domain leads across SCIP to ensure cohesive, scalable platform evolution.

# Core Responsibilities

* Own the compute reliability and enablement roadmap within SCIP.
* Define onboarding playbooks and golden paths for HPC workloads.
* Establish containerization and reproducible runtime standards.
* Optimize scheduler configuration and resource allocation policies.
* Conduct workload profiling and performance tuning.
* Define and manage SLOs, reliability standards, and operational guardrails.
* Lead incident response and reduce recurring failures.
* Mentor engineers and elevate reliability practices.
* Partner with scientific teams to translate compute requirements into scalable infrastructure patterns.

# Core Competencies

* Deep expertise in HPC Enablement (HPC) with evidence of standard‑setting and reuse.
* Systems design at scale (HPC); performance, security, and observability fundamentals.
* Product/engineering thinking: road mapping, prioritization, and outcome‑oriented delivery.
* Stakeholder influence across science, engineering, and governance forums; crisp written/verbal communication.

# Core Success Measures

* HPC job success rate improvement.
* Reduction in MTTR for compute incidents.
* Performance improvements relative to baseline.
* Time-to-onboard new scientific workloads.
* Improvement in cost-per-compute-hour efficiency.
* Reduction in operational toil via automation.

# Key Relationships

* Collabo
