Company
Engineering
HPC Engineer
Neural analysis suggests this role is optimal for Mid+ candidates.
“HPC Engineer. Skills: HPC, GPU clusters, Linux administration, automation. Monitor the health and performance of GPU clusters.”
What You'll Achieve.
Track infrastructure utilization; Track operational metrics
Industry & Context.
Troubleshooting
What They're Looking For.
Must Have
2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations; strong Linux troubleshooting skills; experience with scripting in Python or Bash
Nice to Have
Slurm, GPU infrastructure, AWS, Azure, GCP, Grafana, Prometheus, Datadog, Containers, Kubernetes, AI/ML infrastructure exposure, Research computing environments
What You'll Do.
Monitor health of GPU clusters
Monitor performance of GPU clusters
Monitor availability of GPU clusters
Perform first-level triage
Troubleshoot job failures
Execute operational runbooks
Execute recovery procedures
Validate cluster deployments
Validate cluster upgrades
Validate cluster maintenance
Track infrastructure utilization
Track operational metrics
Develop automation tools
Develop monitoring tools
Contribute to documentation
Contribute to reporting
Full Job Description
## Responsibilities
• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.

## Education
Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.

## Experience
• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.

## Preferred Qualifications
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.