HPC Engineer

$150–300k · Sunnyvale, California, United States · Full time

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“HPC Engineer. Skills: HPC, GPU clusters, Linux administration, automation. Monitor the health, performance, and availability of GPU clusters.”

What You'll Achieve.

Track infrastructure utilization and operational metrics

Industry & Context.

Engineering

Problems you'll solve

Troubleshooting

What They're Looking For.

Must Have

2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations; strong Linux troubleshooting skills; experience with scripting in Python or Bash

Nice to Have

Slurm; GPU infrastructure; AWS, Azure, or GCP; Grafana, Prometheus, Datadog, or similar tools; containers and Kubernetes; AI/ML infrastructure exposure; research computing environments

What You'll Do.

Monitor health of GPU clusters

Monitor performance of GPU clusters

Monitor availability of GPU clusters

Perform first-level triage

Troubleshoot job failures

Execute operational runbooks

Execute recovery procedures

Validate cluster deployments

Validate cluster upgrades

Validate cluster maintenance

Track infrastructure utilization

Track operational metrics

Develop automation tools

Develop monitoring tools

Contribute to documentation

Contribute to reporting

Full Job Description

## Responsibilities

• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.

## Education

Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.

## Experience

• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.

## Preferred Qualifications

• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.

How to Apply on Lever

Lever uses a streamlined one-page form — apply in under 5 minutes.
LinkedIn import works well; review parsed data before submitting.
The cover letter field is optional but visible to reviewers — use it to differentiate.
Referral codes from employees can significantly boost visibility of your application.
