Company

Technology

HPC/ML Infrastructure Engineer

$175–250k (AI estimate) · San Francisco, California, United States; Tokyo, Japan · Full time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“HPC/ML Infrastructure Engineer. Skills: HPC infrastructure, ML infrastructure, System administration. Lead bringup on AI training cluster. Administer AI training cluster”

What You'll Achieve.

Train the best anime image model

Industry & Context.

Technology

Problems you'll solve

Troubleshoot cluster issues

Eligibility Requirements

Work on-site in San Francisco or Tokyo, Physical hardware is in the Bay Area

What They're Looking For.

Must Have

5+ years of HPC infrastructure experience, Linux sysadmin skills, Familiarity with the modern HPC landscape, Experience with SLURM, Experience with parallel filesystems, Experience with networking, Experience training anime models, Experience with LDAP, Experience triaging dmesg output, Experience with physical computers

Nice to Have

Experience with Slinky on K8s, Experience with Warewulf/MAAS/Ansible, Experience with WEKA/VAST/Ceph, Experience with Tailscale, Experience with Grafana/Prometheus stack, Experience with setting sticky bits

What You'll Do.

Lead bringup on AI training cluster

Administer AI training cluster

Operate AI training cluster

Serve as bridge between researchers and GPU machines

Ensure SLURM jobs are running

Ensure parallel filesystems are serving

Ensure network is transmitting

Ensure anime models are training

Manage cluster provisioning

Manage cluster filesystems

Manage cluster networking

Manage cluster monitoring

Set sticky bits on directories

How You'll Work.

Team & Collaboration

Small, fast-paced teams; Directly help AI researchers

Full Job Description

We’re looking for an experienced HPC infrastructure engineer to lead bringup, administration, and operations on what is probably the largest anime AI training cluster in the world. You’ll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, the network is transmitting, and the anime models are training.

YOU MAY BE A GOOD FIT IF:

YOU LOVE ANIME AND THE ANIME AESTHETIC.

This is probably one of the only jobs in the world where you will get to combine your love of anime and large-scale GPU systems.

YOU’RE FAMILIAR WITH THE MODERN HPC SOFTWARE LANDSCAPE

Once upon a time, our team could install SLURM on a few bare metal nodes and get away with it. Now the landscape has become unbelievably complex, with SLURM deploys through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack. We’re looking for someone with relevant experience up and down the stack (and maybe a papercut or two to show for it!)

AS WELL AS THE TRADITIONAL SYSADMIN LANDSCAPE

Bringing up and managing a cluster still requires good old Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories for misbehaving users and tools.

YOU'RE NOT AFRAID OF PHYSICAL COMPUTERS

We’re building out edge datacenters and our CEO is still personally racking, stacking, and provisioning HGX-based nodes in our living room. Also his VLAN design sucks and he’s bad at fiber routing. Please send help.

AND YOU'RE COMFORTABLE WORKING ON SMALL, FAST-PACED TEAMS.

We currently have a very tiny research team, and you’ll be directly helping AI researchers train the best anime image model in the world. We also believe in


How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
