Amazon.com Services LLC

Robotics

Systems Development Engineer, Research Compute Platform

$142–192k New York, New York, United States FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Systems Development Engineer, Research Compute Platform at Amazon.com Services LLC. Skills: Research compute platform, ML infrastructure, Robotics software. Own on-prem GPU compute. Provision compute infrastructure.”

What You'll Achieve.

Ensure scientist training runs work

Industry & Context.

Robotics

Problems you'll solve

Troubleshoot training issues; Diagnose hardware faults

What They're Looking For.

Must Have

3+ years Linux administration, 3+ years systems engineering, Configuration management experience, Fleet automation experience, Production containerization experience, Python, Go, or Bash proficiency, NVIDIA GPU infrastructure experience, Job scheduler experience, Hardware diagnosis and replacement

Nice to Have

NVIDIA DCGM fluency, NVLink / PCIe topology knowledge, IOMMU knowledge, Compute mode configuration knowledge, GPU cloud provider experience

What You'll Do.

Own on-prem GPU compute

Provision compute infrastructure

Manage driver and CUDA

Monitor compute infrastructure

Build job scheduling layer

Design on-prem/cloud bridge

Partner with ML scientists

Triage training issues

Advise on training structure

How You'll Work.

Team & Collaboration

Work with researchers; Partner with ML scientists

Full Job Description

We are seeking a Systems Development Engineer to own the research compute platform for Fauna Robotics. You will build and operate the physical and virtual infrastructure that our ML scientists use to train reinforcement learning policies for real robots, from fleet provisioning and job scheduling to cloud burst capacity and environment reproducibility. This role requires both strong systems engineering fundamentals and genuine comfort working alongside researchers. The ideal candidate is as happy diagnosing a GPU thermal fault as they are designing a job scheduler, and treats “the scientist’s training run just works” as the north star for everything they build.

Key job responsibilities

- Own on-prem GPU compute end-to-end: provisioning, imaging, driver and CUDA management, monitoring, failure diagnosis, hardware RMA, and capacity planning
- Build and operate a job scheduling layer (Slurm, Ray, SkyPilot, or equivalent) so scientists submit training runs without managing individual machines
- Design and implement the bridge between on-prem and cloud compute
- Partner directly with ML scientists to triage training issues, profile workloads, identify bottlenecks, and advise on how to structure training for the hardware at hand

About the team

Fauna Robotics, an Amazon company, is building capable, safe, and genuinely delightful robots for everyday life. Our goal is simple: make robots people actually want to live and interact with in everyday human spaces. We believe that future won’t arrive until building for robotics becomes far more accessible. Today, too much effort is spent reinventing the fundamentals. We’re changing that by developing tightly integrated hardware and software systems that make it faster, safer, and more intuitive to create real-world robotic products. Our work spans the full stack: mechanical design, control systems, dynamic modeling, and intelligent software. The focus is not just functionality, but experience. We’re building robots that feel resp

