Amazon.com Services LLC
Robotics
Systems Development Engineer, Research Compute Platform
Neural analysis suggests this role is optimal for Mid+ candidates.
“Systems Development Engineer, Research Compute Platform at Amazon.com Services LLC. Skills: Research compute platform, ML infrastructure, Robotics software. Own on-prem GPU compute. Provision compute infrastructure”
What You'll Achieve.
Ensure scientist training runs work
Industry & Context.
Troubleshoot training issues; Diagnose hardware faults
What They're Looking For.
Must Have
3+ years Linux administration, 3+ years systems engineering, Configuration management experience, Fleet automation experience, Production containerization experience, Python, Go, or Bash proficiency, NVIDIA GPU infrastructure experience, Job scheduler experience, Hardware diagnosis and replacement
Nice to Have
NVIDIA DCGM fluency, NVLink / PCIe topology knowledge, IOMMU knowledge, Compute mode configuration knowledge, GPU cloud provider experience
What You'll Do.
Own on-prem GPU compute
Provision compute infrastructure
Manage driver and CUDA
Monitor compute infrastructure
Build job scheduling layer
Design on-prem/cloud bridge
Partner with ML scientists
Triage training issues
Advise on training structure
How You'll Work.
Team & Collaboration
Work with researchers; Partner with ML scientists
Full Job Description
We are seeking a Systems Development Engineer to own the research compute platform for Fauna Robotics. You will build and operate the physical and virtual infrastructure that our ML scientists use to train reinforcement learning policies for real robots, from fleet provisioning and job scheduling to cloud burst capacity and environment reproducibility. This role requires both strong systems engineering fundamentals and genuine comfort working alongside researchers. The ideal candidate is as happy diagnosing a GPU thermal fault as they are designing a job scheduler, and treats “the scientist’s training run just works” as the north star for everything they build.

Key job responsibilities
- Own on-prem GPU compute end-to-end: provisioning, imaging, driver and CUDA management, monitoring, failure diagnosis, hardware RMA, and capacity planning
- Build and operate a job scheduling layer (Slurm, Ray, SkyPilot, or equivalent) so scientists submit training runs without managing individual machines
- Design and implement the bridge between on-prem and cloud compute
- Partner directly with ML scientists to triage training issues, profile workloads, identify bottlenecks, and advise on how to structure training for the hardware at hand

About the team
Fauna Robotics, an Amazon company, is building capable, safe, and genuinely delightful robots for everyday life. Our goal is simple: make robots people actually want to live and interact with in everyday human spaces. We believe that future won’t arrive until building for robotics becomes far more accessible. Today, too much effort is spent reinventing the fundamentals. We’re changing that by developing tightly integrated hardware and software systems that make it faster, safer, and more intuitive to create real-world robotic products. Our work spans the full stack: mechanical design, control systems, dynamic modeling, and intelligent software. The focus is not just functionality, but experience. We’re building robots that feel resp