Amazon.com Services LLC
Technology
Sr. Software Development Engineer, MLOPs
“Sr. Software Development Engineer, MLOPs at Amazon.com Services LLC. Skills: MLOps, Distributed systems, Machine Learning, Robotics. Design ML training infrastructure. Build ML training infrastructure”
Industry & Context.
Root cause analysis; Troubleshooting
What They're Looking For.
Must Have
5+ years software development experience, 5+ years programming experience, 5+ years system design experience, Experience as a mentor, Experience as a tech lead, Experience leading engineering team
Nice to Have
5+ years full SDLC experience, Bachelor's degree in computer science, Knowledge of Machine Learning fundamentals, Knowledge of LLM fundamentals, Knowledge of transformer architecture, Knowledge of training lifecycles, Knowledge of inference lifecycles, Knowledge of optimization techniques
What You'll Do.
Design ML training infrastructure
Build ML training infrastructure
Operate ML training infrastructure
Implement scalable ML training
Build CI/CD pipelines
Develop tooling for tracking
Develop tooling for optimization
Develop tooling for reproducibility
Architect data pipelines
Ingest demonstration recordings
Operationalize ML models
Establish observability
Drive best practices for GPU management
Drive best practices for cost optimization
Drive best practices for capacity planning
Review training job health
Debug distributed training runs
Ship fixes to recovery system
Optimize imitation learning models
Plan experiment tracking platform
How You'll Work.
Team & Collaboration
Collaborate with research scientists; Pair with research scientist
Full Job Description
We are looking for a Senior Software Development Engineer with deep expertise in machine learning operations to join the Data & Intelligence Foundation (DIF) team within Amazon. You will design, build, and operate the ML training infrastructure that enables robot learning at scale, from distributed GPU training pipelines to experiment tracking, data management, and model deployment.

Our team is building the foundational ML platform that powers autonomous robotics across Amazon’s fulfillment network. You’ll work at the intersection of large-scale distributed systems and cutting-edge ML research, turning novel vision-language-action models into production training workflows.

Key job responsibilities
- Design and implement scalable ML training infrastructure on Kubernetes (EKS) with GPU scheduling and fault-tolerant distributed training
- Build and maintain CI/CD pipelines for ML models, from data ingestion through training, evaluation, and deployment
- Develop tooling for experiment tracking, hyperparameter optimization, and reproducibility
- Architect data pipelines that handle large-scale robotics datasets (telemetry, sensor recordings, demonstrations)
- Collaborate with research scientists to operationalize novel ML models into production
- Establish monitoring, alerting, and observability for training workloads and model performance
- Drive best practices for GPU fleet management, cost optimization, and capacity planning

A day in the life
You’ll spend your mornings reviewing training job health across our GPU cluster, debugging a distributed training run that hit a node failure overnight, and shipping a fix to our checkpoint recovery system. After lunch, you’ll pair with a research scientist to optimize their new imitation learning model for multi-node training, then architect a new data pipeline to ingest demonstration recordings from robot workcells. You’ll close the day reviewing a PR from a teammate and planning the next iteration of our experiment tracking