Amazon.com Services LLC

Technology

Sr. Software Development Engineer, MLOPs

$100–227k Bellevue, Washington, United States FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Sr. Software Development Engineer, MLOPs at Amazon.com Services LLC. Skills: MLOps, Distributed systems, Machine Learning, Robotics. Design and build ML training infrastructure.”

Industry & Context.

Technology

Problems you'll solve

Root cause analysis; Troubleshooting

What They're Looking For.

Must Have

5+ years software development experience, 5+ years programming experience, 5+ years system design experience, Experience as a mentor, Experience as a tech lead, Experience leading engineering team

Nice to Have

5+ years full SDLC experience, Bachelor's degree in computer science, Knowledge of Machine Learning fundamentals, Knowledge of LLM fundamentals, Knowledge of transformer architecture, Knowledge of training lifecycles, Knowledge of inference lifecycles, Knowledge of optimization techniques

What You'll Do.

Design ML training infrastructure

Build ML training infrastructure

Operate ML training infrastructure

Implement scalable ML training

Build CI/CD pipelines

Develop tooling for experiment tracking

Develop tooling for hyperparameter optimization

Develop tooling for reproducibility

Architect data pipelines

Ingest demonstration recordings

Operationalize ML models

Establish observability

Drive best practices for GPU management

Drive best practices for cost optimization

Drive best practices for capacity planning

Review training job health

Debug distributed training runs

Ship fixes to the checkpoint recovery system

Optimize imitation learning models

Plan experiment tracking platform

How You'll Work.

Team & Collaboration

Collaborate and pair with research scientists

Full Job Description

We are looking for a Senior Software Development Engineer with deep expertise in machine learning operations to join the Data & Intelligence Foundation (DIF) team within Amazon. You will design, build, and operate the ML training infrastructure that enables robot learning at scale, from distributed GPU training pipelines to experiment tracking, data management, and model deployment.

Our team is building the foundational ML platform that powers autonomous robotics across Amazon’s fulfillment network. You’ll work at the intersection of large-scale distributed systems and cutting-edge ML research, turning novel vision-language-action models into production training workflows.

Key job responsibilities

- Design and implement scalable ML training infrastructure on Kubernetes (EKS) with GPU scheduling and fault-tolerant distributed training
- Build and maintain CI/CD pipelines for ML models, from data ingestion through training, evaluation, and deployment
- Develop tooling for experiment tracking, hyperparameter optimization, and reproducibility
- Architect data pipelines that handle large-scale robotics datasets (telemetry, sensor recordings, demonstrations)
- Collaborate with research scientists to operationalize novel ML models into production
- Establish monitoring, alerting, and observability for training workloads and model performance
- Drive best practices for GPU fleet management, cost optimization, and capacity planning

A day in the life

You’ll spend your mornings reviewing training job health across our GPU cluster, debugging a distributed training run that hit a node failure overnight, and shipping a fix to our checkpoint recovery system. After lunch, you’ll pair with a research scientist to optimize their new imitation learning model for multi-node training, then architect a new data pipeline to ingest demonstration recordings from robot workcells. You’ll close the day reviewing a PR from a teammate and planning the next iteration of our experiment tracking

SKILL SIGNAL 35 detected · ranked by frequency

MLOps ×3

Machine Learning ×3

Distributed training ×3

Experiment tracking ×3

Data management ×3

Model deployment ×3

Hyperparameter optimization ×3

Reproducibility ×3

Robotics datasets ×3

Sensor recordings ×3

Imitation learning ×3

Monitoring ×3

Alerting ×3

Observability ×3

Distributed systems ×2

Robotics ×2

Kubernetes ×2

EKS ×2

GPU scheduling

CI/CD

JAX

PyTorch

vLLM

SGLang

Dynamo

TorchXLA

TensorRT

System design

Reliability

Scaling

ML model deployment

Data pipelines

BEHAVIOURAL

Mentoring

Role Details

Experience 5–10 yrs

Level Senior

Work Mode Onsite

Type FULL TIME

Salary Band 100k-150k

AI-Extracted Insights

Domain Areas

machine-learning, llm-fundamentals, transformer-architecture, robotics, autonomous-robotics, vision-language-action-models, industrial-robotics, ml-infrastructure-platform
