Amazon.com Services LLC

Systems, Quality, Security Engineering, System Development Engineering, Cloud Computing

Sr. System Development Engineer, AL/ML/Storage server team

$90–235k
Cupertino, California, United States
Full time

The Brief

“Sr. System Development Engineer, AL/ML/Storage server team at Amazon.com Services LLC. Skills: Automation, System development, Fleet health, AI/ML. Lead development of automation software. Lead development of diagnostic tooling”

What You'll Achieve.

Detect issues without human intervention; Diagnose issues without human intervention; Resolve issues without human intervention; Reduce reliance on manual triage; Improve manufacturing throughput; Improve yield

Industry & Context.

Systems, Quality, Security Engineering, System Development Engineering, Cloud Computing

Problems you'll solve

Problem solving; Debugging; Troubleshooting; Root cause analysis

What They're Looking For.

Must Have

6+ years software development experience, 6+ years systems design experience, 6+ years operations experience, 6+ years process improvement experience, 6+ years experience designing systems, 6+ years experience architecting systems, 5+ years programming experience, Experience with Linux/Unix, Experience leading design, Experience leading build, Experience leading deployment

Nice to Have

Knowledge of engineering practices, Knowledge of coding standards, Knowledge of code reviews, Knowledge of source control management, Experience building zero-touch automation, Experience building self-healing automation, Experience in large-scale datacenter environments, Experience in cloud environments, Track record of rapidly coming up to speed, Experience with hardware bring-up, Experience with fleet-wide deployment, Familiarity with telemetry pipelines, Familiarity with anomaly detection, Familiarity with operational metrics, Familiarity with manufacturing workflows, Familiarity with yield improvement optimization

What You'll Do.

Lead development of automation software

Lead development of diagnostic tooling

Lead development of fleet health infrastructure

Build scalable systems

Build reliable systems

Proactively identify deficiencies

Solve issues before customer impact

Decompose server problems

Lead delivery through others

Collaborate with roles and organizations

Drive quality into designs

Drive reliability into designs

Work with Design Partners

Ensure tooling requirements are met

Ensure diagnostics requirements are met

Ensure automation requirements are met

Build automation infrastructure

Own automation infrastructure

Design predictive failure detection systems

Implement predictive failure detection systems

Drive toward zero-touch operations

Build automation that detects faults

Build automation that diagnoses faults

Build automation that remediates faults

Develop monitoring tools

Develop alerting systems

Define fleet health metrics

Track fleet health metrics

Debug system-level issues

Resolve system-level issues

Troubleshoot Linux boot failures

Troubleshoot runtime failures

Perform root cause analysis

Correlate across firmware

Build diagnostic tooling

Automate root cause identification

Reduce reliance on manual triage

Improve manufacturing throughput

Lead definition of software

Lead development of software

Lead definition of automation

Lead development of automation

Lead definition of enabling tools

Lead development of enabling tools

Design system-level software

Build system-level software

Develop device drivers

Maintain device drivers

Build automation solutions

Work with OS internals

Work with storage subsystems

Work with accelerator software stacks

Work with GPU software stacks

Build CI/CD pipelines

Manage CI/CD pipelines

Deploy CI/CD pipelines

Work across internal HWEng teams

Ensure new hardware addresses functionality

Work closely with internal customers

Identify potential problems onboarding servers

Engage with design partners

Contribute to server design

Improve diagnosability

Partner with datacenter operations teams

Close the loop between field failures and design

How You'll Work.

Team & Collaboration

Across multiple teams; Across organizations; With ODMs; With Design Partners; Internal HWEng teams; Internal customers; Datacenter operations teams

Full Job Description

Application deadline: Jun 15, 2026

We are seeking an experienced Senior Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure

- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
