Amazon Web Services, Inc.

Technology

System Development Engineer, Cloud AI/ML/storage server teams

$149–201k · Cupertino, California, United States · Full time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“System Development Engineer, Cloud AI/ML/storage server teams, at Amazon Web Services, Inc. Skills: System Development, Cloud AI/ML, Storage Servers, Automation. Lead development of automation software and diagnostic tooling.”

What You'll Achieve.

Zero-touch operations; Reduce reliance on manual triage

Industry & Context.

Technology

Problems you'll solve

Debugging; Troubleshooting; Root cause analysis; Problem decomposition

What They're Looking For.

Must Have

2+ years of professional software development experience, 1+ year of system design/architecture experience, 3+ years of systems engineering experience, Knowledge of systems engineering fundamentals, Experience programming in at least one modern language

Nice to Have

PowerShell experience preferred, Python experience preferred, Ruby experience preferred, Java experience preferred, Experience working in Agile environment, Familiarity building predictive failure detection, Familiarity with Linux kernel driver development, Familiarity with storage platforms, Familiarity with compute platforms, Familiarity with GPU/accelerator platforms, Familiarity with NVIDIA, Familiarity with BMC/IPMI, Familiarity with firmware, Familiarity with PCIe topology, Familiarity with NVLink, Familiarity with hardware diagnostics, Familiarity working with ODMs, Familiarity building zero-touch automation, Familiarity working in large-scale datacenter, Familiarity working in cloud environments, Track record of rapidly coming up to speed, Familiarity with hardware bring-up, Familiarity with validation, Familiarity with fleet-wide deployment, Familiarity with telemetry pipelines, Familiarity with anomaly detection, Familiarity with operational metrics

What You'll Do.

Lead development of automation software

Lead development of diagnostic tooling

Lead development of fleet health infrastructure

Build scalable systems

Build reliable systems

Keep compute fleet healthy

Keep storage fleet healthy

Detect issues without human intervention

Diagnose issues without human intervention

Resolve issues without human intervention

Solve complex architectural problems

Identify system deficiencies

Solve issues before customer impact

Decompose server problems

Lead delivery of tasks

Drive quality into designs

Drive reliability into designs

Meet tooling requirements

Meet diagnostics requirements

Meet automation requirements

Build automation infrastructure

Design predictive failure detection systems

Implement predictive failure detection systems

Drive toward zero-touch operations

Build automation for fault detection

Build automation for fault diagnosis

Build automation for fault remediation

Develop monitoring tools

Develop alerting systems

Provide real-time visibility

Define fleet health metrics

Track fleet health metrics

Debug complex system-level issues

Resolve complex system-level issues

Troubleshoot Linux boot failures

Troubleshoot Linux runtime failures

Perform root cause analysis

Correlate across firmware

Correlate across kernel

Correlate across driver

Correlate across physical layer

Isolate hardware faults

Build diagnostic tooling

Automate root cause identification

Reduce reliance on manual triage

Lead definition of software

Lead development of software

Lead definition of automation

Lead development of automation

Lead definition of enabling tools

Lead development of enabling tools

Design scalable system-level software

Build scalable system-level software

Develop device drivers

Maintain device drivers

Build automation solutions

Work with OS internals

Work with storage subsystems

Work with accelerator software stacks

Work with GPU software stacks

Build CI/CD pipelines

Manage CI/CD pipelines

Deploy CI/CD pipelines

Ensure new server hardware addresses functionality

Identify potential problems when onboarding servers

Engage with ODMs on requirements

Engage with design partners on requirements

Contribute to server design

Improve server robustness

Improve server testability

Improve server diagnosability

Improve server reliability

Partner with datacenter operations teams

Close loop between field failures and design improvements

How You'll Work.

Team & Collaboration

Work across multiple teams; Work across organizations; Collaborate with SDEs; Collaborate with SDETs; Collaborate with Engineers; Collaborate with TPMs; Collaborate with Managers; Collaborate with Principals; Work with ODMs; Work with Design Partners; Work with internal HWEng teams; Work with internal customers; Partner with datacenter operations

Process & Methodology

Agile, Scrum

Full Job Description

We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure

- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations: bui
