Amazon Data Services, Inc.
Technology
Sr. System Development Engineer, Cloud AI/ML/storage server teams
“Sr. System Development Engineer, Cloud AI/ML/storage server teams at Amazon Data Services, Inc. Skills: Systems Development, Automation, Cloud AI/ML, Storage Servers. Lead development of automation software. Lead development of diagnostic tooling.”
What You'll Achieve.
Zero-touch operations; Detect issues without human intervention; Diagnose issues without human intervention; Resolve issues without human intervention; Reduce reliance on manual triage; Rapid deployment of code changes
Industry & Context.
Problem solving; Debugging; Troubleshooting; Root cause analysis; Diagnostic tooling
What They're Looking For.
Must Have
6+ years software development experience, 6+ years systems design experience, 6+ years operations experience, 6+ years process improvement experience, 6+ years experience designing systems, 6+ years experience architecting systems, 5+ years programming experience, Experience with Linux/Unix, Experience leading software solutions deployment
Nice to Have
Knowledge of engineering practices, Knowledge of coding standards, Knowledge of code reviews, Knowledge of source control management, Knowledge of build processes, Knowledge of testing, Knowledge of certification, Knowledge of livesite operations, Experience with BMC/IPMI, Experience with firmware, Experience with PCIe topology, Experience with NVLink, Experience with hardware diagnostics, Experience working with ODMs, Experience building zero-touch automation, Experience building self-healing automation, Experience working in large-scale datacenter, Experience working in cloud environments, Track record of rapidly coming up to speed, Experience with hardware bring-up, Experience with validation, Experience with fleet-wide deployment, Familiarity with telemetry pipelines, Familiarity with anomaly detection, Familiarity with operational metrics
What You'll Do.
Lead development of automation software
Lead development of diagnostic tooling
Lead development of fleet health infrastructure
Build scalable systems
Build reliable systems
Identify deficiencies
Collaborate with SDEs
Collaborate with SDETs
Collaborate with Hardware Engineers
Collaborate with TPMs
Collaborate with Managers
Collaborate with Principals
Work with Design Partners
Ensure tooling requirements are met
Ensure diagnostics requirements are met
Ensure automation requirements are met
Build automation infrastructure
Own automation infrastructure
Design predictive failure detection systems
Implement predictive failure detection systems
Identify hardware issues
Build automation for zero-touch operations
Detect hardware faults
Diagnose hardware faults
Remediate hardware faults
Develop monitoring tools
Develop alerting systems
Provide real-time visibility
Define fleet health metrics
Track fleet health metrics
Debug system-level issues
Resolve system-level issues
Troubleshoot Linux boot failures
Troubleshoot runtime failures
Perform root cause analysis
Correlate across firmware, kernel, driver, and physical layers
Build diagnostic tooling
Automate root cause identification
Reduce reliance on manual triage
Lead definition of software
Lead development of software
Lead definition of automation
Lead development of automation
Lead definition of enabling tools
Lead development of enabling tools
Design system-level software
Build system-level software
Develop device drivers
Maintain device drivers
Build automation solutions
Work with OS internals
Work with storage subsystems
Work with accelerator software stacks
Work with GPU software stacks
Build CI/CD pipelines
Manage CI/CD pipelines
Deploy CI/CD pipelines
Work across internal HWEng teams
Ensure new server hardware addresses functionality
Work closely with internal customers
Identify potential problems onboarding servers
Engage with design partners
Contribute to server design
Improve diagnosability
Partner with datacenter operations teams
Close the loop between field failures and design
Develop orchestration tooling
Perform hardware integration
Launch server hardware
Maintain server hardware
Work with internal development teams
How You'll Work.
Team & Collaboration
Across multiple teams; Across organizations; With a variety of roles; With ODMs; With Design Partners; Internal HWEng teams; Internal customers; Datacenter operations teams
Process & Methodology
Track progress, Report progress
Full Job Description
Application deadline: May 11, 2026

We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities

Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive