Amazon Web Services, Inc.
Technology
System Development Engineer, Cloud AI/ML/storage server teams
Neural analysis suggests this role is optimal for Senior candidates.
“System Development Engineer, Cloud AI/ML/storage server teams at Amazon Web Services, Inc. Skills: System Development, Cloud AI/ML, Storage Servers, Automation. Lead development of automation software. Lead development of diagnostic tooling.”
What You'll Achieve.
Zero-touch operations; Reduce reliance on manual triage
Industry & Context.
Debugging; Troubleshooting; Root cause analysis; Problem decomposition
What They're Looking For.
Must Have
2+ years professional software development, 1+ year system design/architecture, 3+ years systems engineering experience, Knowledge of systems engineering fundamentals, Experience programming at least one modern language
Nice to Have
PowerShell experience preferred, Python experience preferred, Ruby experience preferred, Java experience preferred, Experience working in Agile environment, Familiarity building predictive failure detection, Familiarity with Linux kernel driver development, Familiarity with storage platforms, Familiarity with compute platforms, Familiarity with GPU/accelerator platforms, Familiarity with NVIDIA, Familiarity with BMC/IPMI, Familiarity with firmware, Familiarity with PCIe topology, Familiarity with NVLink, Familiarity with hardware diagnostics, Familiarity working with ODMs, Familiarity building zero-touch automation, Familiarity working in large-scale datacenter, Familiarity working in cloud environments, Track record of rapidly coming up to speed, Familiarity with hardware bring-up, Familiarity with validation, Familiarity with fleet-wide deployment, Familiarity with telemetry pipelines, Familiarity with anomaly detection, Familiarity with operational metrics
What You'll Do.
Lead development of automation software
Lead development of diagnostic tooling
Lead development of fleet health infrastructure
Build scalable systems
Build reliable systems
Keep compute fleet healthy
Keep storage fleet healthy
Detect issues without human intervention
Diagnose issues without human intervention
Resolve issues without human intervention
Solve complex architectural problems
Identify system deficiencies
Solve issues before customer impact
Decompose server problems
Lead delivery of tasks
Drive quality into designs
Drive reliability into designs
Meet tooling requirements
Meet diagnostics requirements
Meet automation requirements
Build automation infrastructure
Design predictive failure detection systems
Implement predictive failure detection systems
Drive toward zero-touch operations
Build automation for fault detection
Build automation for fault diagnosis
Build automation for fault remediation
Develop monitoring tools
Develop alerting systems
Provide real-time visibility
Define fleet health metrics
Track fleet health metrics
Debug complex system-level issues
Resolve complex system-level issues
Troubleshoot Linux boot failures
Troubleshoot Linux runtime failures
Perform root cause analysis
Correlate across firmware
Correlate across kernel
Correlate across driver
Correlate across physical layer
Isolate hardware faults
Build diagnostic tooling
Automate root cause identification
Reduce reliance on manual triage
Lead definition of software
Lead development of software
Lead definition of automation
Lead development of automation
Lead definition of enabling tools
Lead development of enabling tools
Design scalable system-level software
Build scalable system-level software
Develop device drivers
Maintain device drivers
Build automation solutions
Work with OS internals
Work with storage subsystems
Work with accelerator software stacks
Work with GPU software stacks
Build CI/CD pipelines
Manage CI/CD pipelines
Deploy CI/CD pipelines
Ensure new server hardware addresses functionality
Identify potential problems when onboarding servers
Engage with ODMs on requirements
Engage with design partners on requirements
Contribute to server design
Improve server robustness
Improve server testability
Improve server diagnosability
Improve server reliability
Partner with datacenter operations teams
Close loop between field failures and design improvements
How You'll Work.
Team & Collaboration
Work across multiple teams; Work across organizations; Collaborate with SDEs; Collaborate with SDETs; Collaborate with Engineers; Collaborate with TPMs; Collaborate with Managers; Collaborate with Principals; Work with ODMs; Work with Design Partners; Work with internal HWEng teams; Work with internal customers; Partner with datacenter operations
Process & Methodology
Agile, Scrum
Full Job Description
We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.
You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.
You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).
Key job responsibilities
Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations: bui