Annapurna Labs

Technology

Software Engineer - AI/ML, AWS Neuron Distributed Training

$100–185k Cupertino, California, United States FULL TIME
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Software Engineer - AI/ML, AWS Neuron Distributed Training at Annapurna Labs. Skills: Distributed training, AI/ML, AWS Neuron, Performance optimization. Contribute to design and implementation of distributed training. Extend and optimize distributed training frameworks”

What You'll Achieve.

Deliver cost-effective, performant machine learning solutions

Industry & Context.

Technology
Problems you'll solve

Performance optimization; Troubleshooting

What They're Looking For.

Must Have

Bachelor's degree or above in computer science, computer engineering, or related field

1+ years of programming experience with at least one software programming language

Experience with software development practices

Experience with machine learning concepts

Experience with at least one ML framework

Nice to Have

Master's degree or above in computer science or equivalent

Experience with large-scale distributed training or LLM workloads

Experience with computer architecture or hardware-software co-optimization

Experience with distributed systems, libraries, or frameworks

Familiarity with end-to-end model training pipelines

Previous internship or research experience in ML infrastructure or systems software

What You'll Do.

Contribute to design and implementation of distributed training

Extend and optimize distributed training frameworks

Develop and optimize mixed-precision and low-precision training techniques

Implement precision-aware training strategies

Implement loss scaling techniques

Implement gradient management

Profile, analyze, and tune end-to-end training pipelines

Partner with hardware, compiler, and runtime teams

Collaborate with AWS solution architects and customers

Support deployment and optimization of training workloads

How You'll Work.

Team & Collaboration

Chip architects; Compiler engineers; Runtime engineers; AWS solution architects; Hardware teams; Compiler teams; Runtime teams

Full Job Description

Annapurna Labs designs silicon and software that accelerates innovation. Our custom chips, accelerators, and software stacks enable us to tackle unprecedented technical challenges and deliver solutions that help customers change the world. AWS Neuron is the complete software stack powering AWS Trainium (Trn2/Trn3), our cloud scale Machine Learning accelerators and we are seeking a Senior Software Engineer to join our ML Distributed Training team.

In this role, you will be responsible for the development, enablement, and performance optimization of large scale ML model training across diverse model families. This includes massive scale pre-training and post-training of LLMs with Dense and Mixture-of-Experts architectures, Multimodal models that are transformer and diffusion based, and Reinforcement Learning workloads. You will work at the intersection of ML research and high performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium based systems.

Key job responsibilities

You will contribute to the design and implementation of distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks including FSDP, torchtitan, and Hugging Face libraries for the Neuron ecosystem.

A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to improve training throughput while maintaining model accuracy and convergence quality. This includes implementing precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced precision formats.

Beyond precision optimization, you will profile, analyze, and tune end-to-en
