Amazon Development Centre Canada ULC
Technology
ML Kernel Performance Engineer
Neural analysis suggests this role is optimal for Mid+ candidates.
What You'll Achieve.
Achieve 20-100x compression; Accelerate training runs; Unlock deployment of compressed models; Translate algorithmic ideas into production-grade code
Industry & Context.
Diagnose kernel bottlenecks; Fix kernel bottlenecks; Troubleshoot performance issues
What They're Looking For.
Must Have
3+ years professional software development
2+ years system design/architecture
Experience with CUDA kernels or ML/low-level kernels
Experience developing/deploying LLMs on GPUs/TPUs
Experience with Python, Java, C++
Nice to Have
Bachelor's degree in computer science
3+ years full software development lifecycle
Experience with GPU kernel optimization
Proficiency in low-level GPU optimization
Understanding GPU memory hierarchies
Experience developing high-performance libraries
Knowledge of ML frameworks
Experience implementing custom PyTorch operators
Experience with parallel programming
Background in neural network compression
Knowledge of mixed-precision training/inference
Experience with inference optimization
Familiarity with Transformer architectures
Experience with AWS Trainium/Inferentia
Experience with edge deployment
What You'll Do.
Design CUDA and Triton kernels
Implement kernels for quantization-aware training
Implement kernels for sparse matrix operations
Implement kernels for low-bit inference
Analyze kernel-level performance
Optimize kernel performance for compression training
Conduct detailed performance analysis
Identify and resolve bottlenecks
Implement kernel-level optimizations
Optimize operator fusion
Optimize memory access patterns
Build kernel development harness
Enable performance profiling
Enable accuracy testing
Enable validation at scale
Maintain training kernels library
Extend training kernels library
Collaborate on ML-centric solutions
Co-design software and hardware
Develop inference kernels
Build performance regression tests
Maintain benchmarking infrastructure
Translate algorithmic ideas into GPU code
How You'll Work.
Team & Collaboration
Compression scientists; Platform engineers; Applied Scientists; Compiler engineers; Hardware architects; Platform developers
Full Job Description
Amazon Devices is an inventive research and development company that designs and engineers high-profile consumer products like the Kindle family, Fire Tablets, Fire TV, Health & Wellness devices, Amazon Echo, and Astro. We are building the next generation of edge AI capabilities through our advanced compression platform and custom neural accelerator silicon.
Within Edge AI & Science, the AI Platform team builds a compression platform, the first of its kind, enabling 20-100x neural network compression for edge and cloud deployment. As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost.
We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves. Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.
Key job responsibilities
Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators
Analyze and optimize kernel-level performance for compression training workloads, condu