Amazon Development Centre Canada ULC

Technology

ML Kernel Performance Engineer

CA$80–192k · Vancouver, British Columbia, Canada · Full time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“ML Kernel Performance Engineer at Amazon Development Centre Canada ULC. Skills: ML Kernel Performance, CUDA, Triton, GPU Optimization. Design CUDA and Triton kernels. Implement kernels for quantization-aware training”

What You'll Achieve.

Achieve 20-100x compression; Accelerate training runs; Unlock deployment of compressed models; Translate algorithmic ideas into production-grade code

Industry & Context.

Technology
Problems you'll solve

Diagnose kernel bottlenecks; Fix kernel bottlenecks; Troubleshoot performance issues

What They're Looking For.

Must Have

3+ years professional software development, 2+ years system design/architecture, Experience with CUDA kernels or ML/low-level kernels, Experience developing/deploying LLMs on GPUs/TPUs, Experience with Python, Java, C++

Nice to Have

Bachelor's degree in computer science, 3+ years full software development lifecycle, Experience with GPU kernel optimization, Proficiency in low-level GPU optimization, Understanding GPU memory hierarchies, Experience developing high-performance libraries, Knowledge of ML frameworks, Experience implementing custom PyTorch operators, Experience with parallel programming, Background in neural network compression, Knowledge of mixed-precision training/inference, Experience with inference optimization, Familiarity with Transformer architectures, Experience with AWS Trainium/Inferentia, Experience with edge deployment

What You'll Do.

Design CUDA and Triton kernels

Implement kernels for quantization-aware training

Implement kernels for sparse matrix operations

Implement kernels for low-bit inference

Analyze kernel-level performance

Optimize kernel performance for compression training

Conduct detailed performance analysis

Identify and resolve bottlenecks

Implement kernel-level optimizations

Optimize operator fusion

Optimize memory access patterns

Build kernel development harness

Enable performance profiling

Enable accuracy testing

Enable validation at scale

Maintain training kernels library

Extend training kernels library

Collaborate on ML-centric solutions

Co-design software and hardware

Develop inference kernels

Build performance regression tests

Maintain benchmarking infrastructure

Translate algorithmic ideas into GPU code

How You'll Work.

Team & Collaboration

Compression scientists; Platform engineers; Applied Scientists; Compiler engineers; Hardware architects; Platform developers

Full Job Description

Amazon Devices is an inventive research and development company that designs and engineers high-profile consumer products like the Kindle family, Fire Tablets, Fire TV, Health & Wellness devices, Amazon Echo, and Astro. We are building the next generation of edge AI capabilities through our advanced compression platform and custom neural accelerator silicon.

Within Edge AI & Science, the AI Platform team builds a compression platform, the first of its kind, enabling 20-100x neural network compression for edge and cloud deployment. As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost.

We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves. Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.

Key job responsibilities

Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators

Analyze and optimize kernel-level performance for compression training workloads, condu
