Sciforium

Technology

Distributed Training and Inference Engineer

$190–250k San Francisco, California, United States FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Distributed Training and Inference Engineer at Sciforium. Skills: Distributed training, ML infrastructure, Systems engineering, Performance optimization. Maintain ML libraries. Update ML frameworks”

Industry & Context.

Technology

Problems you'll solve

Debugging complex issues; Troubleshooting; Performance analysis

What They're Looking For.

Must Have

5+ years industry experience, Bachelor's or Master's degree, Python, C++ programming, ML tooling familiarity, Distributed systems familiarity

Nice to Have

Extensive XLA/JAX stack experience, Familiarity with distributed serving, Familiarity with large-scale inference, GPU kernel optimization background, Accelerator-aware model partitioning background

What You'll Do.

Maintain ML libraries

Optimize ML libraries

Optimize ML frameworks

Build ML software stack

Maintain ML software stack

Improve ML software stack

Shard model implementations

Partition model implementations

Configure model implementations

Profile compilation graphs

Profile training workloads

Profile runtime execution

Eliminate performance bottlenecks

Troubleshoot hardware-software issues

Collaborate with research teams

Collaborate with infrastructure teams

Collaborate with kernel engineering teams

Improve system throughput

Improve system stability

Improve developer experience

How You'll Work.

Team & Collaboration

Collaborate with research; Collaborate with infrastructure; Collaborate with kernel engineering

Full Job Description

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

ABOUT THE ROLE

Sciforium is seeking a highly skilled Distributed Training and Inference Engineer to build, optimize, and maintain the critical software stack that powers our large-scale AI training and serving workloads. In this role, you will work across the entire machine learning infrastructure, from low-level CUDA/ROCm runtimes to high-level frameworks like JAX and PyTorch, to ensure our distributed training systems are fast, scalable, stable, and efficient.

This position is ideal for someone who loves deep systems engineering, debugging complex hardware–software interactions, and optimizing performance at every layer of the ML stack. You will play a pivotal role in enabling the training and deployment of next-generation LLMs and generative AI models.

WHAT YOU'LL DO

- Software Stack Maintenance: Maintain, update, and optimize critical ML libraries and frameworks, including JAX, PyTorch, CUDA, and ROCm, across multiple environments and hardware configurations.
- End-to-End Stack Ownership: Build, maintain, and continuously improve the entire ML software stack, from ROCm/CUDA drivers to high-level JAX/PyTorch tooling.
- Distributed System Optimization: Ensure all model implementations are efficiently sharded, partitioned, and configured for large-scale distributed training and serving.
- System Integration: Continuously integrate and validate modules for runtime correctness, memory efficiency, and scalability across multi-node GPU/accelerator clusters.
- Profiling & Performance Analysis: Conduct detailed profiling of compilation graphs, training workloads, and runtime execution to optimize performance and elimina

SKILL SIGNAL 25 detected · ranked by frequency

Distributed training ×5

ML infrastructure ×5

Systems engineering ×3

Performance optimization ×3

ML systems ×3

Large-scale AI training ×3

Large-scale AI serving ×3

Multi-node distributed training ×3

Partitioning configuration ×3

Python

JAX

PyTorch

CUDA

ROCm

NCCL

XLA

Systems integration

Debugging complex issues

Nsight

ROCm Profiler

XLA profiler

TPU tools

vLLM

TensorRT

FasterTransformer

Role Details

Seniority Senior

Experience 5–10 yrs

Level Senior

Type FULL TIME

Category software

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

multimodal-ai-models, ai-serving-platform, frontier-ai-models, real-time-applications, llms, generative-ai-models, ml-stack, cuda-runtimes

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
