Lightricks

AI-first company creating next-generation content creation technology

Software Engineer (Large Scale Training)

Jerusalem, Jerusalem District, Israel

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Software Engineer (Large Scale Training) at Lightricks. Skills: large-scale model training, distributed training framework, performance optimization, distributed systems, Python, C++. Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily. Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline.”

What You'll Achieve.

Make large-scale model training fast, reliable, and pleasant to work with; deliver expressive, high-fidelity video at unmatched speed; keep correctness, readability, testing, and long-term maintainability on par with the benchmark numbers

Industry & Context.

AI-first company creating next-generation content creation technology

Problems you'll solve

Hard systems problems: squeezing throughput out of accelerator clusters; hunting down stragglers across hundreds of machines; designing abstractions that hold up as the codebase grows; making the unglamorous parts of training infrastructure work well; fixing large-scale systems that have no good reason to be this slow, inefficient, hard to maintain, or complex; performance work (profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter); debugging distributed systems that only break at scale, with intuitions for where they tend to go wrong; understanding systems end-to-end rather than treating any layer as a black box

What They're Looking For.

Must Have

Software engineering fundamentals: clean, tested, maintainable Python; comfortable reading and writing modern C++; real experience with performance work (profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter); comfort with distributed systems; a bias toward understanding systems end-to-end rather than treating any layer as a black box; familiarity with Kubernetes or similar environments for running and scaling large workloads

Nice to Have

ML training experience is a bonus; working knowledge of at least one accelerator architecture (GPU, TPU, or similar), or a clear track record of going deep on hardware when the problem calls for it; experience with JAX/Pallas, Triton, CUDA, OpenCL, Metal, or similar accelerator programming; prior exposure to ML training pipelines, even informally (pet projects count)

What You'll Do.

Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily

Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline

Own a shared codebase the team relies on: correctness, readability, testing, and long-term maintainability matter as much as the benchmark numbers

Work close to the metal where it pays off: write or integrate custom GPU kernels, tune collective communication, and exploit hardware features that off-the-shelf frameworks leave on the table

How You'll Work.

Team & Collaboration

Collaborate with researchers to translate model ideas into training code that runs efficiently, and flag when an architectural choice will be expensive before it ships; collaborative mindset

Full Job Description

Who we are

Lightricks is an AI-first company creating next-generation content creation technology for businesses, enterprises, and studios with a mission to bridge the gap between imagination and creation. At our core is LTX-2, an open-source generative video model, built to deliver expressive, high-fidelity video at unmatched speed. It powers both our own products and a growing ecosystem of partners through API integration. The company is also known globally for pioneering consumer creativity through products like Facetune, one of the world's most recognized creative brands, which helped introduce AI-powered visual expression to hundreds of millions of users worldwide. We combine deep research, user-first design, and end-to-end execution from concept to final render to bring the future of expression to all.

About the Role

This is a software engineering role on an ML team. You'll own the systems that make large-scale model training fast, reliable, and pleasant to work with: the distributed training framework, the data pipelines feeding it, the performance characteristics of every step on the critical path, and the day-to-day developer experience for the researchers who depend on it.

You don't need to come in as an ML expert. You do need to be a strong engineer who gets excited about hard systems problems: squeezing throughput out of accelerator clusters, hunting down stragglers across hundreds of machines, designing abstractions that hold up as the codebase grows, and making the unglamorous parts of training infrastructure work well. If you've ever looked at a large-scale system and thought "there's no reason this should be this slow / inefficient / hard to maintain / complex," this role is built for you.

Key Responsibilities

Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily.

Profile end-to-end training runs and eliminate bottlenecks wherever they


SKILL SIGNAL 56 detected · ranked by frequency

large-scale model training ×5

distributed training framework ×5

distributed systems ×5

Python ×3

data pipelines ×3

performance characteristics ×3

developer experience ×3

squeezing throughput out of accelerator clusters ×3

hunting down stragglers across hundreds of machines ×3

designing abstractions ×3

orchestration ×3

checkpointing ×3

fault tolerance ×3

observability ×3

ergonomics researchers interact with daily ×3

Profile end-to-end training runs ×3

eliminate bottlenecks ×3

compute ×3

memory ×3

interconnect ×3

storage ×3

data pipeline ×3

Work close to the metal ×3

write or integrate custom GPU kernels ×3

tune collective communication ×3

exploit hardware features ×3

performance profiling ×3

optimization ×3

reasoning about systems where latency, throughput, and resource contention actually matter ×3

debugged things that only break at scale ×3

intuitions for where they tend to go wrong ×3

understanding systems end-to-end ×3

BEHAVIOURAL

gets excited about hard systems problems; curious about ML; collaborative mindset; think, create, and explore; empowered to experiment, evolve, and elevate together

Role Details

Category tech

AI-Extracted Insights

Domain Areas

generative-video-model; ml-training-pipelines
