Lightricks
AI-first company creating next-generation content creation technology
Software Engineer (Large Scale Training)
Neural analysis suggests this role is optimal for Mid+ candidates.
“Software Engineer (Large Scale Training) at Lightricks. Skills: large-scale model training, distributed training framework, performance optimization, distributed systems, Python, C++. Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily. Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline.”
What You'll Achieve.
Make large-scale model training fast, reliable, and pleasant to work with; deliver expressive, high-fidelity video at unmatched speed; treat correctness, readability, testing, and long-term maintainability as mattering as much as the benchmark numbers.
Industry & Context.
Hard systems problems: squeezing throughput out of accelerator clusters; hunting down stragglers across hundreds of machines; designing abstractions that hold up as the codebase grows; making the unglamorous parts of training infrastructure work well. The role suits anyone who has looked at a large-scale system and thought "there's no reason this should be this slow / inefficient / hard to maintain / complex." It involves performance work (profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter); comfort with distributed systems (you've debugged things that only break at scale and have intuitions for where they tend to go wrong); and a bias toward understanding systems end-to-end rather than treating any layer as a black box.
What They're Looking For.
Must Have
Software engineering fundamentals; clean, tested, maintainable Python; comfortable reading and writing modern C++; real experience with performance work (profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter); comfort with distributed systems; a bias toward understanding systems end-to-end rather than treating any layer as a black box; familiarity with Kubernetes or similar environments for running and scaling large workloads
Nice to Have
ML training experience is a bonus; working knowledge of at least one accelerator architecture (GPU, TPU, or similar), or a clear track record of going deep on hardware when the problem calls for it; experience with JAX/Pallas, Triton, CUDA, OpenCL, Metal, or similar accelerator programming; prior exposure to ML training pipelines, even informally (pet projects count)
What You'll Do.
Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily
Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline
Own a shared codebase the team relies on: correctness, readability, testing, and long-term maintainability matter as much as the benchmark numbers
Work close to the metal where it pays off: write or integrate custom GPU kernels, tune collective communication, and exploit hardware features that off-the-shelf frameworks leave on the table
How You'll Work.
Team & Collaboration
Collaborate with researchers to translate model ideas into training code that runs efficiently, and flag when an architectural choice will be expensive before it ships; collaborative mindset
Full Job Description
Who we are
Lightricks is an AI-first company creating next-generation content creation technology for businesses, enterprises, and studios with a mission to bridge the gap between imagination and creation. At our core is LTX-2, an open-source generative video model, built to deliver expressive, high-fidelity video at unmatched speed. It powers both our own products and a growing ecosystem of partners through API integration. The company is also known globally for pioneering consumer creativity through products like Facetune, one of the world's most recognized creative brands, which helped introduce AI-powered visual expression to hundreds of millions of users worldwide. We combine deep research, user-first design, and end-to-end execution from concept to final render to bring the future of expression to all.
About the Role
This is a software engineering role on an ML team. You'll own the systems that make large-scale model training fast, reliable, and pleasant to work with: the distributed training framework, the data pipelines feeding it, the performance characteristics of every step on the critical path, and the day-to-day developer experience for the researchers who depend on it. You don't need to come in as an ML expert. You do need to be a strong engineer who gets excited about hard systems problems: squeezing throughput out of accelerator clusters, hunting down stragglers across hundreds of machines, designing abstractions that hold up as the codebase grows, and making the unglamorous parts of training infrastructure work well. If you've ever looked at a large-scale system and thought "there's no reason this should be this slow / inefficient / hard to maintain / complex," this role is built for you.
Key Responsibilities
Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily. Profile end-to-end training runs and eliminate bottlenecks wherever they