Lightning AI
AI
Infrastructure Operations Engineer
Neural analysis suggests this role is optimal for Senior candidates.
“Infrastructure Operations Engineer at Lightning AI. Skills: GPU infrastructure, Linux systems, Automation, Reliability, Scalability. Scale and operate next-generation AI infrastructure platform. Own break/fix operations”
What You'll Achieve.
Minimize incidents; Enable customer facing and internal features; Improve operational efficiency; Build automation that reduces manual toil over time
Industry & Context.
Troubleshoot issues; Minimize incidents
On-call rotation, Occasional company and team offsites
What They're Looking For.
Must Have
8+ years working with Linux as a server / hosting platform, 5+ years of experience with AWS, 2+ years of experience with Kubernetes and container fundamentals, 2+ years of experience with Terraform and Ansible, 2+ years with network-attached storage management (via NFS, Ceph, or other protocols), Experience with monitoring systems (Prometheus, ELK stack), Software development experience using Python, Go, Bash, or other languages for the purposes of automation
Nice to Have
Ubuntu experience, Experience with VAST storage systems, Familiarity with GitOps workflows
What You'll Do.
Scale and operate next-generation AI infrastructure platform
Own break/fix operations
Customer provisioning
Work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability
Roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features
Deploy updates and improvements to support both Voltage Park’s internal and end-customer use cases
Participate in the on-call rotation
How You'll Work.
Team & Collaboration
Partner closely with Infrastructure Engineering, Network Operations, and Software Platform teams to troubleshoot issues, improve operational efficiency, and build automation; Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success and Software and Platform Development Teams
Full Job Description
Who We Are
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Our Values
Move Fast: We act with speed and precision, breaking down big challenges into achievable steps.
Focus: We complete one goal at a time with care, collaborating as a team to deliver features with precision.
Balance: Sustained performance comes from rest and recovery. We ensure a healthy work-life balance to keep you at your best.
Craftsmanship: Innovation through excellence. Every detail matters, and we take pride in mastering our craft.
Minimal: Simplicity drives our innovation. We eliminate complexity through discipline and focus on what truly matters.

What We're Looking For
Lightning AI is seeking an experienced Infrastructure Operations Engineer to help scale and operate our next-generation AI infrastructure platform. Our InfraOps team sits at the center of reliability, automation, and operational scale for GPU infrastructure. This team owns break/fix operations, incident response, customer provisioning, observability, and the automation systems that keep complex infrastructure running efficiently.

In this role, you’ll work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability. You’ll partner closely with Infrastructure
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.