Lightning AI

Infrastructure Operations Engineer

Itasca, Illinois, United States; United States Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Infrastructure Operations Engineer at Lightning AI. Skills: GPU infrastructure, Linux systems, Automation, Reliability, Scalability. Scale and operate next-generation AI infrastructure platform. Own break/fix operations”

What You'll Achieve.

Minimize incidents; Enable customer-facing and internal features; Improve operational efficiency; Build automation that reduces manual toil over time

Industry & Context.

Problems you'll solve

Troubleshoot issues; Minimize incidents

Eligibility Requirements

On-call rotation, Occasional company and team offsites

What They're Looking For.

Must Have

8+ years working with Linux as a server/hosting platform
5+ years of experience with AWS
2+ years of experience with Kubernetes and container fundamentals
2+ years of experience with Terraform and Ansible
2+ years with network attached storage management (via NFS, Ceph, or other protocols)
Experience with monitoring systems (Prometheus, ELK stack)
Software development experience using Python, Go, Bash, or other languages for the purposes of automation

Nice to Have

Ubuntu experience
Experience with VAST storage systems
Familiarity with the GitOps workflow

What You'll Do.

Scale and operate next-generation AI infrastructure platform

Own break/fix operations

Customer provisioning

Work hands-on with large-scale GPU environments, bare metal infrastructure, provisioning workflows, and platform reliability

Roll out new platforms and patterns to minimize incidents and enable customer-facing and internal features

Deploy updates and improvements to support both Voltage Park’s internal and end customer use cases

Participate in the on-call rotation

How You'll Work.

Team & Collaboration

Partner closely with Infrastructure Engineering, Network Operations, and Software Platform teams to troubleshoot issues, improve operational efficiency, and build automation; Collaborate with colleagues in Infrastructure Engineering, Network Operations, Customer Success and Software and Platform Development Teams

Full Job Description

Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in. We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Our Values

Move Fast: We act with speed and precision, breaking down big challenges into achievable steps.
Focus: We complete one goal at a time with care, collaborating as a team to deliver features with precision.
Balance: Sustained performance comes from rest and recovery. We ensure a healthy work-life balance to keep you at your best.
Craftsmanship: Innovation through excellence. Every detail matters, and we take pride in mastering our craft.
Minimal: Simplicity drives our innovation. We eliminate complexity through discipline and focus on what truly matters.

What We're Looking For

Lightning AI is seeking an experienced Infrastructure Operations Engineer to help scale and operate our next-generation AI infrastructure platform. Our InfraOps team sits at the center of reliability, automation, and operational scale for GPU infrastructure. This team owns break/fix operations, incident response, customer provisioning, observability, and the automation systems that keep complex infrastructure running efficiently.

In this role, you’ll work hands-on with large-scale GPU environments, Linux systems, bare metal infrastructure, provisioning workflows, and platform reliability. You’ll partner closely with Infrastructure

SKILL SIGNAL 27 detected · ranked by frequency

GPU infrastructure ×3

Break/fix operations ×3

Incident response ×3

Automation systems ×3

Bare metal infrastructure ×3

Provisioning workflows ×3

Network attached storage management ×3

Linux systems ×2

Automation ×2

Reliability ×2

Scalability ×2

Linux ×2

AWS ×2

Kubernetes ×2

Terraform ×2

Ansible ×2

NFS ×2

ceph ×2

Prometheus ×2

ELK stack ×2

Python ×2

Go ×2

bash ×2

Platform reliability

Operational scale

Customer provisioning

Observability

BEHAVIOURAL

Collaboration

Role Details

Experience 8–10 yrs

Level Senior

Work Mode Remote

Category infrastructure

AI-Extracted Insights

Domain Areas

ai-systems, gpu-infrastructure

How to Apply on Greenhouse

Create a Greenhouse profile before applying — it saves time across multiple applications.
Upload your resume as a PDF; the parser handles it better than Word.
Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
Enable email notifications to track application status in real time.
