Lambda
Data Center Business
Senior Incident Manager
Neural analysis suggests this role is
optimal for Senior candidates.
“Senior Incident Manager at Lambda. Skills: Incident management, Infrastructure operations, Site reliability. Lead critical incident response. Coordinate rapid resolution”
What You'll Achieve.
Reduced Mean Time to Resolution; Improved cross-team incident coordination; High-quality post-incident reviews; High-quality corrective actions; Increased infrastructure reliability; Increased operational maturity
Industry & Context.
Root cause analysis; Incident analysis; Pattern identification; Trend identification
On-Call Rotation
What They're Looking For.
Must Have
8+ years of incident management, 8+ years of site reliability engineering, 8+ years of infrastructure operations, Manage incidents in large-scale distributed infrastructure, Understand data center operations, Understand GPU compute clusters, Understand networking infrastructure, Understand storage infrastructure, Understand cloud/hybrid infrastructure platforms, Lead high-pressure incident response, Experience with incident management frameworks, Excellent communication skills, Excellent stakeholder management skills
Nice to Have
Operating AI infrastructure, Operating HPC infrastructure, SRE background, Infrastructure engineering background, Data center operations background, High-density GPU environments, NVIDIA clusters, InfiniBand networks, Hyperscale data center environments, Colocation data center environments, Automation incident response tooling, Incident command system (ICS), Leading incident command from scratch, Developing incident command from scratch
What You'll Do.
Lead critical incident response
Coordinate rapid resolution
Improve operational resilience
Drive incident management best practices
Lead the end-to-end lifecycle of operational incidents
Act as central command point
Ensure cross-team coordination
Ensure effective communication
Ensure structured post-incident analysis
Lead the response to critical (SEV-1 / SEV-2) incidents
Coordinate engineering teams
Coordinate networking teams
Coordinate facilities teams
Coordinate vendor teams
Act as liaison between leadership and external teams
Provide updates to leadership
Provide status summaries to leadership
Establish clear incident timelines
Establish triage actions
Establish resolution plans
Own incident response lifecycle
Assist technical triage
Ensure timely communication
Ensure accurate communication
Maintain incident response documentation
Maintain operational playbooks
Conduct analysis of incidents
Identify patterns across incidents
Identify trends across incidents
Work on-call rotation
Work closely with data center operations
Work closely with infrastructure engineering
Work closely with infrastructure operations
Work closely with network engineering
Work closely with platform reliability engineering
Work closely with security operations
Work closely with hardware vendors
Work closely with facility vendors
Drive alignment during outages
Lead post-incident reviews
Lead root cause analysis
Identify systemic reliability gaps
Implement corrective actions
Track incident metrics
Improve incident response processes
Improve escalation paths
Contribute to runbooks
Contribute to operational standards
Contribute to reliability frameworks
Support implementation of automation
Support implementation of observability improvements
Provide executive-level summaries
Provide executive-level reports
Deliver clear updates
Deliver concise updates
Maintain incident dashboards
Maintain operational health reporting
How You'll Work.
Team & Collaboration
Cross-team coordination; Coordination across infrastructure layers; Coordination across multiple teams
Communication Scope
Executive summaries; Incident updates; Status summaries
Full Job Description
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.
We are seeking a Senior Incident Manager to lead critical incident response across our AI data center infrastructure. This role is responsible for coordinating rapid resolution of service-impacting events, improving operational resilience, and driving incident management best practices across infrastructure, networking, platform engineering, and data center operations.
Role Overview
The Senior Incident Manager is responsible for leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. This individual acts as the central command point during major incidents, ensuring rapid triage, cross-team coordination, effective communication, and structured post-incident analysis. This role requires deep operational expertise in high-availability infrastructure, large-scale GPU clusters, networking, and cloud platforms, along with strong leadership and communication skills.
What You’ll Do
Incident Leadership
- Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during incidents / post-incidents to provide updates and status summaries.
- Establish clear incident timelines, triage actions, and resolution plans.
Incident Management Operations
- Own the incident response lifecycle including:
  - Assisting Technical Triage
  - Escalation
  - Coordination
  - Resolution Post-incident r
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.