Lambda
Data Center Business
Senior Incident Manager
“Senior Incident Manager at Lambda. Skills: Incident management, Infrastructure operations, Cross-team coordination, Reliability engineering. Lead incident response. Coordinate rapid resolution”
What You'll Achieve.
Reduced Mean Time to Resolution; Improved cross-team incident coordination; High-quality post-incident reviews; High-quality corrective actions; Increased infrastructure reliability; Increased operational maturity
Industry & Context.
Root cause analysis; Incident analysis; Troubleshooting
On-Call Rotation
What They're Looking For.
Must Have
8+ years of incident management, 8+ years of site reliability engineering, 8+ years of infrastructure operations, Manage incidents in large-scale distributed infrastructure, Understand data center operations, Understand GPU compute clusters, Understand networking infrastructure, Understand storage infrastructure, Understand cloud/hybrid infrastructure platforms, Lead high-pressure incident response, Experience with incident management frameworks, Excellent communication skills, Excellent stakeholder management skills
Nice to Have
Operating AI infrastructure, Operating HPC infrastructure, SRE background, Infrastructure engineering background, Data center operations background, High-density GPU environments, NVIDIA clusters, InfiniBand networks, Hyperscale data center environments, Colocation data center environments, Automation incident response tooling, Incident command system (ICS), Leading incident command from scratch, Developing incident command from scratch
What You'll Do.
Lead incident response
Coordinate rapid resolution
Improve operational resilience
Drive incident management best practices
Lead the end-to-end lifecycle of incidents
Act as central command point
Ensure cross-team coordination
Ensure effective communication
Ensure structured post-incident analysis
Lead the response to critical incidents
Coordinate engineering teams
Coordinate networking teams
Coordinate facilities teams
Coordinate vendor teams
Act as liaison between leadership and external teams
Provide updates and status summaries
Establish clear incident timelines
Establish triage actions
Establish resolution plans
Assist with technical triage
Manage post-incident review
Ensure timely communication
Ensure accurate communication
Maintain incident documentation
Maintain operational playbooks
Conduct analysis of incidents
Identify patterns and trends
Work in on-call rotation
Drive alignment during outages
Lead post-incident reviews
Lead root cause analysis
Identify systemic reliability gaps
Implement corrective actions
Track incident metrics
Improve incident response processes
Improve escalation paths
Contribute to runbooks
Contribute to operational standards
Contribute to reliability frameworks
Support implementation of automation
Support implementation of observability
Provide executive summaries
Provide executive reports
Deliver clear updates
Deliver concise updates
Maintain incident dashboards
Maintain operational health reporting
How You'll Work.
Team & Collaboration
Cross-team coordination; Coordination with engineering teams; Coordination with networking teams; Coordination with facilities teams; Coordination with vendor teams; Coordination across infrastructure layers; Work with technical support; Work with engineering teams
Communication Scope
Executive summaries; Executive reports; Incident updates; Status summaries; Clear updates; Concise updates
Process & Methodology
Incident management frameworks
Full Job Description
Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us.
We are seeking a Senior Incident Manager to lead critical incident response across our AI data center infrastructure. This role is responsible for coordinating rapid resolution of service-impacting events, improving operational resilience, and driving incident management best practices across infrastructure, networking, platform engineering, and data center operations.
Role Overview
The Senior Incident Manager is responsible for leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and data center services. This individual acts as the central command point during major incidents, ensuring rapid triage, cross-team coordination, effective communication, and structured post-incident analysis. This role requires deep operational expertise in high-availability infrastructure, large-scale GPU clusters, networking, and cloud platforms, along with strong leadership and communication skills.
What You’ll Do
Incident Leadership
- Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during incidents / post-incidents to provide updates and status summaries.
- Establish clear incident timelines, triage actions, and resolution plans.
Incident Management Operations
- Own the incident response lifecycle including:
  - Assisting Technical Triage
  - Escalation
  - Coordination
  - Resolution
  - Post-incident r
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.