Crusoe

Tech / AI / Software

Senior Production Engineer, Operational Excellence

$172–209k · San Francisco, California, United States · Full Time

The Brief

“Senior Production Engineer, Operational Excellence at Crusoe. Skills: Production Engineering, SRE, large-scale infrastructure operations, GPU workloads support, HPC environments support, Linux/Unix systems debugging, cloud infrastructure fundamentals, Kubernetes, distributed systems, monitoring and observability tools, infrastructure-as-code, scripting or programming. Ensure the reliability, scalability, and performance of Crusoe’s GPU cloud. Strengthen the operational foundation of Crusoe’s clo”

Industry & Context.

Tech / AI / Software
Problems you'll solve

Solving complex production problems; debugging and troubleshooting complex issues

Eligibility Requirements

Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments

What They're Looking For.

Must Have

  • 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
  • Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
  • Knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
  • Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
  • Understanding of modern cloud infrastructure fundamentals including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
  • Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
  • Experience with monitoring and observability tools such as Prometheus and Grafana, or a desire to deepen expertise in this area
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
  • Scripting or programming experience with languages such as Go, Python, C, or C++
  • Communication skills and the ability to collaborate across engineering teams
  • Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
  • A growth mindset and interest in reliability engineering, automation, and operational excellence

Nice to Have

  • Experience working with Kubernetes or container orchestration platforms at scale
  • Exposure to change management processes, operational readiness reviews, or structured root cause analysis
  • Experience designing self-healing systems, automated remediation, or event-driven operational tooling
  • Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
  • Passion for mentorship, learning, and developing deeper expertise in Production Engineering

What You'll Do.

  • Ensure the reliability, scalability, and performance of Crusoe’s GPU cloud
  • Strengthen the operational foundation of Crusoe’s cloud
  • Scale infrastructure that supports demanding AI and HPC workloads
  • Improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe’s rapidly growing GPU cloud
  • Define and evolve availability metrics for Crusoe’s cloud platform, and improve SLIs and SLOs
  • Participate in production incident response
  • Contribute to post-incident reviews and root cause analysis
  • Improve observability across Crusoe’s infrastructure
  • Identify reliability risks, performance bottlenecks, and early indicators of potential production issues
  • Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
  • Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
  • Contribute to improving operational processes and reliability best practices

How You'll Work.

Team & Collaboration

Collaborate with cross-functional teams; partner closely with Production Engineers, infrastructure teams, and platform engineers; partner with compute, networking, storage, and platform teams; collaborate across engineering teams

Communication Scope

communication skills and the ability to collaborate across engineering teams

Process & Methodology

incident management practices

Full Job Description

Crusoe is on a mission to accelerate the abundance of energy and intelligence. As the only vertically integrated AI infrastructure company built from the ground up, we own and operate each layer of the stack — from electrons to tokens — to power the world's most ambitious AI workloads. When you join Crusoe, you join a team that is building the future, faster.

We're in the midst of the greatest industrial revolution of our time. The demand for AI compute is boundless, and power is a bottleneck. We're solving that — with an energy-first approach that makes AI infrastructure better for the world and faster for the people innovating with AI.

We're looking for problem-solving, opportunity-finding teammates with a sense of urgency, who believe in the scale of our ambition and thrive on a path not fully paved — people who want to grow their careers alongside a team of experts across energy, manufacturing, data center construction, and cloud services.

If you want to do the most meaningful work of your career, help our customers and partners advance their AI strategies, and be part of a high-performing team that believes in each other, come build with us at Crusoe.

ABOUT THIS ROLE: Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platform — and Production Engineering sits at the heart of that mission. As a Production Engineer focused on Operational Excellence, you will help ensure the reliability, scalability, and performance of Crusoe’s GPU cloud that powers next-generation AI workloads. This role is ideal for engineers who enjoy solving complex production problems, improving large-scale distributed systems, and building automation that keeps infrastructure running smoothly.

You’ll play a key role in strengthening the operational foundation of Crusoe’s cloud while helping scale infrastructure that supports demanding AI and HPC workloads. You’ll partner closely with Production Engineers, infrastructure teams, and platform engineers to improve system

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.