Crusoe
Tech / AI / Software
Senior Production Engineer, Operational Excellence
“Senior Production Engineer, Operational Excellence at Crusoe. Skills: Production Engineering, SRE, large-scale infrastructure operations, GPU workloads support, HPC environments support, Linux/Unix systems debugging, cloud infrastructure fundamentals, Kubernetes, distributed systems, monitoring and observability tools, infrastructure-as-code, scripting or programming. Ensure the reliability, scalability, and performance of Crusoe’s GPU cloud. Strengthen the operational foundation of Crusoe’s clo”
Industry & Context.
problem-solving; solving complex production problems; debugging complex issues; troubleshooting complex issues
Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
What They're Looking For.
Must Have
- 5+ years of experience in Production Engineering, SRE, or large-scale infrastructure operations
- Experience supporting GPU workloads, HPC environments, or latency/throughput-sensitive distributed systems
- Knowledge of Linux/Unix systems, including debugging complex issues across kernel and user space
- Previous experience in infrastructure roles building or managing compute, storage, or networking platforms
- Understanding of modern cloud infrastructure fundamentals, including Kubernetes, distributed systems, virtualization, and cloud platforms (AWS/GCP)
- Familiarity with incident management practices and reliability frameworks (SRE, ITIL, or similar)
- Experience with monitoring and observability tools such as Prometheus and Grafana, or a desire to deepen expertise in this area
- Familiarity with infrastructure-as-code and configuration management tools such as Terraform or Ansible
- Scripting or programming experience with languages such as Go, Python, C, or C++
- Communication skills and the ability to collaborate across engineering teams
- Ability to remain calm and effective while troubleshooting complex issues in high-impact production environments
- A growth mindset and interest in reliability engineering, automation, and operational excellence
Nice to Have
- Experience working with Kubernetes or container orchestration platforms at scale
- Exposure to change management processes, operational readiness reviews, or structured root cause analysis
- Experience designing self-healing systems, automated remediation, or event-driven operational tooling
- Interest in scaling AI or HPC infrastructure and solving reliability challenges in GPU-heavy environments
- Passion for mentorship, learning, and developing deeper expertise in Production Engineering
What You'll Do.
- Ensure the reliability, scalability, and performance of Crusoe’s GPU cloud
- Strengthen the operational foundation of Crusoe’s cloud
- Scale infrastructure that supports demanding AI and HPC workloads
- Improve system reliability, reduce operational toil, and drive continuous improvements across Crusoe’s rapidly growing GPU cloud
- Define and evolve availability metrics for Crusoe’s cloud platform, and improve SLIs and SLOs
- Participate in production incident response and contribute to post-incident reviews and root cause analysis
- Improve observability across Crusoe’s infrastructure
- Identify reliability risks, performance bottlenecks, and early indicators of potential production issues
- Develop automation and tooling that reduces operational toil, improves recovery times, and enables self-healing infrastructure
- Partner with compute, networking, storage, and platform teams to strengthen service resilience and disaster recovery capabilities
- Contribute to improving operational processes and reliability best practices
How You'll Work.
Team & Collaboration
Collaborate with cross-functional teams; partner closely with Production Engineers, infrastructure teams, and platform engineers; partner with compute, networking, storage, and platform teams; collaborate across engineering teams
Communication Scope
communication skills and the ability to collaborate across engineering teams
Process & Methodology
incident management practices
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.