TensorWave
AI compute
Senior Infrastructure Software Engineer
“Senior Infrastructure Software Engineer at TensorWave. Skills: Infrastructure Software Engineering, Automation, Kubernetes, GPU Clusters, Go. Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production. Automate Slurm and Kubernetes cluster lifecycle—bootstrapping, upgrades, node provisioning, and decommissioning at scale”
What You'll Achieve.
Deliver seamless, secure, reliable, and resilient AI compute at scale; eliminate infrastructure barriers; bring hundreds of GPU nodes online reliably and repeatably; support business objectives; uphold our standards for excellence, collaboration, and impact.
What They're Looking For.
Must Have
- 5+ years in infrastructure engineering or platform engineering
- 3+ years writing production Go
- Deep understanding of Kubernetes internals
- Experience building Kubernetes Operators
- Experience building gRPC and REST APIs in Go at production scale
- Familiarity with bare metal infrastructure concepts, including PXE, IPMI, and BMC
- Testing discipline across unit, integration, and end-to-end tests
- Proven ownership of observability stacks such as Prometheus, Grafana, OpenTelemetry, and Loki (or similar)
Nice to Have
- Knowledge of GPU workload infrastructure
- Experience with RoCE networking automation
- Experience with GitOps tools such as ArgoCD
- Experience with CI/CD tools such as GitHub Actions and Argo Workflows
- Experience with Ansible and Terraform
What You'll Do.
- Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production
- Automate Slurm and Kubernetes cluster lifecycle: bootstrapping, upgrades, node provisioning, and decommissioning at scale
- Develop and maintain infrastructure for GPU node configuration, including drivers and firmware
- Own cluster validation pipelines, automating health checks and GPU burn-in tests
- Build day-2 operations automation, including node remediation, rolling upgrades, and automated drain/cordon workflows
- Write and maintain runbooks and documentation to enable reliable, repeatable operations
- Own the full observability stack for automation services, provisioning pipelines, and cluster health systems
How You'll Work.
Team & Collaboration
Work closely with cross-functional partners to support business objectives.
Full Job Description
About TensorWave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

About the role

We’re looking for a Sr. Infrastructure Software Engineer to join our platform team during an exciting phase of growth. In this role, you’ll own the end-to-end automation of provisioning, configuring, and operating large-scale GPU clusters across bare metal, Kubernetes, and Slurm environments. This is a hands-on technical role focused on building the tooling and pipelines that bring hundreds of GPU nodes online reliably and repeatably, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.

Responsibilities

- Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production
- Automate Slurm and Kubernetes cluster lifecycle: bootstrapping, upgrades, node provisioning, and decommissioning at scale
- Develop and maintain infrastructure for GPU node configuration, including drivers and firmware
- Own cluster validation pipelines, automating health checks and GPU burn-in tests
- Build day-2 operations automation, including node remediation, rolling upgrades, and automated drain/cordon workflows
- Write and maintain runbooks and documentation to enable reliable, repeatable operations
- Own the full observability stack for automation services, provisioning pipelines, and cluster health systems

Required Experience

- 5+ years in infrastructure engineering or platform engineering
- 3+ years writing production Go
- Deep understanding of Kubernetes internals, including:
  - Informers and work queues
  - Controller-runtime and client-go
  - CRDs, custom controllers, and operators
  - Admission webho
How to Apply on Ashby
- Ashby is a fast modern ATS — most applications take under 3 minutes.
- The resume parser is strong; verify parsed experience dates and job titles.
- Custom screening questions are often scored algorithmically — answer completely.
- Location field affects geo-based screening; use your actual metro area.