TensorWave

AI compute

Senior Infrastructure Software Engineer

Las Vegas, Nevada, United States · Full Time

The Brief

“Senior Infrastructure Software Engineer at TensorWave. Skills: Infrastructure Software Engineering, Automation, Kubernetes, GPU Clusters, Go. Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production. Automate Slurm and Kubernetes cluster lifecycle—bootstrapping, upgrades, node provisioning, and decommissioning at scale”

What You'll Achieve.

Deliver seamless, secure, reliable, and resilient AI compute at scale; eliminate infrastructure barriers; bring hundreds of GPU nodes online reliably and repeatably; support business objectives; uphold our standards for excellence, collaboration, and impact.

Industry & Context.

AI compute

What They're Looking For.

Must Have

  • 5+ years in infrastructure engineering or platform engineering
  • 3+ years writing production Go
  • Deep understanding of Kubernetes internals
  • Experience building Kubernetes Operators
  • Experience building gRPC and REST APIs in Go at production scale
  • Familiarity with bare metal infrastructure concepts, including PXE, IPMI, and BMC
  • Testing discipline across unit, integration, and end-to-end tests
  • Proven ownership of observability stacks such as Prometheus, Grafana, OpenTelemetry, and Loki (or similar)

Nice to Have

  • Knowledge of GPU workload infrastructure
  • Experience with RoCE networking automation
  • Experience with GitOps tools such as ArgoCD
  • Experience with CI/CD tools such as GitHub Actions and Argo Workflows
  • Experience with Ansible and Terraform

What You'll Do.

Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production

Automate Slurm and Kubernetes cluster lifecycle: bootstrapping, upgrades, node provisioning, and decommissioning at scale

Develop and maintain infrastructure for GPU node configuration, including drivers and firmware

Own cluster validation pipelines, automating health checks and GPU burn-in tests

Build day-2 operations automation, including node remediation, rolling upgrades, and automated drain/cordon workflows

Write and maintain runbooks and documentation to enable reliable, repeatable operations

Own the full observability stack for automation services, provisioning pipelines, and cluster health systems

How You'll Work.

Team & Collaboration

Working closely with cross-functional partners to support business objectives.

Full Job Description

About TensorWave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

About the role

We’re looking for a Sr. Infrastructure Software Engineer to join our platform team during an exciting phase of growth. In this role, you’ll own the end-to-end automation of provisioning, configuring, and operating large-scale GPU clusters across bare metal, Kubernetes, and Slurm environments. This is a hands-on technical role focused on building the tooling and pipelines that bring hundreds of GPU nodes online reliably and repeatably, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.

Responsibilities

- Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production
- Automate Slurm and Kubernetes cluster lifecycle: bootstrapping, upgrades, node provisioning, and decommissioning at scale
- Develop and maintain infrastructure for GPU node configuration, including drivers and firmware
- Own cluster validation pipelines, automating health checks and GPU burn-in tests
- Build day-2 operations automation, including node remediation, rolling upgrades, and automated drain/cordon workflows
- Write and maintain runbooks and documentation to enable reliable, repeatable operations
- Own the full observability stack for automation services, provisioning pipelines, and cluster health systems

Required Experience

- 5+ years in infrastructure engineering or platform engineering
- 3+ years writing production Go
- Deep understanding of Kubernetes internals, including:
  - Informers and work queues
  - Controller-runtime and client-go
  - CRDs, custom controllers, and operators
  - Admission webho

How to Apply on Ashby

  • Ashby is a fast modern ATS — most applications take under 3 minutes.
  • The resume parser is strong; verify parsed experience dates and job titles.
  • Custom screening questions are often scored algorithmically — answer completely.
  • Location field affects geo-based screening; use your actual metro area.
