Firmus Technologies
Technology
Senior HPC Infrastructure Engineer
“Senior HPC Infrastructure Engineer at Firmus Technologies. Skills: Kubernetes, HPC, GPU, Bare-metal provisioning. Design and implement bare-metal provisioning workflows. Deploy and manage GPU-enabled AI compute nodes”
What You'll Achieve.
Reliable provisioning of Kubernetes and Slurm AI clusters; Performance validation and optimisation; Improved operational efficiency; High-quality documentation; Effective knowledge transfer
Industry & Context.
Root cause analysis; Troubleshooting
On-call rotation
What They're Looking For.
Must Have
Bachelor's or Master's degree; Experience with bare-metal cluster provisioning; Deep knowledge of Kubernetes internals; Understanding of Slurm configuration; Understanding of GPU systems; Familiarity with container runtimes; Experience with benchmarking; Practical Linux systems engineering; Automation mindset; Understanding of firmware; Proficiency in Go, Bash, Rust, or Python; Excellent documentation skills; Experience participating in on-call rotation
Nice to Have
Kubernetes CRDs; AI compute nodes; RDMA, InfiniBand, and RoCE networking; NCCL, UCX, GPUDirect, and fabric tuning; Kubernetes primitives for GPU scheduling; Slurm GPU clusters; Topology-aware configurations; MLPerf; NCCL tests; Microbenchmarks; Throughput/latency validation; Observability across GPU, InfiniBand fabric, storage, and provisioning components; Hardware bring-up activities; BIOS tuning; GPU topology verification; NUMA alignment; PCIe/NVLink checks; CI/CD automation; Custom Kubernetes operators; Intelligent orchestration frameworks
What You'll Do.
Design and implement bare-metal provisioning workflows
Deploy and manage GPU-enabled AI compute nodes
Optimise Kubernetes and Slurm platforms
Implement Kubernetes primitives for GPU scheduling
Design, deploy, and fine-tune Slurm GPU clusters
Develop and execute performance benchmarking workloads
Establish observability across GPU, InfiniBand fabric, storage, and provisioning components
Document architecture designs, operational procedures, and performance results
Collaborate with L2 SRE engineers
Support hardware bring-up activities
Contribute to continuous improvement in cluster validation
Contribute to CI/CD automation
Contribute to provisioning and testing frameworks
Contribute to development of custom Kubernetes operators
Contribute to intelligent orchestration frameworks
How You'll Work.
Team & Collaboration
L2 SRE engineers; Site operations; Networking teams
Communication Scope
Technical Communication
Full Job Description
Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation. You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
- Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
- Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
- Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
- Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
- Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
- Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
- Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
- Document architecture designs, operational procedures, and performance results.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.
Skills & Experience
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Experience with bare-metal cluster p
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.