Firmus Technologies

Technology

Senior HPC Infrastructure Engineer

A$165–235k (AI estimate)
Sydney, New South Wales, Australia; Launceston, Tasmania, Australia
Full time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior HPC Infrastructure Engineer at Firmus Technologies. Skills: Kubernetes, HPC, GPU, bare-metal provisioning. Design and implement bare-metal provisioning workflows. Deploy and manage GPU-enabled AI compute nodes.”

What You'll Achieve.

Reliable provisioning of Kubernetes and Slurm AI clusters; Performance validation and optimisation; Improved operational efficiency; High-quality documentation; Effective knowledge transfer

Industry & Context.

Technology
Problems you'll solve

Root cause analysis; Troubleshooting

Eligibility Requirements

On-call rotation

What They're Looking For.

Must Have

Bachelor's or Master's degree; Experience with bare-metal cluster provisioning; Deep knowledge of Kubernetes internals; Understanding of Slurm configuration; Understanding of GPU systems; Familiarity with container runtimes; Experience with benchmarking; Practical Linux systems engineering; Automation mindset; Understanding of firmware; Proficiency in Go, Bash, Rust, or Python; Excellent documentation skills; Experience participating in on-call rotation

Nice to Have

Kubernetes CRDs; AI compute nodes; RDMA, InfiniBand, and RoCE networking; NCCL, UCX, GPUDirect, and fabric tuning; Kubernetes primitives for GPU scheduling; Slurm GPU clusters; Topology-aware configurations; MLPerf, NCCL tests, microbenchmarks; Throughput/latency validation; Observability across GPU, InfiniBand fabric, storage, and provisioning components; Hardware bring-up activities; BIOS tuning; GPU topology verification; NUMA alignment; PCIe/NVLink checks; CI/CD automation; Custom Kubernetes operators; Intelligent orchestration frameworks

What You'll Do.

Design and implement bare-metal provisioning workflows

Deploy and manage GPU-enabled AI compute nodes

Optimise Kubernetes and Slurm platforms

Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models

Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations

Develop and execute performance benchmarking workloads

Establish observability across GPU, InfiniBand fabric, storage, and provisioning components

Document architecture designs, operational procedures, and performance results

Collaborate with L2 SRE engineers

Support hardware bring-up activities

Contribute to continuous improvement in cluster validation

Contribute to CI/CD automation

Contribute to provisioning and testing frameworks

Contribute to development of custom Kubernetes operators

Contribute to intelligent orchestration frameworks

How You'll Work.

Team & Collaboration

L2 SRE engineers; Site operations; Networking teams

Communication Scope

Technical Communication

Full Job Description

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation. You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

  • Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
  • Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
  • Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
  • Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
  • Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
  • Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
  • Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
  • Document architecture designs, operational procedures, and performance results.
  • Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
  • Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
  • Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
  • Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.

Skills & Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Experience with bare-metal cluster p


How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
