Together AI

Technology

Staff Engineer, Distributed Storage and HPC & AI Infrastructure

$250–300k · San Francisco, California, United States · FULL TIME · Remote Friendly

The Brief

“Staff Engineer, Distributed Storage and HPC & AI Infrastructure at Together AI. Skills: Distributed Storage, Kubernetes Storage, High-Performance Computing. Design multi-petabyte AI/ML storage. Integrate WekaFS, Ceph, etc.”

What You'll Achieve.

Achieve 30-50% cost savings; Optimize end-to-end data paths; Ensure 99.9%+ uptime
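As a rough illustration of what the 99.9%+ uptime target implies, the sketch below converts an availability percentage into an allowed-downtime budget. The function name and the 30-day and 365-day periods are assumptions chosen for illustration; the listing does not say how uptime is measured.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime allowed over a period at a given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


# Assumed measurement windows: a 30-day month and a 365-day year.
monthly = downtime_budget_minutes(99.9, 30)
yearly = downtime_budget_minutes(99.9, 365)
print(f"{monthly:.1f} min/month, {yearly / 60:.2f} h/year")
```

At 99.9% this works out to roughly 43 minutes per 30-day month, which is why the role pairs the target with automated remediation.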

Industry & Context.

Technology

Problems you'll solve

Troubleshoot storage; Troubleshoot TCP/IP; Root cause analysis; Profiling; Automated remediation

What They're Looking For.

Must Have

8+ years of storage engineering; 3+ years managing distributed storage; proven track record deploying and operating high-performance storage

BS/MS in Computer Science/Engineering or equivalent practical experience; history of technical leadership

Designed systems that improved performance (>3x), reliability (99.9%+ uptime), and cost efficiency

Deep expertise in parallel filesystems (WekaFS, Lustre, GPFS, BeeGFS) at multi-petabyte scale; parallel filesystem optimization

Production experience with object storage (S3, MinIO, Ceph, R2), including performance optimization and cost management

Deep Kubernetes/cloud-native storage experience in production: CSI drivers, StatefulSets, PersistentVolumes, storage operators, custom controllers

Storage optimization for GPU workloads; RDMA/InfiniBand networking

Go and Python for automation, operators, and production-grade tooling

Terraform, Ansible, Helm, GitOps

Advanced Linux storage stack knowledge: filesystems, LVM, NVMe optimization, RAID configurations

Prometheus, Grafana, and Thanos architecture and operations

Nice to Have

GPUDirect Storage (GDS); NVMe-oF; storage networking (100GbE/400GbE); ML/AI storage patterns; Kubernetes operator development; storage snapshots, cloning, and thin provisioning; backup and disaster recovery; storage encryption, security, and compliance; storage benchmarking and profiling tools

What You'll Do.

Design multi-petabyte AI/ML storage

Lead capacity planning

Optimize costs (30-50% savings)

Tune for max throughput/min latency

Implement NVMe-oF/iSCSI

Troubleshoot/optimize TCP/IP for storage

Build Kubernetes storage operators/controllers

Enable automated provisioning

Enable self-service abstractions

Enable multi-tenant isolation

Create reusable Helm/Terraform patterns

Deliver 10-50 GB/s per GPU

Optimize caching (weights/datasets/checkpoints)

Optimize parallel filesystems

Optimize data for AI workloads

Troubleshoot with profiling

Scale to thousands of nodes

Build multi-tier caches

Optimize data locality

Optimize model-weight distribution

Implement smart prefetching/eviction

Remediate proactively/automatically

Mentor on storage best practices

Contribute to open-source

Write public learnings
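To put the "10-50 GB/s per GPU" and "thousands of nodes" targets above in perspective, here is a minimal sizing sketch. The cluster shape (8 GPUs per node, 1,000 nodes) and the `concurrency` factor are hypothetical values chosen for illustration; none of them come from the listing.

```python
def aggregate_gb_per_s(per_gpu_gb_s: float, gpus_per_node: int, nodes: int,
                       concurrency: float = 1.0) -> float:
    """Estimate the aggregate storage throughput a cluster would need.

    concurrency is the assumed fraction of GPUs reading at full rate at once;
    1.0 gives the worst-case upper bound.
    """
    return per_gpu_gb_s * gpus_per_node * nodes * concurrency


# Hypothetical cluster: 1,000 nodes with 8 GPUs each.
low = aggregate_gb_per_s(10, gpus_per_node=8, nodes=1000)
high = aggregate_gb_per_s(50, gpus_per_node=8, nodes=1000)
print(f"Upper bound: {low / 1000:.0f}-{high / 1000:.0f} TB/s aggregate")
```

Under these assumptions the worst case reaches tens to hundreds of TB/s, which suggests why the responsibilities lean on multi-tier caches, data locality, and prefetching, not only on raw backend bandwidth.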

How You'll Work.

Team & Collaboration

Partner with ML/SRE; Mentor on storage best practices

Communication Scope

Technical documentation; Public learnings

Process & Methodology

Capacity planning, Cost optimization, Lifecycle policies, Roadmap planning

Skill Signal 72 detected

Core

Kubernetes Storage ×6

Infrastructure as Code ×4

Linux storage stack ×4

Storage benchmarking ×4

Storage profiling ×4

Required

Distributed storage systems ×3

Parallel filesystems ×3

Object storage ×3

Storage optimization ×3

RDMA networking ×3

InfiniBand networking ×3

Parallel filesystem optimization ×3

Distributed Storage ×2

High-Performance Computing ×2

WekaFS ×2

Nice to have

Python

MinIO

CSI drivers

StatefulSets

PersistentVolumes

ext4

xfs

Behavioural

Leadership

Mentoring

Role Details

Work Mode

Remote

Type

FULL TIME

Experience

8–10 yrs

Salary Band

200k+
