Together AI
Technology
Staff Engineer, Distributed Storage and HPC & AI Infrastructure
“Staff Engineer, Distributed Storage and HPC & AI Infrastructure at Together AI. Skills: Distributed Storage, Kubernetes Storage, High-Performance Computing. Design multi-petabyte AI/ML storage. Integrate WekaFS, Ceph, etc.”
What You'll Achieve.
Achieve 30-50% cost savings; Optimize end-to-end data paths; Ensure 99.9%+ uptime
Industry & Context.
Troubleshoot storage; Troubleshoot TCP/IP; Root cause analysis; Profiling; Automated remediation
What They're Looking For.
Must Have
8+ years of storage engineering experience
3+ years managing distributed storage
Proven track record deploying and operating high-performance storage
Deep Kubernetes and cloud-native storage experience in production
Go and Python coding skills for building production-grade automation, operators, and tooling
BS/MS in Computer Science or Engineering, or equivalent practical experience
History of technical leadership
Track record of designing systems that improve performance (>3x), reliability (99.9%+ uptime), and cost efficiency
Deep expertise in parallel filesystems at multi-petabyte scale (WekaFS, Lustre, GPFS, BeeGFS)
Production experience with object storage (S3, MinIO, Ceph, R2), including performance optimization and cost management
Kubernetes CSI drivers, StatefulSets, PersistentVolumes, storage operators, and custom controllers
Storage optimization for GPU workloads
RDMA/InfiniBand networking
Parallel filesystem optimization
Terraform, Ansible, Helm, GitOps
Advanced knowledge of the Linux storage stack: filesystems, LVM, NVMe optimization, and RAID configurations
Prometheus, Grafana, and Thanos architecture and operations
Nice to Have
GPU Direct Storage (GDS)
NVMe-oF
Storage networking (100GbE/400GbE)
ML/AI storage patterns
Kubernetes operator development
Storage snapshots, cloning, and thin provisioning
Backup and disaster recovery
Storage encryption, security, and compliance
Storage benchmarking and profiling tools
What You'll Do.
Design multi-petabyte AI/ML storage
Lead capacity planning
Optimize costs (30-50% savings)
Tune for max throughput/min latency
Implement NVMe-oF/iSCSI
Troubleshoot/optimize TCP/IP for storage
Build Kubernetes storage operators/controllers
Enable automated provisioning
Enable self-service abstractions
Enable multi-tenant isolation
Create reusable Helm/Terraform patterns
Deliver 10-50 GB/s per GPU
Optimize caching (weights/datasets/checkpoints)
Optimize parallel filesystems
Optimize data for AI workloads
Troubleshoot with profiling
Scale to thousands of nodes
Build multi-tier caches
Optimize data locality
Optimize model-weight distribution
Implement smart prefetching/eviction
Remediate proactively/automatically
Mentor on storage best practices
Contribute to open-source
Write public learnings
How You'll Work.
Team & Collaboration
Partner with ML/SRE; Mentor on storage best practices
Communication Scope
Technical documentation; Public learnings
Process & Methodology
Capacity planning, Cost optimization, Lifecycle policies, Roadmap planning
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.