Coupang
Senior Staff System Engineer, GPU Fleet
“Senior Staff System Engineer, GPU Fleet at Coupang. Skills: GPU Fleet, System architecture, Reliability, Automation. Own end-to-end technical architecture. Define technical direction”
What You'll Achieve.
Improve fleet reliability; Improve fleet availability; Improve fleet performance; Improve operating efficiency; Improve scalability; Ensure customer success
Industry & Context.
Root cause analysis; Debugging; Troubleshooting; Failure analysis
On-call rotations
What They're Looking For.
Must Have
12+ years overall experience, 8+ years Linux systems engineering, 8+ years infrastructure engineering, 8+ years datacenter operations, Linux system internals expertise, Hardware-intensive infrastructure production experience, Bare-metal servers at scale experience, Debug complex issues across system layers, Production-grade automation using Python, Production-grade automation using Bash
Nice to Have
Experience operating large-scale GPU fleets, Experience with GPU drivers, Experience with CUDA, Experience with NCCL, Experience with high-performance compute workloads, Experience with high-speed interconnects, Experience with NVLink, Experience with InfiniBand, Experience with RDMA, Experience with high-throughput Ethernet, Prior ownership of fleet-wide initiatives, Prior ownership of platform-wide initiatives, Experience partnering with hardware vendors, Experience troubleshooting systemic issues, Experience influencing future platform designs, Intuition for failure modes at scale, Act as technical authority, Act as escalation point, Mentor engineers through design reviews, Mentor engineers through technical problem solving, Mentor engineers through modelling operational ownership, Respond to high-severity production incidents, Clear post-incident reviews, Technical documentation
What You'll Do.
Own end-to-end technical architecture
Define technical direction
Define fleet architecture
Define technical standards
Enforce technical standards
Lead fleet-wide initiatives
Bring up new GPU platforms
Transition hardware generations
Redesign architectures
Make technically sound decisions
Set reliability objectives
Drive reliability objectives
Set availability objectives
Drive availability objectives
Set performance objectives
Drive performance objectives
Lead root-cause analysis
Resolve systemic failures
Identify failure patterns
Drive long-term fixes
Resolve platform-level issues
Influence future hardware designs
Design automation systems
Build automation systems
Manage GPU fleet lifecycle
Remediate failures automatically
Recover systems automatically
Replace components automatically
Eliminate manual toil
Design durable tooling
Build scalable tooling
Ensure systems are observable
Ensure systems are testable
Ensure systems are resilient
Act as senior escalation point
Participate in on-call rotations
Prevent future incidents
Lead post-incident reviews
Translate learnings into improvements
Provide technical mentorship
Guide system engineers
Guide infrastructure engineers
Serve as technical partner
Influence infrastructure roadmap
How You'll Work.
Team & Collaboration
Cross-functional influence; Partner with AI labs; Partner with enterprises; Partner with platform engineering; Partner with networking; Partner with datacenter operations
Communication Scope
Written communication; Verbal communication; Post-incident reviews; Technical documentation
Process & Methodology
Roadmap planning
Full Job Description
Company Introduction
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did I ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people that like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview
We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large‑scale AI training and inference workloads. This is a hands‑on, staff‑level individual contributor role with broad technical ownership, high operational impact, and significant cross‑functional influence across hardware, infrastructure, and datacenter operations.
CIC builds the infrastructure for abundant intelligence. We partner with leading AI labs, governmen