Coupang
E-commerce
Sr Staff System Engineer, GPU Fleet
Neural analysis suggests this role is optimal for Senior candidates.
“Sr Staff System Engineer, GPU Fleet at Coupang. Skills: GPU Fleet Architecture, System Reliability, Large-scale Automation. Own end-to-end technical architecture. Define fleet architecture.”
What You'll Achieve.
Improve fleet reliability; Improve fleet availability; Improve fleet performance; Improve operating efficiency; Improve scalability; Ensure customer success
Industry & Context.
Root-cause analysis; Debug complex issues; Troubleshooting
On-call rotations
What They're Looking For.
Must Have
12+ years overall experience, 8+ years Linux systems engineering, 8+ years infrastructure engineering, 8+ years datacenter operations, Linux system internals expertise, Hardware-intensive infrastructure experience, Bare-metal servers at scale experience, Debug complex issues across system layers, Production-grade automation using Python, Production-grade automation using Bash, Design observable systems, Design resilient systems, Design safe systems under failure
Nice to Have
Direct experience operating large-scale GPU fleets, Experience with GPU drivers, Experience with CUDA, Experience with NCCL, Experience with high-performance compute workloads, Experience with high-speed interconnects, Experience with NVLink, Experience with InfiniBand, Experience with RDMA, Experience with high-throughput Ethernet, Prior ownership of fleet-wide initiatives, Prior ownership of platform-wide initiatives, Partnering directly with hardware vendors, Troubleshoot systemic issues with vendors, Influence future platform designs with vendors, Intuition for failure modes at scale, Act as technical authority, Act as escalation point for ambiguous problems, Mentor engineers through design reviews, Mentor engineers through technical problem solving, Mentor engineers through modelling operational ownership, Experience responding to high-severity incidents, Clear ownership during incidents, Urgency during incidents, Technical leadership during incidents, Written communication skills, Verbal communication skills, Clear post-incident reviews, Technical documentation
What You'll Do.
Own end-to-end technical architecture
Define fleet architecture
Drive reliability at scale
Drive automation at scale
Lead operation of GPU systems
Lead evolution of GPU systems
Define technical direction for GPU fleets
Define hardware platform selection
Define firmware strategy
Define OS configuration
Define driver strategy
Define networking strategy
Define observability strategy
Enforce technical standards
Enforce best practices for reliability
Enforce best practices for availability
Enforce best practices for performance
Enforce best practices for operability
Lead new GPU platform bring-ups
Lead multi-generation hardware transitions
Lead architectural redesigns
Evaluate cost trade-offs
Evaluate performance trade-offs
Evaluate reliability trade-offs
Evaluate time-to-deploy trade-offs
Make technically sound decisions
Set fleet-level reliability objectives
Drive fleet-level reliability objectives
Set fleet-level availability objectives
Drive fleet-level availability objectives
Set fleet-level performance objectives
Drive fleet-level performance objectives
Lead root-cause analysis
Lead resolution of systemic failures
Identify recurring failure patterns
Drive long-term fixes
Resolve platform-level issues with vendors
Influence future hardware designs
Design large-scale automation systems
Build large-scale automation systems
Automate GPU fleet provisioning
Automate GPU fleet lifecycle management
Automate GPU health validation
Automate GPU diagnostics
Automate GPU certification
Automate remediation workflows
Automate recovery workflows
Automate replacement workflows
Eliminate manual operational toil
Build durable tooling
Build well-designed tooling
Ensure fleet systems are observable
Ensure fleet systems are testable
Ensure fleet systems are resilient
Act as senior escalation point
Participate in on-call rotations
Prevent future incidents
Lead high-severity post-incident reviews
Translate learnings into improvements
Provide technical mentorship
Provide technical guidance
Serve as trusted technical partner
Influence infrastructure roadmap
Make data-driven recommendations
How You'll Work.
Team & Collaboration
Cross-functional influence; Partner with AI labs; Partner with governments; Partner with enterprises; Partner with platform engineering; Partner with networking; Partner with datacenter operations; Partner with leadership teams
Communication Scope
Post-incident reviews; Technical documentation
Process & Methodology
Roadmap planning
Full Job Description
Please complete the attached Internal Transfer Request Form and submit. Please make sure to apply with your Coupang e-mail address.
Company Introduction
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did I ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people that like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview
We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large-scale AI training and inference workloads. This is a hands-on, staff-level individual contributor role with broad technical ownership, high operational impact, and significant cross-functional influence across hardware,
How to Apply on Greenhouse
- Create a Greenhouse profile before applying — it saves time across multiple applications.
- Upload your resume as a PDF; the parser handles it better than Word.
- Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
- Enable email notifications to track application status in real time.