Coupang

E-commerce

Sr Staff System Engineer, GPU Fleet

₹75–120L (~AI est.) · Bengaluru, India · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Sr Staff System Engineer, GPU Fleet at Coupang. Skills: GPU Fleet Architecture, System Reliability, Large-scale Automation. Own end-to-end technical architecture. Define fleet architecture”

What You'll Achieve.

Improve fleet reliability; Improve fleet availability; Improve fleet performance; Improve operating efficiency; Improve scalability; Ensure customer success

Industry & Context.

E-commerce
Problems you'll solve

Root-cause analysis; Debug complex issues; Troubleshooting

Eligibility Requirements

On-call rotations

What They're Looking For.

Must Have

12+ years overall experience, 8+ years Linux systems engineering, 8+ years infrastructure engineering, 8+ years datacenter operations, Linux system internals expertise, Hardware-intensive infrastructure experience, Bare-metal servers at scale experience, Debug complex issues across system layers, Production-grade automation using Python, Production-grade automation using Bash, Design observable systems, Design resilient systems, Design safe systems under failure

Nice to Have

Direct experience operating large-scale GPU fleets, Experience with GPU drivers, Experience with CUDA, Experience with NCCL, Experience with high-performance compute workloads, Experience with high-speed interconnects, Experience with NVLink, Experience with InfiniBand, Experience with RDMA, Experience with high-throughput Ethernet, Prior ownership of fleet-wide initiatives, Prior ownership of platform-wide initiatives, Partnering directly with hardware vendors, Troubleshoot systemic issues with vendors, Influence future platform designs with vendors, Intuition for failure modes at scale, Act as technical authority, Act as escalation point for ambiguous problems, Mentor engineers through design reviews, Mentor engineers through technical problem solving, Mentor engineers through modelling operational ownership, Experience responding to high-severity incidents, Clear ownership during incidents, Urgency during incidents, Technical leadership during incidents, Written communication skills, Verbal communication skills, Clear post-incident reviews, Technical documentation

What You'll Do.

Own end-to-end technical architecture

Define fleet architecture

Drive reliability at scale

Drive automation at scale

Lead operation of GPU systems

Lead evolution of GPU systems

Define technical direction for GPU fleets

Define hardware platform selection

Define firmware strategy

Define OS configuration

Define driver strategy

Define networking strategy

Define observability strategy

Enforce technical standards

Enforce best practices for reliability

Enforce best practices for availability

Enforce best practices for performance

Enforce best practices for operability

Lead new GPU platform bring-ups

Lead multi-generation hardware transitions

Lead architectural redesigns

Evaluate cost trade-offs

Evaluate performance trade-offs

Evaluate reliability trade-offs

Evaluate time-to-deploy trade-offs

Make technically sound decisions

Set fleet-level reliability objectives

Drive fleet-level reliability objectives

Set fleet-level availability objectives

Drive fleet-level availability objectives

Set fleet-level performance objectives

Drive fleet-level performance objectives

Lead root-cause analysis

Lead resolution of systemic failures

Identify recurring failure patterns

Drive long-term fixes

Resolve platform-level issues with vendors

Influence future hardware designs

Design large-scale automation systems

Build large-scale automation systems

Automate GPU fleet provisioning

Automate GPU fleet lifecycle management

Automate GPU health validation

Automate GPU diagnostics

Automate GPU certification

Automate remediation workflows

Automate recovery workflows

Automate replacement workflows

Eliminate manual operational toil

Build durable tooling

Build well-designed tooling

Ensure fleet systems are observable

Ensure fleet systems are testable

Ensure fleet systems are resilient

Act as senior escalation point

Participate in on-call rotations

Prevent future incidents

Lead high-severity post-incident reviews

Translate learnings into improvements

Provide technical mentorship

Provide technical guidance

Serve as trusted technical partner

Influence infrastructure roadmap

Make data-driven recommendations

How You'll Work.

Team & Collaboration

Cross-functional influence; Partner with AI labs; Partner with governments; Partner with enterprises; Partner with platform engineering; Partner with networking; Partner with datacenter operations; Partner with leadership teams

Communication Scope

Post-incident reviews; Technical documentation

Process & Methodology

Roadmap planning

Full Job Description

Please complete the attached Internal Transfer Request Form and submit. Please make sure to apply with your Coupang e-mail address.

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did I ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people that like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day. Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large‑scale AI training and inference workloads. This is a hands‑on, staff‑level individual contributor role with broad technical ownership, high operational impact, and significant cross‑functional influence across hardware,

How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
