Coupang

Senior Staff System Engineer, GPU Fleet

₹65–95L (AI estimate) · Bengaluru, India · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Staff System Engineer, GPU Fleet at Coupang. Skills: GPU Fleet, System architecture, Reliability, Automation. Own end-to-end technical architecture. Define technical direction.”

What You'll Achieve.

Improve fleet reliability; Improve fleet availability; Improve fleet performance; Improve operating efficiency; Improve scalability; Ensure customer success

Industry & Context.

Problems you'll solve

Root cause analysis; Debugging; Troubleshooting; Failure analysis

Eligibility Requirements

On-call rotations

What They're Looking For.

Must Have

12+ years overall experience, 8+ years Linux systems engineering, 8+ years infrastructure engineering, 8+ years datacenter operations, Linux system internals expertise, Hardware-intensive infrastructure production experience, Bare-metal servers at scale experience, Debug complex issues across system layers, Production-grade automation using Python, Production-grade automation using Bash

Nice to Have

Experience operating large-scale GPU fleets, Experience with GPU drivers, Experience with CUDA, Experience with NCCL, Experience with high-performance compute workloads, Experience with high-speed interconnects, Experience with NVLink, Experience with InfiniBand, Experience with RDMA, Experience with high-throughput Ethernet, Prior ownership of fleet-wide initiatives, Prior ownership of platform-wide initiatives, Experience partnering with hardware vendors, Experience troubleshooting systemic issues, Experience influencing future platform designs, Intuition for failure modes at scale, Act as technical authority, Act as escalation point, Mentor engineers through design reviews, Mentor engineers through technical problem solving, Mentor engineers through modelling operational ownership, Respond to high-severity production incidents, Clear post-incident reviews, Technical documentation

What You'll Do.

Own end-to-end technical architecture

Define technical direction

Define fleet architecture

Define technical standards

Enforce technical standards

Lead fleet-wide initiatives

Bring up new GPU platforms

Transition hardware generations

Redesign architectures

Make technically sound decisions

Set reliability objectives

Drive reliability objectives

Set availability objectives

Drive availability objectives

Set performance objectives

Drive performance objectives

Lead root-cause analysis

Resolve systemic failures

Identify failure patterns

Drive long-term fixes

Resolve platform-level issues

Influence future hardware designs

Design automation systems

Build automation systems

Manage GPU fleet lifecycle

Remediate failures automatically

Recover systems automatically

Replace components automatically

Eliminate manual toil

Design durable tooling

Build scalable tooling

Ensure systems are observable

Ensure systems are testable

Ensure systems are resilient

Act as senior escalation point

Participate in on-call rotations

Prevent future incidents

Lead post-incident reviews

Translate learnings into improvements

Provide technical mentorship

Guide system engineers

Guide infrastructure engineers

Serve as technical partner

Influence infrastructure roadmap

How You'll Work.

Team & Collaboration

Cross-functional influence; Partner with AI labs; Partner with enterprises; Partner with platform engineering; Partner with networking; Partner with datacenter operations

Communication Scope

Written communication; Verbal communication; Post-incident reviews; Technical documentation

Process & Methodology

Roadmap planning

Full Job Description

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did I ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people that like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day. Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

We are seeking a Sr Staff System Engineer, GPU Fleet for our Coupang Intelligent Cloud (CIC) team, to serve as the senior technical owner for our hyperscale GPU compute infrastructure. In this role, you will define fleet architecture, drive reliability and automation at scale, and lead the operation and evolution of GPU systems supporting large‑scale AI training and inference workloads. This is a hands‑on, staff‑level individual contributor role with broad technical ownership, high operational impact, and significant cross‑functional influence across hardware, infrastructure, and datacenter operations.

CIC builds the infrastructure for abundant intelligence. We partner with leading AI labs, governmen
