Coupang

e-commerce

Staff Cloud Backend Engineer

Bengaluru, India

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is
optimal for Lead candidates.

The Brief

“Staff Cloud Backend Engineer at Coupang. Skills: Observability, Site Reliability Engineering (SRE), Kubernetes, Go, Python, Distributed systems, Cloud-native architectures, Platform operations. Own the design and operation of scalable observability platforms to ensure the reliability, performance, and availability of datacentre services. Apply SRE best practices, automation, and performance optimization to deliver resilient infrastructure.”

Industry & Context.

e-commerce

Problems you'll solve

Solve problems; Troubleshooting; Root cause analysis

What They're Looking For.

Must Have

8–12 years of progressive software engineering experience, with a heavy emphasis on distributed systems, cloud-native architectures, or platform operations

Proficiency in Go or Python

Deep understanding of networked systems and performance optimization

Expert-level knowledge of Kubernetes internals (scheduling, controllers) and containerization ecosystems

Proven experience with load balancing, service mesh, and request routing at scale

An "ownership" mindset with a track record of maintaining mission-critical, high-availability systems in production

Nice to Have

Prior experience building infrastructure specifically for LLM inference or large-scale training clusters

Familiarity with inference, including mixed precision, kernel tuning, or custom hardware accelerators

Experience managing hybrid-cloud or multi-AZ deployments across AWS, Azure, or GCP

Experience operating in regulated environments with strict security and compliance requirements

What You'll Do.

Own the design and operation of scalable observability platforms to ensure the reliability, performance, and availability of datacentre services

Apply SRE best practices, automation, and performance optimization to deliver resilient infrastructure

Design, implement, and maintain observability solutions for datacentre infrastructure

Develop, deploy, and maintain the operational and reliability components of a large-scale Observability and Telemetry collection platform

Participate in and enhance the entire lifecycle of services, from inception and design to deployment

Develop and optimize monitoring systems

Create and manage dashboards, alerts, and reports

Implement SRE best practices to improve the reliability and performance of datacentre services

Develop and maintain automation scripts for infrastructure provisioning

Conduct root cause analysis and post-mortem reviews

Analyze and optimize the performance of datacentre systems and applications

Implement best practices for resource utilization and efficiency

Ensure that observability and reliability solutions comply with security policies and industry standards

Implement and maintain security measures to protect data and infrastructure

Provide support for observability and reliability-related issues

Develop and maintain documentation for troubleshooting procedures and best practices

Stay updated with the latest advancements in observability and SRE technologies and integrate them into the infrastructure

Continuously improve the reliability and performance of datacentre services

How You'll Work.

Team & Collaboration

Partner closely with engineering teams and vendors to drive operational excellence; work closely with other engineering teams to understand and meet their observability and reliability requirements; collaborate with hardware and software vendors to evaluate and integrate new technologies

Full Job Description

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day. Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional trade-offs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

As a Staff Data Centre Observability and Site Reliability Engineer, you will own the design and operation of scalable observability platforms to ensure the reliability, performance, and availability of datacentre services. You will apply SRE best practices, automation, and performance optimization to deliver resilient infrastructure. This role partners closely with engineering teams and vendors to drive operational excellence while maintaining security and compliance standards.

What You Will Do

Observability and Monitoring:

• Design, implement, and maintain observability solutions for datacentre infrastructure.

• Develop, deploy, and maintain the operational and reliability components of a large-sca


SKILL SIGNAL 36 detected · ranked by frequency

Kubernetes ×4

Observability ×3

Site Reliability Engineering (SRE) ×3

Go ×3

Python ×3

Design, implement, and maintain observability solutions ×3

Develop, deploy, and maintain operational and reliability components of a large-scale Observability and Telemetry collection platform ×3

Develop and optimize monitoring systems ×3

Create and manage dashboards, alerts, and reports ×3

Implement SRE best practices ×3

Develop and maintain automation scripts ×3

Analyze and optimize performance of datacentre systems and applications ×3

Implement best practices for resource utilization and efficiency ×3

Provide support for observability and reliability-related issues ×3

Debugging hardware and software problems ×3

Distributed systems ×2

Cloud-native architectures ×2

Platform operations ×2

AWS

Azure

GCP

LLM inference

large-scale training clusters

hybrid-cloud

multi-AZ deployments

Performance optimization

Automation

Operational excellence

Security

Compliance

Troubleshooting

Root cause analysis

BEHAVIOURAL

Ownership mindset

Collaboration

Role Details

Experience 8–12 yrs

Level Lead

Work Mode Hybrid

Education Bachelor’s or Master’s degree in Computer Science, Engineeri

Category coupang-intelligent-cloud-(cic)

AI-Extracted Insights

Domain Areas

datacentre-services

observability

telemetry-collection

llm-inference

large-scale-training-clusters
