Coupang

e-commerce

Senior Staff Cloud Backend Engineer - Observability and Site Reliability

Bengaluru, India

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Staff Cloud Backend Engineer - Observability and Site Reliability at Coupang. Skills: Observability, Site Reliability Engineering (SRE), Cloud Backend Engineering, Datacenter Infrastructure, Monitoring, Logging, Alerting, Telemetry, Automation, Performance Optimization, Scalability, Reliability. Design, implement, and maintain observability solutions for datacenter infrastructure, including monitoring, logging, alerting, and telemetry systems. Develop, deploy, and operate large-scale obse”

What You'll Achieve.

ensure high availability, reliability, and performance of infrastructure; provide clear visibility into system health, performance, and capacity trends; improve reliability, scalability, and operational efficiency of datacenter services; prevent recurrence and improve system resilience; enhance performance, efficiency, and resource utilization; safeguard infrastructure, systems, and data; support efficient issue resolution; continuously enhance the scalability, reliability, and operational efficiency of datacenter services through proactive improvements

Industry & Context.

e-commerce

Problems you'll solve

solve problems; debugging complex hardware and software problems; root cause analysis (RCA)

Eligibility Requirements

on-call

What They're Looking For.

Must Have

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field

12+ years of progressive software engineering experience, with a heavy emphasis on distributed systems, cloud-native architectures, or platform operations

Proven experience in managing and optimizing large-scale datacenter environments

Proficiency in Go or Python, with a deep understanding of networked systems and performance optimization

Expert-level knowledge of Kubernetes internals (scheduling, controllers) and containerization ecosystems

Proven experience with load balancing, service mesh, and request routing at scale

Proficiency in observability tools and technologies (e.g., Prometheus, Grafana, ELK Stack)

Experience with SRE practices and tools (e.g., Kubernetes, Docker, Terraform)

Familiarity with cloud platforms (AWS, Azure, GCP) and their observability and reliability services

Nice to Have

Prior experience building infrastructure specifically for LLM inference or large-scale training clusters

Familiarity with inference, including mixed precision, kernel tuning, or custom hardware accelerators

Experience managing hybrid-cloud or multi-AZ deployments across AWS, Azure, or GCP

Experience operating in regulated environments with strict security and compliance requirements

What You'll Do.

Design, implement, and maintain observability solutions for datacenter infrastructure, including monitoring, logging, alerting, and telemetry systems

Develop, deploy, and operate large-scale observability and telemetry platforms with a focus on real-time monitoring

Own and contribute to the full lifecycle of observability services, from design and development to deployment and ongoing optimization

Build and enhance monitoring systems to ensure high availability, reliability, and performance of infrastructure

Create and manage dashboards and reports to provide clear visibility into system health, performance, and capacity trends

Apply SRE principles and best practices to improve reliability, scalability, and operational efficiency of datacenter services

Develop and maintain automation for infrastructure provisioning, monitoring, and system management

Lead root cause analysis (RCA) and post-incident reviews, driving corrective actions to prevent recurrence and improve system resilience

Analyze system and application performance across the datacenter infrastructure to identify bottlenecks and improvement areas

Implement optimization strategies to enhance performance, efficiency, and resource utilization

Ensure observability and reliability solutions adhere to organizational security policies and industry standards

Implement and maintain appropriate security controls to safeguard infrastructure, systems, and data

Provide hands-on support for observability and reliability issues, including debugging complex hardware and software problems

Develop and maintain documentation, including troubleshooting guides and operational best practices, to support efficient issue resolution

Stay current with emerging trends and technologies in observability and SRE, and incorporate them into the platform

Continuously enhance the scalability, reliability, and operational efficiency of datacenter services through proactive improvements

How You'll Work.

Team & Collaboration

Collaborate with cross-functional engineering teams to understand observability and reliability requirements and deliver effective solutions; Collaborate with hardware and software vendors to evaluate, integrate, and optimize new technologies within the ecosystem; Partner with cross-functional engineering teams

Full Job Description

Company Introduction

We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies that established an unparalleled reputation for being a dominant and reliable force in South Korean commerce.

We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have been since our inception. We are all entrepreneurs surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people that like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day. Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.

Role Overview

As a Senior Staff Data Centre Observability and Site Reliability Engineer, you will design, build, and operate scalable observability and reliability solutions for large-scale datacenter infrastructure. This role focuses on developing high-performance monitoring and telemetry platforms, ensuring system reliability, and driving operational excellence through automation, performance optimization, and SRE best practices. The ideal candidate will work across the full service lifecycle (design, deployment, and continuous improvement) while collaborating with cross-functional teams to enhance visibility, resilience, and efficiency of critical systems.

What You Will Do

Observability and Monitoring

Design, imp


SKILL SIGNAL 43 detected · ranked by frequency

Performance Optimization ×3

designing, building, and operating scalable observability and reliability solutions ×3

developing high-performance monitoring and telemetry platforms ×3

ensuring system reliability ×3

automation for infrastructure provisioning, monitoring, and system management ×3

root cause analysis (RCA) ×3

post-incident reviews ×3

analyzing system and application performance ×3

implementing optimization strategies ×3

debugging complex hardware and software problems ×3

Observability ×2

Site Reliability Engineering (SRE) ×2

Cloud Backend Engineering ×2

Datacenter Infrastructure ×2

Monitoring ×2

Logging ×2

Alerting ×2

Telemetry ×2

Automation ×2

Scalability ×2

Reliability ×2

Kubernetes ×2

Docker ×2

Terraform ×2

Prometheus ×2

Grafana ×2

ELK Stack ×2

Python

AWS

Azure

GCP

BEHAVIOURAL

collaboration, entrepreneurial spirit, ambitious, hands-on impact

Role Details

Experience 12+ yrs

Level Senior

Work Mode Hybrid

Education Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field

Category coupang-intelligent-cloud-(cic)

AI-Extracted Insights

Domain Areas

distributed-systems, cloud-native-architectures, platform-operations, large-scale-datacenter-environments, networked-systems, kubernetes-internals, containerization-ecosystems, load-balancing
