Company

Technology

Observability Specialist

India · Full Time · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Observability Specialist. Skills: OpenTelemetry, Prometheus, Grafana, Jaeger, Loki, Python, Java, Go, SLI/SLO frameworks, incident management. Design and implement scalable observability architecture using OpenTelemetry for distributed systems and AI-driven platforms. Build and maintain metrics, logging, and tracing infrastructure. Define and enforce instrumentation standards. Implement distributed tracing and context propagation. Develop dashboards, SLIs/SLOs, and alerting systems. Create custo”

What You'll Achieve.

Opportunity to influence reliability standards and observability practices at scale.

Industry & Context.

Technology

Problems you'll solve

distributed systems; root cause analysis; troubleshooting

What They're Looking For.

Must Have

Bachelor’s degree in Computer Science or a related technical field.

5+ years of experience in SRE, observability, or platform engineering roles in distributed systems environments.

Strong hands-on expertise with OpenTelemetry, including metrics, logs, and tracing.

Experience with monitoring and visualization tools such as Prometheus, Grafana, and alerting frameworks.

Strong knowledge of distributed tracing tools such as Jaeger, Zipkin, or equivalent systems.

Experience with log aggregation tools like ELK stack, Loki, or similar solutions.

Proficiency in Python, Java, or Go for instrumentation and automation.

Strong understanding of SLI/SLO frameworks, alerting strategies, and incident management practices.

Nice to Have

Familiarity with Kubernetes observability, service mesh telemetry, and cloud-native architectures is a plus. Exposure to AI/ML observability, LLM monitoring, or enterprise ERP systems is highly valued.

What You'll Do.

Design and implement scalable observability architecture using OpenTelemetry for distributed systems and AI-driven platforms.

Build and maintain metrics, logging, and tracing infrastructure.

Define and enforce instrumentation standards.

Implement distributed tracing and context propagation.

Develop dashboards, SLIs/SLOs, and alerting systems.

Create custom metrics and telemetry for AI agent behavior, LLM performance, and system-level insights.

Design alerting strategies, escalation paths, and incident response workflows.

Support root cause analysis and production troubleshooting.

How You'll Work.

Team & Collaboration

Experience working in distributed global teams.

Communication Scope

Strong analytical, debugging, and communication skills.

Full Job Description

## Accountabilities

Design and implement scalable observability architecture using OpenTelemetry for distributed systems and AI-driven platforms.

Build and maintain metrics, logging, and tracing infrastructure using tools such as Prometheus, Grafana, Jaeger, Loki, and related stacks.

Define and enforce instrumentation standards across Java, Python, and web-based applications.

Implement distributed tracing and context propagation across microservices, MCP workflows, and ERP system integrations.

Develop dashboards, SLIs/SLOs, and alerting systems to monitor platform health, performance, and reliability.

Create custom metrics and telemetry for AI agent behavior, LLM performance, and system-level insights.

Design alerting strategies, escalation paths, and incident response workflows to reduce noise and improve reliability.

Support root cause analysis and production troubleshooting using observability data and structured diagnostics.

Requirements:

Bachelor’s degree in Computer Science or a related technical field.

5+ years of experience in SRE, observability, or platform engineering roles in distributed systems environments.

Strong hands-on expertise with OpenTelemetry, including metrics, logs, and tracing.

Experience with monitoring and visualization tools such as Prometheus, Grafana, and alerting frameworks.

Strong knowledge of distributed tracing tools such as Jaeger, Zipkin, or equivalent systems.

Experience with log aggregation tools like ELK stack, Loki, or similar solutions.

Proficiency in Python, Java, or Go for instrumentation and automation.

Strong understanding of SLI/SLO frameworks, alerting strategies, and incident management practices.

Familiarity with Kubernetes observability, service mesh telemetry, and cloud-native architectures is a plus.

Exposure to AI/ML observability, LLM monitoring, or enterprise ERP systems is highly valued.

Strong analytical, debugging, and communication skills with experience working in distributed global teams.

Benefits: Opportu

SKILL SIGNAL 15 detected · ranked by frequency

OpenTelemetry ×4

Prometheus ×4

Grafana ×4

Jaeger ×4

Loki ×4

Python ×4

Java ×4

Go ×4

SLI/SLO frameworks ×3

incident management ×3

OpenTelemetry, Prometheus, Grafana, Jaeger, Loki, Python, Java, Go, SLI/SLO frameworks, incident management ×2

Kubernetes

service mesh

AI-driven platforms, enterprise ERP systems, AI/ML observability, LLM monitoring

OpenTelemetry, Prometheus, Grafana, Jaeger, Loki, ELK stack, Zipkin

BEHAVIOURAL

analytical, debugging, communication

Role Details

Seniority: Senior

Experience: 5–10 yrs

Level: Senior

Work Mode: Remote

Type: Full Time

Education: Bachelor

Category: Software

How to Apply on Lever

Lever uses a streamlined one-page form — apply in under 5 minutes.
LinkedIn import works well; review parsed data before submitting.
The cover letter field is optional but visible to reviewers — use it to differentiate.
Referral codes from employees can significantly boost visibility of your application.
