Company

Lead Site Reliability Engineer

Pune, India FULL TIME Remote Friendly

The Brief

“Lead Site Reliability Engineer. Skills: Site Reliability Engineering, Cloud Platform Technologies, Distributed Systems, Kubernetes, Automation, Monitoring, Incident Response. Co-develop and participate in the full lifecycle development of cloud platform services from inception and design, deployment, operation, and improvement. Increase the effectiveness, reliability, availability and performance of cloud platform technologies”

What You'll Achieve.

Increase the effectiveness, reliability, availability and performance of cloud platform technologies; ensure that the cloud platform technologies are maintained properly; advise the cloud platform team on improving the reliability of the systems in production and scaling them based on need; drive efficiencies in systems and processes

Industry & Context.

Problems you'll solve

identifying performance bottlenecks; identifying anomalous system behavior; determining the root cause of incidents; root cause analysis

Eligibility Requirements

Participate in on-call rotation for cloud platform technologies

What They're Looking For.

Must Have

10+ years of relevant experience in running distributed systems at scale in production

Expertise in one of the programming languages: Java, Python or Go

Proficient in writing bash scripts

Good understanding of SQL and NoSQL systems

Good understanding of systems programming (network stack, file system, OS services)

Excellent knowledge of Kubernetes

Understanding of network elements such as firewalls, load balancers, DNS, NAT, TLS/SSL, VLANs etc.

Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and determining the root cause of incidents

Knowledge of JVM concepts like garbage collection, heap, stack, profiling, class loading, etc.

Knowledge of best practices related to security, performance, high-availability, and disaster recovery

Demonstrate a proven record of handling production issues, planning escalation procedures, conducting post-mortems, impact analysis, risk assessments and other related procedures

Able to drive results and set priorities independently

Hands-on knowledge of creating effective monitoring and observability dashboards, and creating alerts to prevent production outages

BS/MS degree in Computer Science, Applied Math, or related field

Nice to Have

Experience with managing large scale deployments of search engines like Elasticsearch

Experience with managing large scale deployments of message-oriented middleware such as Kafka

Experience with managing large scale deployments of RDBMS systems such as Oracle

Experience with managing large scale deployments of NoSQL databases such as Cassandra

Experience with managing large scale deployments of in-memory caching using Redis, Memcached, etc.

Experience with container and orchestration technologies such as Docker, Kubernetes etc.

Experience with monitoring tools such as Graphite, ClickHouse, Grafana and Prometheus

Experience with observability tools like AppDynamics, Dynatrace or similar tools

Experience with HashiCorp technologies such as Consul, Vault, Terraform and Vagrant

Experience with configuration management tools such as Chef, Puppet or Ansible

In-depth experience with continuous integration and continuous deployment pipelines

Exposure to Maven, Ant or Gradle for builds

What You'll Do.

Co-develop and participate in the full lifecycle development of cloud platform services from inception and design, deployment, operation, and improvement

Increase the effectiveness, reliability, availability and performance of cloud platform technologies

Support cloud platform team before the technologies are pushed for production release

Ensure that the cloud platform technologies are maintained properly

Advise the cloud platform team to improve the reliability of the systems in production and scale them based on need

Participate in the development process by supporting new features, services, releases

Develop tools and automate the process for achieving large scale provisioning and deployment of cloud platform technologies

Participate in on-call rotation for cloud platform technologies

Lead incident response

Be part of writing detailed postmortem analysis reports

Propose improvements and drive efficiencies in systems and processes

Work with the engineering team by participating in architecture and design reviews for system improvements

Create tech-debt tasks for engineering team for increasing site reliability in production

How You'll Work.

Team & Collaboration

Co-develop and participate in the full lifecycle development of cloud platform services; support the cloud platform team before the technologies are pushed for production release; advise the cloud platform team to improve the reliability of the systems; participate in the development process by supporting new features, services, releases; work with the engineering team by participating in architecture and design reviews for system improvements; create tech-debt tasks for engineering team

Process & Methodology

Set priorities independently

Full Job Description

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

- Co-develop and participate in the full lifecycle development of cloud platform services from inception and design, deployment, operation, and improvement by applying scientific principles.
- Increase the effectiveness, reliability, availability and performance of cloud platform technologies by identifying and measuring key indicators, making changes to the production systems in an automated way and evaluating the results.
- Support the cloud platform team before the technologies are pushed for production release through activities such as system design, capacity planning, automation of key deployments, building a strategy for production monitoring and alerting, and participating in the testing/verification process.
- Ensure that the cloud platform technologies are maintained properly by measuring and monitoring availability, latency, performance, and system health.
- Advise the cloud platform team to improve the reliability of the systems in production and scale them based on need.
- Participate in the development process by supporting new features, services, releases and hold an ownership mindset for the cloud platform technologies.
- Develop tools and automate the process for achieving large scale provisioning and deployment of cloud platform technologies.
- Participate in on-call rotation for cloud platform technologies. At times of incidents, lead incident response and be part of writing detailed postmortem analysis reports which are brutally honest with no-blame.
- Propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting and root cause analysis.
- Work with the engineering team by participating in architecture and design reviews for system improvements. Create tech-debt tasks for engineering team for increasing site reliability in production.

SKILL SIGNAL 68 detected · ranked by frequency

Automation ×5

Monitoring ×5

Incident Response ×5

Kubernetes ×4

cloud platform services development ×3

system design ×3

alerting ×3

testing/verification ×3

measuring availability, latency, performance, and system health ×3

scaling services ×3

large scale provisioning ×3

postmortem analysis ×3

efficiency improvements ×3

architecture reviews ×3

design reviews ×3

tech-debt task creation ×3

Site Reliability Engineering ×2

Cloud Platform Technologies ×2

Distributed Systems ×2

Java

Python

bash

SQL

NoSQL

firewalls

load balancers

DNS

NAT

TLS/SSL

VLANs

JVM

BEHAVIOURAL

teamwork

ownership mindset

no-blame

Role Details

Seniority mid

Experience 10–15 yrs

Level Lead

Work Mode Remote Friendly

Type FULL TIME

Education BS/MS degree in Computer Science, Applied Math, or related field

AI-Extracted Insights

Domain Areas

cloud-platform-technologies

distributed-systems-at-scale-in-production

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
