Thoughtworks

Senior Service Reliability Engineer

Singapore

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Service Reliability Engineer at Thoughtworks. Skills: Site Reliability Engineering, automation, monitoring, incident response, observability, reliability, resilience, system performance, container orchestration, cloud platforms. Improve site reliability by building mechanisms/architectures that enable fault tolerance and faster median time to respond and median time to detect. Drive the integration of observability automation into the CI/CD pipeline.”

What You'll Achieve.

meet and exceed reliability and business objectives; achieve application availability with minimum/no disruption (99.999%); reduce unnecessary toil; improve process efficiency

Industry & Context.

Problems you'll solve

solving challenging problems; debugging difficult issues; analyzing logs; identifying root causes

Eligibility Requirements

Singapore Citizen or Permanent Resident (PR) of Singapore; willing to be part of a rotation- and need-based 24x7 available team

What They're Looking For.

Must Have

hands-on experience in programming and scripting languages such as Python, Go or Bash

good understanding of at least one public cloud, e.g. AWS, Azure or GCP

exposure to observability tools such as Grafana, Datadog, NewRelic, ELK Stack, Dynatrace or equivalent

proficient in using data from these tools to dissect and identify root causes of system and infrastructure issues

familiar with DevOps and GitOps practices

good knowledge of container-based architecture and orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.

understand technical architecture and modern design patterns, including microservices, serverless functions, NoSQL and RESTful APIs

experience in fixing bugs, analyzing logs, building metrics and operational dashboards

familiar with creating infrastructure resources that improve system reliability and follow the Cloud’s Well Architected Framework principles: reliability, security, cost optimization, performance efficiency and operational excellence

communication and articulation skills, proficient in English

good people skills with an emphasis on negotiation and close collaboration with multiple cross-functional teams from the client side and/or Thoughtworks

solve challenging problems and difficult-to-debug issues with a never-give-up attitude

ability to work under pressure and with composure during production incidents

confidently recommend improvements backed by technical arguments to client stakeholders or application development teams

able to understand requirements provided by the client on both technical and business aspects and break them down for successful implementation

drive and ownership mentality, with a willingness to sign up for and deliver work when called upon, without being too concerned about role boundaries

willing to be part of a rotation- and need-based 24x7 available team

What You'll Do.

improve site reliability by building mechanisms/architectures that enable fault tolerance and faster median time to respond and median time to detect

drive the integration of observability automation into the CI/CD pipeline

handle production incidents

manage incident communication with clients

draft root cause analysis documents

monitor performance of production systems

improve system scaling to ensure business goals are met within expected SLA and SLO metrics

improve system observability across multiple facets such as logging and metrics

implement chaos engineering practices

set up processes for regular chaos testing

How You'll Work.

Team & Collaboration

close collaboration with multiple cross-functional teams from the client side and/or Thoughtworks; work closely with application development teams as advisors on improving system reliability; assist in implementing reliability improvements

Communication Scope

communication; articulation; English proficiency; client communication; technical argument recommendation

Full Job Description

Due to the nature of the projects and specific client security clearance, this role requires the successful candidate to be a Singapore Citizen or PR of Singapore.

As a Senior Service Reliability Engineer (SRE) you will take a multifaceted approach to ensure technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience and system performance, you take a lead role in championing the principles of Site Reliability Engineering. By strategically integrating automation, monitoring and incident response, you facilitate the evolution from traditional operations to a more customer-focused and agile approach. Emphasizing shared responsibility and a commitment to continuous improvement, you cultivate a collaborative culture, enabling organizations to meet and exceed their reliability and business objectives.

Job responsibilities

You will improve site reliability by building mechanisms/architectures that enable fault tolerance and faster median time to respond and median time to detect.

You will drive the integration of observability automation into the CI/CD pipeline.

You will handle production incidents, manage incident communication with clients and draft root cause analysis documents.

You will monitor performance of production systems and improve their scaling to ensure business goals are met within expected SLA and SLO metrics.

You will work closely with application development teams as advisors on improving system reliability and assist in implementing reliability improvements.

You will improve system observability across multiple facets such as logging and metrics, reducing false alarms to eliminate unnecessary toil and improving process efficiency.

You will implement chaos engineering practices as necessary to test system reliability, setting up processes for such testing to be done regularly.

You have a clear understanding of client goals and business needs and set direction for site reliability.

SKILL SIGNAL 46 detected · ranked by frequency

automation ×5

monitoring ×5

incident response ×5

observability ×5

container orchestration ×5

programming ×3

scripting ×3

CI/CD integration ×3

performance monitoring ×3

system scaling ×3

logging ×3

metrics ×3

chaos engineering ×3

infrastructure resource creation ×3

Site Reliability Engineering ×2

reliability ×2

resilience ×2

system performance ×2

cloud platforms ×2

Grafana ×2

Datadog ×2

NewRelic ×2

ELK Stack ×2

Dynatrace ×2

Kubernetes ×2

AWS EKS ×2

Docker Swarm ×2

Nomad ×2

Python

Bash

AWS

BEHAVIOURAL

collaboration

problem-solving

never give up attitude

working under pressure

composure

ownership mentality

adaptability

Role Details

Experience 5–10 yrs

Level Senior

Category 1z_twsgp-sl-damo

AI-Extracted Insights

Domain Areas

client-goals

business-needs

application-availability

sla

slo-metrics
