Replit

Engineering

Senior Site Reliability Engineer

Remote - Europe · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Site Reliability Engineer at Replit. Skills: Site Reliability Engineering, automation, infrastructure as code, observability, incident management, distributed systems, Kubernetes. Design and Implement Observability Solutions. Create dashboards and metrics that provide real-time visibility into system health and performance”

What You'll Achieve.

Ensure the reliability, scalability, and performance of Replit's infrastructure; enable our platform to scale efficiently while maintaining high availability; design and implement robust monitoring solutions; automate operational tasks; continuously improve our infrastructure's reliability and performance; provide real-time visibility into system health and performance; enable reliable and consistent deployments; automatically respond to common failure scenarios; maintain high reliability standards; prevent future occurrences of incidents; reduce Mean Time To Recovery (MTTR); resolve performance bottlenecks; optimize resource utilization; reduce latency and improve system efficiency

Industry & Context.

Engineering

Problems you'll solve

Ability to approach complex operational challenges systematically and devise effective solutions

What They're Looking For.

Must Have

4-8 years of experience in Site Reliability Engineering or similar roles (DevOps, Systems Engineering, Infrastructure Engineering); programming skills in languages commonly used for automation (Python, Go, or similar); deep understanding of distributed systems; experience with container orchestration platforms (Kubernetes) and cloud-native technologies; proven track record of implementing and maintaining monitoring/observability solutions; incident management skills with experience leading incident response; experience with infrastructure as code and configuration management tools

Nice to Have

Experience with Google Cloud Platform (GCP) services and tools; knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.)

What You'll Do.

Design and Implement Observability Solutions

Create dashboards and metrics that provide real-time visibility into system health and performance

Implement logging strategies that enable quick problem identification and resolution

Drive Automation and Infrastructure as Code

Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi

Design and maintain CI/CD pipelines that enable reliable and consistent deployments

Create self-healing systems that can automatically respond to common failure scenarios

Establish SLOs and SLIs

Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed
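For readers less familiar with the terms, the relationship between an SLI, an SLO, and the resulting error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `AvailabilitySLO` class, its method names, and the 99.9% target are hypothetical and do not come from Replit's posting or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical request-based availability objective (illustrative only)."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good: int, total: int) -> float:
        """The indicator: fraction of requests that succeeded."""
        if total == 0:
            return 1.0  # assumption: no traffic counts as meeting the objective
        return good / total

    def error_budget_remaining(self, good: int, total: int) -> float:
        """Unspent share of the error budget; negative when overspent."""
        if total == 0:
            return 1.0
        allowed_failures = (1.0 - self.target) * total
        actual_failures = total - good
        return 1.0 - actual_failures / allowed_failures


slo = AvailabilitySLO(target=0.999)
print(f"SLI: {slo.sli(999_500, 1_000_000):.4%}")
print(f"Budget left: {slo.error_budget_remaining(999_500, 1_000_000):.1%}")
```

In this sketch the remaining budget is what ties reliability to "innovation speed": a team with budget left can afford riskier releases, while a negative value suggests slowing down.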

Incident Management and Response

Lead incident response efforts, conducting thorough post-mortems and implementing improvements to prevent future occurrences

Develop and maintain runbooks for critical services

Build tools and processes that reduce Mean Time To Recovery (MTTR)
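As a rough illustration of the metric named above, MTTR can be computed as the average time between detection and resolution across incidents. The sketch below assumes that definition (some teams measure from the start of impact instead); the `mean_time_to_recovery` function and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved_at - detected_at) over resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Made-up example data: one 30-minute and one 90-minute incident.
sample = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
print(mean_time_to_recovery(sample))
```

Tracking this figure over time is one simple way to check whether new runbooks or tooling are actually shortening recovery.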

Performance Optimization

Identify and resolve performance bottlenecks across our infrastructure

Implement capacity planning strategies and optimize resource utilization

Work on reducing latency and improving system efficiency across global regions

How You'll Work.

Team & Collaboration

Collaborating effectively with cross-functional teams; working with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

Communication Scope

Ability to explain complex technical concepts to both technical and non-technical audiences

Full Job Description

Replit is the agentic software creation platform that enables anyone to build applications using natural language. With millions of users worldwide, Replit is democratizing software development by removing traditional barriers to application creation.

ABOUT THE ROLE: Join our Site Reliability Engineering team and help ensure the reliability, scalability, and performance of Replit's infrastructure that serves millions of developers worldwide. As a Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.

We are seeking SREs who are passionate about building and maintaining resilient systems at scale. Your mission will be to design and implement robust monitoring solutions, automate operational tasks, and continuously improve our infrastructure's reliability and performance.

YOU WILL:

- Design and Implement Observability Solutions: Develop comprehensive monitoring and alerting systems using modern observability tools. Create dashboards and metrics that provide real-time visibility into system health and performance. Implement logging strategies that enable quick problem identification and resolution.

- Drive Automation and Infrastructure as Code: Architect and implement infrastructure automation solutions using tools like Terraform, Ansible, or Pulumi. Design and maintain CI/CD pipelines that enable reliable and consistent deployments. Create self-healing systems that can automatically respond to common failure scenarios.

- Establish SLOs and SLIs: Work with product and engineering teams to define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Build systems to track and report on these metrics, ensuring we maintain high reliability standards while balancing innovation speed.

- Incident Management and Response: Lead incident response efforts, conducting thoroug

SKILL SIGNAL 46 detected · ranked by frequency

Kubernetes ×4

Site Reliability Engineering ×3

automation ×3

infrastructure as code ×3

observability ×3

incident management ×3

programming skills in languages commonly used for automation (Python, Go, or similar) ×3

Deep understanding of distributed systems ×3

Experience with container orchestration platforms (Kubernetes) and cloud-native technologies ×3

Proven track record of implementing and maintaining monitoring/observability solutions ×3

incident management skills with experience leading incident response ×3

Experience with infrastructure as code and configuration management tools ×3

Experience with Google Cloud Platform (GCP) services and tools ×3

Knowledge of modern observability platforms (Prometheus, Grafana, Datadog, etc.) ×3

distributed systems ×2

Terraform ×2

Ansible ×2

Pulumi ×2

Prometheus ×2

Grafana ×2

Datadog ×2

Python

DevOps

Systems Engineering

Infrastructure Engineering

configuration management

performance optimization

capacity planning

cloud-native technologies

container orchestration

monitoring

BEHAVIOURAL

Problem-solving mindset; self-directed; autonomous; collaborating effectively with cross-functional teams; continuous learning; passion for staying current with industry best practices and new technologies; belief in automating repetitive tasks and building self-healing systems

Role Details

Experience: 4–8 yrs

Level: Senior

Work Mode: Remote

Type: FULL TIME

Category: sre

AI-Extracted Insights

Domain Areas

distributed-systems, cloud-native-technologies

How to Apply on Ashby

Ashby is a fast modern ATS — most applications take under 3 minutes.
The resume parser is strong; verify parsed experience dates and job titles.
Custom screening questions are often scored algorithmically — answer completely.
Location field affects geo-based screening; use your actual metro area.
