Company

Technology

Site Reliability Engineer (SRE)

$210–350k ~AI est. · Brazil · FULL TIME · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Site Reliability Engineer (SRE). Skills: Reliability Engineering, Automation, Observability, Incident Management. Build and maintain reliable systems.”

Industry & Context.

Technology

Problems you'll solve

Structured problem-solving; Analytical thinking; Troubleshooting

Eligibility Requirements

On-call coordination

What They're Looking For.

Must Have

Deep infrastructure understanding, Solid automation skills, Proactive mindset, Reliability and scalability focus, Experience as SRE/DevOps/Backend/Platform Engineer, Knowledge of Kubernetes, Knowledge of Docker, Knowledge of cloud-native architectures, Solid experience with observability tools, Understanding of Linux systems, Understanding of networking, Understanding of HTTP, Understanding of DNS, Understanding of TLS/SSL, Proficiency in Python scripting, Proficiency in Shell scripting, Experience with distributed systems, Experience with incident management, Experience with troubleshooting, Familiarity with CI/CD pipelines, Familiarity with infrastructure automation, Familiarity with Git workflows, Analytical thinking, Autonomy, Structured problem-solving skills, Clear communication skills, Ability to collaborate across engineering teams

Nice to Have

Knowledge of reliability engineering concepts, Experience with high-availability systems, Experience with production-scale environments, Familiarity with AIOps, Familiarity with OpenTelemetry, Familiarity with chaos engineering

What You'll Do.

Build reliable systems

Maintain reliable systems

Improve operational maturity

Define reliability standards

Lead incident management practices

Drive automation initiatives

Reduce operational toil

Increase system resilience

Operate with error budget principles

Design high availability strategies

Implement high availability strategies

Design disaster recovery strategies

Implement disaster recovery strategies

Design resilience strategies

Implement resilience strategies

Build observability platforms

Evolve observability platforms

Lead incident response processes

Coordinate on-call rotations

Manage escalation flows

Perform root cause analysis

Conduct post-mortem reviews

Implement preventive actions

Optimize system performance

Perform capacity planning

Analyze infrastructure

Drive automation solutions

Implement self-healing solutions

Eliminate repetitive operational tasks

Apply AI-driven approaches

Perform anomaly detection

Collaborate with development teams

Improve system reliability

Improve deployment safety

Ensure security best practices

Ensure compliance best practices

Ensure operational best practices

How You'll Work.

Team & Collaboration

Across engineering teams

Communication Scope

Clear communication

Full Job Description

## Accountabilities

In this role, you will be responsible for building and maintaining highly reliable systems while continuously improving operational maturity across engineering teams. You will define reliability standards, lead incident management practices, and drive automation initiatives that reduce operational toil and increase system resilience.

  • Define and track SLI, SLO, and SLA metrics, operating with error budget principles
  • Design and implement high availability, disaster recovery, and resilience strategies (RTO/RPO)
  • Build and evolve observability platforms (logs, metrics, traces, alerts, dashboards)
  • Lead incident response processes, including on-call coordination and escalation flows
  • Perform root cause analysis (RCA) and post-mortem reviews with preventive actions
  • Optimize system performance through capacity planning, tuning, and infrastructure analysis
  • Drive automation and self-healing solutions to eliminate repetitive operational tasks
  • Apply AI-driven approaches (AIOps) for anomaly detection, log analysis, and troubleshooting
  • Collaborate with development teams to improve system reliability and deployment safety
  • Ensure security, compliance, and operational best practices in production environments

## Requirements

We are looking for a strong technical profile with deep infrastructure understanding, solid automation skills, and a proactive mindset focused on reliability and scalability.

  • Experience as an SRE, DevOps, or Backend/Platform Engineer in production environments
  • Strong knowledge of Kubernetes, Docker, and cloud-native architectures
  • Solid experience with observability tools (Grafana, Prometheus, ELK, Datadog, or similar)
  • Strong understanding of Linux systems, networking, HTTP, DNS, and TLS/SSL
  • Proficiency in scripting/automation using Python, Shell, or similar languages
  • Experience with distributed systems, incident management, and troubleshooting
  • Familiarity with CI/CD pipelines, infrastructure automation, and Git workflows
  • Knowledge of reliability

How to Apply on Lever

  • Lever uses a streamlined one-page form — apply in under 5 minutes.
  • LinkedIn import works well; review parsed data before submitting.
  • The cover letter field is optional but visible to reviewers — use it to differentiate.
  • Referral codes from employees can significantly boost visibility of your application.
