Site Reliability Engineer (SRE)
Neural analysis suggests this role is optimal for Senior candidates.
“Site Reliability Engineer (SRE). Skills: Reliability Engineering, Automation, Observability, Incident Management. Build and maintain reliable systems.”
Industry & Context.
Structured problem-solving; Analytical thinking; Troubleshooting
On-call coordination
What They're Looking For.
Must Have
Deep infrastructure understanding, Solid automation skills, Proactive mindset, Reliability and scalability focus, Experience as SRE/DevOps/Backend/Platform Engineer, Knowledge of Kubernetes, Knowledge of Docker, Knowledge of cloud-native architectures, Solid experience with observability tools, Understanding of Linux systems, Understanding of networking, Understanding of HTTP, Understanding of DNS, Understanding of TLS/SSL, Proficiency in Python scripting, Proficiency in Shell scripting, Experience with distributed systems, Experience with incident management, Experience with troubleshooting, Familiarity with CI/CD pipelines, Familiarity with infrastructure automation, Familiarity with Git workflows, Analytical thinking, Autonomy, Structured problem-solving skills, Clear communication skills, Ability to collaborate across engineering teams
Nice to Have
Knowledge of reliability engineering concepts, Experience with high-availability systems, Experience with production-scale environments, Familiarity with AIOps, Familiarity with OpenTelemetry, Familiarity with chaos engineering
What You'll Do.
Build reliable systems
Maintain reliable systems
Improve operational maturity
Define reliability standards
Lead incident management practices
Drive automation initiatives
Reduce operational toil
Increase system resilience
Operate with error budget principles
Design high availability strategies
Implement high availability strategies
Design disaster recovery strategies
Implement disaster recovery strategies
Design resilience strategies
Implement resilience strategies
Build observability platforms
Evolve observability platforms
Lead incident response processes
Coordinate on-call rotations
Manage escalation flows
Perform root cause analysis
Conduct post-mortem reviews
Implement preventive actions
Optimize system performance
Perform capacity planning
Analyze infrastructure
Drive automation solutions
Implement self-healing solutions
Eliminate repetitive operational tasks
Apply AI-driven approaches
Perform anomaly detection
Collaborate with development teams
Improve system reliability
Improve deployment safety
Ensure security best practices
Ensure compliance best practices
Ensure operational best practices
How You'll Work.
Team & Collaboration
Across engineering teams
Communication Scope
Clear communication
Full Job Description
## Accountabilities

In this role, you will be responsible for building and maintaining highly reliable systems while continuously improving operational maturity across engineering teams. You will define reliability standards, lead incident management practices, and drive automation initiatives that reduce operational toil and increase system resilience.

- Define and track SLI, SLO, and SLA metrics, operating with error budget principles
- Design and implement high availability, disaster recovery, and resilience strategies (RTO/RPO)
- Build and evolve observability platforms (logs, metrics, traces, alerts, dashboards)
- Lead incident response processes, including on-call coordination and escalation flows
- Perform root cause analysis (RCA) and post-mortem reviews with preventive actions
- Optimize system performance through capacity planning, tuning, and infrastructure analysis
- Drive automation and self-healing solutions to eliminate repetitive operational tasks
- Apply AI-driven approaches (AIOps) for anomaly detection, log analysis, and troubleshooting
- Collaborate with development teams to improve system reliability and deployment safety
- Ensure security, compliance, and operational best practices in production environments

## Requirements

We are looking for a strong technical profile with deep infrastructure understanding, solid automation skills, and a proactive mindset focused on reliability and scalability.

- Experience as an SRE, DevOps, or Backend/Platform Engineer in production environments
- Strong knowledge of Kubernetes, Docker, and cloud-native architectures
- Solid experience with observability tools (Grafana, Prometheus, ELK, Datadog, or similar)
- Strong understanding of Linux systems, networking, HTTP, DNS, and TLS/SSL
- Proficiency in scripting/automation using Python, Shell, or similar languages
- Experience with distributed systems, incident management, and troubleshooting
- Familiarity with CI/CD pipelines, infrastructure automation, and Git workflows
- Knowledge of reliability
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.