Company
DevOps
Site Reliability Engineer (SRE)
Neural analysis suggests this role is optimal for mid-level candidates.
“Site Reliability Engineer (SRE). Skills: Site Reliability Engineering, DevOps, Cloud Infrastructure, Observability. Design monitoring systems. Maintain monitoring systems”
Industry & Context.
Troubleshooting; Root-cause analysis; Anomaly detection
On-call rotations
What They're Looking For.
Must Have
3+ years of experience; Hands-on experience with AWS; Solid Linux administration; Experience with Docker; Experience with Kubernetes; Experience with Terraform; Proficiency in Python; Proficiency in Bash; Understanding of networking fundamentals; Understanding of infrastructure security; Experience supporting production systems; Participation in incident response
Nice to Have
Experience operating edge computing; Experience operating IoT deployments; Familiarity with zero-trust access; Experience in security operations; Experience in threat detection; Experience in infrastructure security; Exposure to AI infrastructure; Exposure to LLM-based applications; Exposure to workflow automation; Knowledge of AI-Ops; Knowledge of anomaly detection; Knowledge of intelligent monitoring; Familiarity with ISO 27001
What You'll Do.
Design monitoring systems
Maintain monitoring systems
Design alerting systems
Maintain alerting systems
Design dashboarding systems
Maintain dashboarding systems
Build visibility into metrics
Build visibility into logs
Build visibility into traces
Build visibility into analytics
Define reliability targets
Manage reliability targets
Develop proactive monitoring
Develop anomaly detection
Deploy containerized workloads
Manage containerized workloads
Optimize containerized workloads
Maintain cloud infrastructure
Improve system performance
Improve system availability
Improve operational efficiency
Support infrastructure provisioning
Implement access controls
Implement audit mechanisms
Monitor cybersecurity threats
Monitor unauthorized access
Monitor service disruptions
Develop alerting procedures
Develop response procedures
Contribute to best practices
Automate operational tasks
Support CI/CD workflows
Support deployment automation
Promote documentation
Promote operational standards
Promote continuous improvement
Participate in on-call
Participate in incident management
Lead troubleshooting efforts
Conduct root-cause analysis
Conduct post-mortem reviews
Drive long-term improvements
Work closely with software teams
Work closely with AI teams
Work closely with ML teams
Work closely with hardware teams
Work closely with product teams
Ensure services are production-ready
Support cloud environments
Support edge computing environments
How You'll Work.
Team & Collaboration
Cross-functional teams; Software teams; AI teams; Machine learning teams; Hardware teams; Product teams
Full Job Description
## What you will do

### Reliability & Observability
- Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
- Build visibility into system health through metrics, logs, traces, and performance analytics.
- Define and manage SLIs, SLOs, and service reliability targets.
- Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

### Cloud Infrastructure & Platform Operations
- Deploy, manage, and optimize containerized workloads running on Kubernetes.
- Maintain scalable cloud infrastructure across production environments.
- Improve system performance, availability, and operational efficiency.
- Support infrastructure provisioning through Infrastructure-as-Code practices.

### Security & Access Management
- Implement secure access controls and audit mechanisms across infrastructure environments.
- Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
- Develop alerting and response procedures for security-related incidents.
- Contribute to operational security best practices and governance initiatives.

### Automation & Engineering Excellence
- Automate repetitive operational tasks to reduce manual effort and improve reliability.
- Build tooling and scripts to streamline infrastructure operations.
- Support CI/CD workflows and deployment automation.
- Promote documentation, operational standards, and continuous improvement.

### Incident Response & Reliability Engineering
- Participate in on-call rotations and incident management.
- Lead troubleshooting efforts during production incidents.
- Conduct root-cause analysis and post-mortem reviews.
- Drive long-term improvements that enhance system resilience.

### Cross-Functional Collaboration
- Work closely with software, AI, machine learning, hardware, and product teams.
- Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
- Support the operational needs of both cloud-based and distributed edge computing
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.