Omilia
Information Technology and Services
Senior Site Reliability Engineer
Neural analysis suggests this role is optimal for Senior candidates.
“Senior Site Reliability Engineer at Omilia. Skills: Site Reliability Engineering, Cloud platform, Observability, Automation. Ensure platform reliability and availability.”
Industry & Context.
Problem management; Root cause analysis; Troubleshooting
On-call rotations
What They're Looking For.
Must Have
Bachelor's Degree or MS, Container orchestration cluster experience, Software for production services experience, ELK experience, AWS experience, Grafana/Prometheus stack experience, Bash, Python or Go scripting skills, Excellent communication skills, Thinking out of the box, Versatility, Being a team player
Nice to Have
Telephony knowledge, Linux Administration experience, Configuration Management tools experience, TCP/IP and networking knowledge, RDBMS knowledge, NoSQL knowledge
What You'll Do.
Ensure platform reliability
Ensure platform availability
First response for incidents
Contribute to problem management
Contribute to root cause analysis
Support development team reliability efforts
Create reliability culture
Develop troubleshooting documentation
Collaborate with Engineering teams
Develop optimised runbooks
Develop productive runbooks
Automate operational tasks
Collaborate with development teams
Collaborate with cloud engineering teams
Embed reliability into lifecycle
Embed performance into lifecycle
Design observability solutions
Implement observability solutions
Evolve observability solutions
Participate in on-call rotations
Improve alert quality
Improve response processes
Champion reliability culture
Champion performance culture
Champion continuous improvement
How You'll Work.
Team & Collaboration
Collaborate with team members; Collaborate with Engineering teams; Collaborate with development teams; Collaborate with cloud engineering teams; Involving product; Involving experience design; Involving engineering
Communication Scope
Excellent communication skills
Process & Methodology
Agile, Lean methods
Full Job Description
We are looking for a **Senior Site Reliability Engineer** with Cloud platform experience. You will be part of a team responsible for operating and maintaining production clusters and developing our observability solutions. You will collaborate with team members to develop automation strategies and monitoring and alerting, and to ensure overall platform reliability. Your goal will be to become an integral part of the team, making every challenge of the platform your own and solving it accordingly.

**Responsibilities**

* Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
* Act as first response for incidents, and contribute to problem management and root cause analysis.
* Support the development team's efforts towards reliability, creating a solid reliability culture within the development lifecycle.
* Develop troubleshooting documentation for production support resources.
* Collaborate with Engineering teams to develop optimised and productive runbooks, operational documentation and automation of operational tasks.
* Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
* Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
* Participate in on-call rotations and continuously improve alert quality and response processes.
* Champion a culture of reliability, performance, and continuous improvement across teams.

**Requirements**

* Bachelor's Degree or MS in Engineering or equivalent.
* Experience in operating at least one container orchestration cluster (Kubernetes, Docker Swarm).
* Experience developing or maintaining software for production services at scale.
* Experience with ELK.
* Experience with AWS.
* Experience with Grafana/Prometheus stack.
* Strong scripting skills (Bash, Python or Go).