Omilia

Information Technology and Services

Senior Site Reliability Engineer

₹25–45L (~AI est.) · Remote · Full Time · Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Site Reliability Engineer at Omilia. Skills: Site Reliability Engineering, Cloud platform, Observability, Automation. Ensure platform reliability and availability. Proactive monitoring”

What You'll Achieve.

Ensure platform reliability; Ensure platform availability; Improve alert quality; Improve response processes; Continuous improvement

Industry & Context.

Information Technology and Services
Problems you'll solve

Problem management; Root cause analysis; Troubleshooting; Anticipating challenges

Eligibility Requirements

On-call rotations

What They're Looking For.

Must Have

Bachelor's Degree or MS, Experience operating container orchestration cluster, Experience developing/maintaining software, Experience with ELK, Experience with AWS, Experience with Grafana/Prometheus stack, Scripting skills (Bash, Python or Go)

Nice to Have

Telephony knowledge (SIP, VoIP), Experience in Linux Administration, Working knowledge in Configuration Management tools, Experience with TCP/IP, RDBMS knowledge (MySQL, Postgres), NoSQL knowledge (Redis)

What You'll Do.

Ensure platform reliability and availability

First response for incidents

Contribute to problem management

Contribute to root cause analysis

Support development team's reliability efforts

Create reliability culture

Develop troubleshooting documentation

Collaborate with Engineering teams

Develop optimised runbooks

Develop operational documentation

Automate operational tasks

Collaborate with development teams

Collaborate with cloud engineering teams

Embed reliability into lifecycle

Embed performance into lifecycle

Design observability solutions

Implement observability solutions

Evolve observability solutions

Participate in on-call rotations

Improve alert quality

Improve response processes

Champion reliability culture

Champion performance culture

Champion continuous improvement culture

How You'll Work.

Team & Collaboration

Collaborate with team members; Collaborate with Engineering teams; Collaborate with development teams; Collaborate with cloud engineering teams; Involving product; Involving experience design; Involving engineering

Communication Scope

Excellent communication skills

Process & Methodology

Agile, Lean methods

Full Job Description

We are looking for a **Senior Site Reliability Engineer** with Cloud platform experience. This individual will be part of a team responsible for operating and maintaining production clusters and developing our observability solutions. They will collaborate with team members to develop automation strategies and monitoring & alerting, and to ensure overall platform reliability. Your goal will be to become an integral part of the team, treating every challenge of the platform as your own and solving it accordingly.

**Responsibilities**

* Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation.
* Act as first response for incidents and contribute to problem management and root cause analysis.
* Support the development team's efforts towards reliability, creating a solid reliability culture within the development lifecycle.
* Develop troubleshooting documentation for production support resources.
* Collaborate with Engineering teams to develop optimised and productive runbooks, operational documentation and automation of operational tasks.
* Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle.
* Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK.
* Participate in on-call rotations and continuously improve alert quality and response processes.
* Champion a culture of reliability, performance, and continuous improvement across teams.

**Requirements**

* Bachelor's Degree or MS in Engineering or equivalent.
* Experience in operating at least one container orchestration cluster (Kubernetes, Docker Swarm).
* Experience developing or maintaining software for production services at scale.
* Experience with ELK.
* Experience with AWS.
* Experience with Grafana/Prometheus stack.
* Strong scripting skills (Bash, Python or Go).
