Eli Lilly and Company
Healthcare
Principal Platform Reliability Engineer
Neural analysis suggests this role is optimal for Senior candidates.
“Principal Platform Reliability Engineer at Eli Lilly and Company. Skills: Platform Reliability, Site Reliability Engineering, Cloud Environments, Observability. Define and implement SLOs. Implement reliability standards”
Industry & Context.
Root cause analysis; Troubleshooting complex issues; Troubleshooting performance issues
In office 3 days a week
What They're Looking For.
Must Have
Bachelor's degree in Computer Science, 7+ years of hands-on experience with AWS, Extensive experience with Kubernetes, Experience operating distributed systems, Experience in incident management, Experience defining SLOs, Hands-on experience with observability tools, Experience building CI/CD pipelines, Proficiency with Infrastructure as Code tools, Experience with scripting in Python, Experience with networking fundamentals, Experience with cloud architecture fundamentals, Experience implementing security best practices, Experience troubleshooting complex issues
Nice to Have
Experience with ArgoCD, Experience with GitHub Actions, Familiarity with large-scale enterprise platforms, Experience in regulated industries, Exposure to global support models, Written communication skills
What You'll Do.
Define and implement SLOs
Implement reliability standards
Drive resilience through capacity planning
Design failover strategies
Design disaster recovery strategies
Lead response for P1/P2 incidents
Mitigate incidents rapidly
Recover systems rapidly
Conduct root cause analysis
Implement corrective actions
Develop operational standards
Maintain operational standards
Implement observability frameworks
Optimize observability frameworks
Improve system visibility
Leverage platforms for telemetry
Build CI/CD pipelines
Maintain CI/CD pipelines
Drive adoption of IaC
Drive adoption of GitOps
Support SRE principles integration
Implement secure-by-design practices
Support vulnerability remediation
Ensure secure configurations
Align with security standards
Align with compliance standards
Partner with engineering teams
Improve platform reliability
Improve platform performance
Improve deployment practices
Provide technical guidance
Provide mentorship to engineers
Communicate system health
Communicate incident impact
How You'll Work.
Team & Collaboration
Partner with engineering teams; Technical guidance to engineers; Communicate with stakeholders
Communication Scope
Incident updates; Postmortems; Status summaries
Process & Methodology
GitOps practices
Full Job Description
At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

Eli Lilly and Company seeks a Platform Site Reliability Engineer to join the Software Product Engineering (SPE) Customer Operations team. You will design, operate, and continuously improve highly available, scalable, and fault-tolerant systems across cloud environments. You will play a critical role in establishing reliability standards, driving operational excellence, and enabling engineering teams to build and deploy with confidence.

**What You’ll Do:**

* Define and implement SLOs, SLIs, and reliability standards that establish a consistent foundation for platform health, driving resilience through capacity planning, failover design, and disaster recovery strategies
* Lead response for P1/P2 incidents, owning rapid mitigation and recovery while conducting thorough root cause analysis and implementing corrective actions that prevent recurrence
* Develop and maintain runbooks, playbooks, and operational standards that enable the broader engineering organization to respond effectively and consistently
* Implement and optimize observability frameworks spanning monitoring, logging, tracing, and alerting, improving system visibility and reducing alert noise through actionable, signal-driven insights
* Leverage platforms such as Splunk, Prometheus, CloudWatch, or equivalent tooling to ensure teams have the telemetry they need to detect, diagnose, and resolve issues proactively
* Build and maintain CI/CD pipelines and deployment au
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.