Playon
high school sports
Senior Site Reliability Engineer
“Senior Site Reliability Engineer at Playon. Skills: Site Reliability Engineering, System Observability, Automation, Cloud Infrastructure Management, CI/CD Pipeline Management, Incident Response. Focus: building the tools, automation, and visibility that enable our teams to deliver resilient software at scale, and evolving our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices.”
What You'll Achieve.
strengthen the reliability, performance, and scalability of our systems; enable our teams to deliver resilient software at scale; establish the foundation for stronger observability across our platforms; broader reliability and performance initiatives; better understand system health; catch issues earlier and respond faster; make reliability checks part of our normal release workflow; aligning the team on what good performance and availability look like; improve how we communicate, coordinate, and follow up during incidents; free up engineers for more impactful work; shape how we measure, monitor, and improve reliability across all services; setting standards; mentoring others; helping engineering teams make data-driven decisions about performance and stability; support high service availability; proactive incident prevention; rapid response to incidents
Industry & Context.
problem-solver who approaches reliability as a shared responsibility across engineering; solving real challenges
Participate in on-call rotations
What They're Looking For.
Must Have
Solid experience in Python, especially for automation, tooling, and data-driven operational tasks
Proficiency in at least one of Java, C++, or Go
Understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and modern deployment practices (Docker, Kubernetes, Terraform)
Experience with CI/CD pipelines, version control, and automated testing frameworks
Experience with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for diagnosing issues
Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation
Demonstrated ability to collaborate with cross-functional teams and communicate clearly in high-impact situations
A problem-solver who approaches reliability as a shared responsibility across engineering
Nice to Have
Experience writing or maintaining end-to-end or integration tests for distributed systems
Background in performance testing, capacity planning, or chaos engineering
Contributions to internal developer tooling or reliability-focused frameworks
Exposure to security, compliance, or change management processes in production environments
Relevant certifications
Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow
What You'll Do.
Build the tools, automation, and visibility that enable our teams to deliver resilient software at scale
Evolve our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices
Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements
Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services
Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes
Clarify what 'reliable' means: Help define initial SLIs and SLOs for a few core user flows
Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents
Reduce manual effort: Automate routine checks and monitoring tasks
Contribute to system observability, i.e. implementing … and dashboards for better insight and faster recovery
… and monitoring solutions to support high service availability
Drive operational excellence through proactive incident prevention, blameless postmortems, and capacity planning
Participate in on-call rotations to support critical services and ensure rapid response to incidents
How You'll Work.
Team & Collaboration
work closely with application engineers, DevOps, and QA teams; Assess and improve visibility: Work with engineering teams; Partner with application and quality engineering teams; Demonstrated ability to collaborate with cross-functional teams; approaches reliability as a shared responsibility across engineering
Communication Scope
communicate clearly in high-impact situations
Full Job Description
## Description
Playon is looking for an experienced Senior Site Reliability Engineer to help us strengthen the reliability, performance, and scalability of our systems. This role sits at the intersection of software engineering and operations, focused on building the tools, automation, and visibility that enable our teams to deliver resilient software at scale. You'll work closely with application engineers, DevOps, and QA teams to evolve our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices. This is a hands-on engineering role with a strong emphasis on automation, performance analysis, and continuous improvement.

The Outcomes You'll Deliver: In the first few months, you'll focus on building a clear understanding of our systems and establishing the foundation for stronger observability across our platforms. As you settle in, your scope will grow to include broader reliability and performance initiatives.
• Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements that help us better understand system health.
• Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services so we can catch issues earlier and respond faster.
• Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes to make reliability checks part of our normal release workflow.
• Clarify what "reliable" means: Help define initial SLIs and SLOs for a few core user flows, aligning the team on what good performance and availability look like.
• Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents.
• Reduce manual effort: Automate routine checks and monitoring tasks to free up engineers for more impactful work.

Over time, you'll take on a larger role shaping how we measure…