Playon

high school sports

Senior Site Reliability Engineer

Remote · Full Time · Remote Friendly

The Brief

“Senior Site Reliability Engineer at Playon. Skills: Site Reliability Engineering, System Observability, Automation, Cloud Infrastructure Management, CI/CD Pipeline Management, Incident Response. The role focuses on building the tools, automation, and visibility that enable our teams to deliver resilient software at scale, and on evolving our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices.”

What You'll Achieve.

strengthen the reliability, performance, and scalability of our systems; enable our teams to deliver resilient software at scale; establish the foundation for stronger observability across our platforms; broader reliability and performance initiatives; better understand system health; catch issues earlier and respond faster; make reliability checks part of our normal release workflow; aligning the team on what good performance and availability look like; improve how we communicate, coordinate, and follow up during incidents; free up engineers for more impactful work; shape how we measure, monitor, and improve reliability across all services; setting standards; mentoring others; helping engineering teams make data-driven decisions about performance and stability; support high service availability; proactive incident prevention; rapid response to incidents

Industry & Context.

high school sports
Problems you'll solve

problem-solver who approaches reliability as a shared responsibility across engineering; solving real challenges

Eligibility Requirements

Participate in on-call rotations

What They're Looking For.

Must Have

  • Solid experience in Python, especially for automation, tooling, and data-driven operational tasks
  • Proficiency in at least one of Java, C++, or Go
  • Understanding of Linux systems, cloud infrastructure (AWS, GCP, or Azure), and modern deployment practices (Docker, Kubernetes, Terraform)
  • Experience with CI/CD pipelines, version control, and automated testing frameworks
  • Experience with observability tools (e.g., Prometheus, Grafana, ELK, Datadog) and log/metric analysis for diagnosing issues
  • Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation
  • Demonstrated ability to collaborate with cross-functional teams and communicate clearly in high-impact situations
  • A problem-solver who approaches reliability as a shared responsibility across engineering

Nice to Have

  • Experience writing or maintaining end-to-end or integration tests for distributed systems
  • Background in performance testing, capacity planning, or chaos engineering
  • Contributions to internal developer tooling or reliability-focused frameworks
  • Exposure to security, compliance, or change management processes in production environments
  • Relevant certifications
  • Familiarity with AI-augmented development tools (Claude, Codex) as part of a modern engineering workflow

What You'll Do.

Build the tools, automation, and visibility that enable our teams to deliver resilient software at scale

Evolve our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices

Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements

Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services

Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes

Clarify what 'reliable' means: Help define initial SLIs and SLOs for a few core user flows

Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents

Reduce manual effort: Automate routine checks and monitoring tasks

Contribute to system observability, i.e. implementing

and dashboards for better insight and faster recovery

and monitoring solutions to support high service availability

Drive operational excellence through proactive incident prevention, blameless postmortems, and capacity planning

Participate in on-call rotations to support critical services and ensure rapid response to incidents

How You'll Work.

Team & Collaboration

work closely with application engineers, DevOps, and QA teams; Assess and improve visibility: Work with engineering teams; Partner with application and quality engineering teams; Demonstrated ability to collaborate with cross-functional teams; approaches reliability as a shared responsibility across engineering

Communication Scope

communicate clearly in high-impact situations

Full Job Description

Description

Playon is looking for an experienced Senior Site Reliability Engineer to help us strengthen the reliability, performance, and scalability of our systems. This role sits at the intersection of software engineering and operations, focused on building the tools, automation, and visibility that enable our teams to deliver resilient software at scale.

You’ll work closely with application engineers, DevOps, and QA teams to evolve our infrastructure, CI/CD pipelines, observability frameworks, and reliability practices. This is a hands-on engineering role with a strong emphasis on automation, performance analysis, and continuous improvement.

The Outcomes You’ll Deliver:

In the first few months, you'll focus on building a clear understanding of our systems and establishing the foundation for stronger observability across our platforms. As you settle in, your scope will grow to include broader reliability and performance initiatives.

  • Assess and improve visibility: Work with engineering teams to review our current dashboards, metrics, and logs, identify the biggest gaps, and make targeted improvements that help us better understand system health.
  • Tighten monitoring and alerting: Refine alerts and dashboards for the most critical services so we can catch issues earlier and respond faster.
  • Build observability into delivery: Add instrumentation and telemetry into existing build and deploy processes to make reliability checks part of our normal release workflow.
  • Clarify what "reliable" means: Help define initial SLIs and SLOs for a few core user flows, aligning the team on what good performance and availability look like.
  • Streamline incident response: Partner with the Event Commander/on-call rotation to improve how we communicate, coordinate, and follow up during incidents.
  • Reduce manual effort: Automate routine checks and monitoring tasks to free up engineers for more impactful work.

Over time, you'll take on a larger role shaping how we measur


