Universal Music Group

Music

Service Reliability Engineer

$1–1k Nashville, Tennessee, United States

The Brief

“Service Reliability Engineer at Universal Music Group. Skills: Service Reliability Engineering, Cloud Platforms (AWS preferred), Automation, Monitoring and Observability, Infrastructure as Code, Containerization. Design, build, and maintain the availability, scalability, and performance of critical services. Develop and maintain robust monitoring, alerting, and observability systems.”

What You'll Achieve.

Improve system reliability; Automate complex processes; Reduce manual toil; Ensure services are always on; Rapid issue detection and resolution; Service delivery improvement; Improve deployment speed and reliability; Implement lasting solutions

Industry & Context.

Music

Problems you'll solve

Proven analytical and problem-solving abilities

Eligibility Requirements

On-call rotation

What They're Looking For.

Must Have

Background in systems administration (Linux/Windows) in a large-scale environment; Proficiency in at least one programming language (e.g., Python, Go, Java); Hands-on experience with a major cloud platform (AWS, GCP, or Azure); Solid understanding of networking; Solid understanding of containers (Docker, Kubernetes); Solid understanding of Infrastructure as Code (e.g., Terraform, Ansible); Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace); Proven analytical and problem-solving abilities with experience in a high-pressure environment; Excellent communication skills

Nice to Have

Bachelor's degree in an IT-related field; Experience managing large-scale, distributed systems for a global organization; Familiarity with IT governance standards like ITIL; Direct experience with ServiceNow for IT service management; Knowledge of chaos engineering; Knowledge of resilience testing; Knowledge of advanced capacity planning

What You'll Do.

Design, build, and maintain the availability, scalability, and performance of critical services

Develop and maintain robust monitoring, alerting, and observability systems

Monitor infrastructure capacity and performance

Drive the automation of repetitive operational tasks

Create and maintain scripts and custom code

Support and optimize CI/CD pipelines

Participate in an on-call rotation to troubleshoot and mitigate production incidents

Lead post-incident reviews and root cause analyses

Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle

How You'll Work.

Team & Collaboration

Essential partner to development, infrastructure, and security teams; Partner with engineering and IT stakeholders; Foster a collaborative team environment

Communication Scope

Excellent communication skills

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
