Universal Music Group
Music
Service Reliability Engineer
“Service Reliability Engineer at Universal Music Group. Skills: Service Reliability Engineering, Cloud Platforms (AWS preferred), Automation, Monitoring and Observability, Infrastructure as Code, Containerization. Design, build, and maintain the availability, scalability, and performance of critical services. Develop and maintain robust monitoring, alerting, and observability systems”
What You'll Achieve.
Improve system reliability; Automate complex processes; Reduce manual toil; Ensure services are always on; Rapid issue detection and resolution; Service delivery improvement; Improve deployment speed and reliability; Implement lasting solutions
Industry & Context.
Proven analytical and problem-solving abilities
On-call rotation
What They're Looking For.
Must Have
Background in systems administration (Linux/Windows) in a large-scale environment; Proficiency in at least one programming language (e.g., Python, Go, Java); Hands-on experience with a major cloud platform (AWS, GCP, or Azure); Solid understanding of networking; Solid understanding of containers (Docker, Kubernetes); Solid understanding of Infrastructure as Code (e.g., Terraform, Ansible); Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace); Proven analytical and problem-solving abilities with experience in a high-pressure environment; Excellent communication skills
Nice to Have
Bachelor's degree in an IT-related field; Experience managing large-scale, distributed systems for a global organization; Familiarity with IT governance standards like ITIL; Direct experience with ServiceNow for IT service management; Knowledge of chaos engineering; Knowledge of resilience testing; Knowledge of advanced capacity planning
What You'll Do.
Design, build, and maintain the availability, scalability, and performance of critical services
Develop and maintain robust monitoring, alerting, and observability systems
Monitor infrastructure capacity and performance
Drive the automation of repetitive operational tasks
Create and maintain scripts and custom code
Support and optimize CI/CD pipelines
Participate in an on-call rotation to troubleshoot and mitigate production incidents
Lead post-incident reviews and root cause analyses
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle
How You'll Work.
Team & Collaboration
Essential partner to development, infrastructure, and security teams; Partner with engineering and IT stakeholders; Foster a collaborative team environment
Communication Scope
Excellent communication skills
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.