Monotype
Fonts
Manager, Site Reliability Engineering
Neural analysis suggests this role is
optimal for Senior candidates.
“Manager, Site Reliability Engineering at Monotype. Skills: Site Reliability Engineering (SRE), Incident management, Automation, Observability, Team leadership. Own end-to-end reliability of production systems, ensuring uptime within defined SLAs. Lead and govern a 24x7x365 incident management team, ensuring quick response and resolution.”
What You'll Achieve.
Ensuring high system availability; Fast incident response; Continuous improvement of platform reliability; Maintaining uptime; Reducing incidents; Improving response times; Building a more proactive and self-sufficient SRE function; Ensuring uptime within defined SLAs; Quick response and resolution; Prevent repeat issues; Reduce alert noise; Improve signal-to-noise ratio; Reduce production issues caused by releases; Ensure visibility, stability, and cost awareness for AI-driven systems; Build team maturity; Reduce dependency on senior members; Develop ownership and accountability within the team; Optimize cloud usage and reduce unnecessary spend; Balance reliability improvements with cost efficiency
Industry & Context.
Structured problem-solving; Analytical and problem-solving skills for handling complex production issues
What They're Looking For.
Must Have
10+ years of experience in SRE with proven experience managing production systems and 24x7 operations teams
Hands-on experience with AWS and Kubernetes (EKS preferred)
Understanding of incident management, RCA, and production support models
Experience with monitoring/observability tools (Datadog, CloudWatch, ELK, Prometheus, Grafana)
Experience driving automation and reducing operational toil
Understanding of microservices-based architectures
Knowledge of release processes and production readiness practices
Understanding of SLAs, SLIs, SLOs, and reliability metrics
Good understanding of cloud cost optimization (FinOps basics)
Exposure to or experience supporting AI/ML workloads
Leadership skills with experience managing and mentoring teams
Ability to stay calm and lead during high-severity incidents
Communication and stakeholder management skills
Structured problem-solving and decision-making ability
Analytical and problem-solving skills for handling complex production issues
Understanding of security best practices across infrastructure and applications
Ability to standardize processes and improve operational consistency
Nice to Have
Certification in relevant technologies (e.g., AWS, Kubernetes) is a plus
Strategic mindset with ability to align reliability initiatives with business goals
What You'll Do.
Own end-to-end reliability of production systems, ensuring uptime within defined SLAs
Lead and govern a 24x7x365 incident management team, ensuring quick response and resolution
Act as escalation point during critical incidents and drive coordination across teams
Ensure proper incident tracking and status page updates
Drive a blameless RCA culture across the team
Ensure all customer-impacting incidents are analysed with clear root causes
Track and drive closure of RCA action items to prevent repeat issues
Identify recurring patterns and push for permanent fixes
Own and improve observability using tools like Datadog
Guide teams on effective logging and monitoring practices
Reduce alert noise and improve signal-to-noise ratio
Drive proactive monitoring and early detection of issues
Drive automation to reduce manual effort and operational toil
Identify repetitive issues and build solutions to eliminate them
Ensure runbooks and playbooks are created and followed for recurring incidents
Work with Engineering & Platform teams to improve release quality and stability
Ensure proper readiness checks, including monitoring, before production deployments
Reduce production issues caused by releases
Support reliability and monitoring of AI/ML workloads in production and experimentation environments
Ensure visibility, stability, and cost awareness for AI-driven systems
Bring structure and best practices as AI adoption grows
Lead and mentor a team of ~14 engineers across operations and SRE excellence
Build team maturity and reduce dependency on senior members
Develop ownership and accountability within the team
Partner with teams to optimize cloud usage and reduce unnecessary spend
Balance reliability improvements with cost efficiency
Ensure security best practices are followed across infrastructure and applications in collaboration with security teams
How You'll Work.
Team & Collaboration
Lead and mentor a team of ~14 engineers across operations and SRE excellence; Work closely with Engineering, Product and Platform teams; Ensure smooth coordination during incidents and releases; Communicate effectively with stakeholders during high-severity situations; Collaborate with stakeholders to align reliability and platform strategies with business goals; Ensure security best practices are followed across infrastructure and applications in collaboration with security teams
Communication Scope
Communicate effectively with stakeholders during high-severity situations
Process & Methodology
Manage production systems, Manage 24x7 operations teams, Drive automation, Reduce operational toil, Improve release quality and stability, Ensure production readiness, Build team maturity, Develop ownership and accountability, Optimize cloud usage, Balance reliability improvements with cost efficiency
Full Job Description
Are you our “TYPE”?
Monotype Global
Named "One of the Most Innovative Companies in Design" by Fast Company, Monotype brings brands to life through type and technology that consumers engage with every day. The company's rich legacy includes a library that can be traced back hundreds of years, featuring famed typefaces like Helvetica, Futura, Times New Roman and more. Monotype also provides a first-of-its-kind service that makes fonts more accessible for creative professionals to discover, license, and use in our increasingly digital world. We work with the biggest global brands, and with individual creatives, offering a wide set of solutions that make it easier for them to do what they do best: design beautiful brand experiences.
Monotype Solutions India
Monotype Solutions India is a strategic center of excellence for Monotype and is a certified Great Place to Work® three years in a row. The focus of this fast-growing center spans Product Development, Product Management, Experience Design, User Research, Market Intelligence, Research in areas of Artificial Intelligence and Machine Learning, Innovation, Customer Success, Enterprise Business Solutions, and Sales.
Headquartered in the Boston area of the United States and with offices across 4 continents, Monotype is the world’s leading company in fonts. It’s a trusted partner to the world’s top brands.
We are looking for an experienced and hands-on Site Reliability Engineering (SRE) Manage
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.