Us
AI
Director, AI Operations
“Director, AI Operations at Us. Skills: AI Operations, SRE, Observability, Automation. Lead and evolve operations model. Enforce operational readiness gates”
What You'll Achieve.
increase first-line resolution; eliminate manual toil; free engineering time for higher-value work; operational readiness gates; automation coverage; toil budget compliance; materially reduces operational toil; accelerates patient impact
Industry & Context.
resolve root causes; translate findings into preventative engineering
What They're Looking For.
Must Have
BSc/MSc/PhD in Computer Science or a related analytical field
Demonstrable recent experience building and leading large-scale SRE or platform operations functions hands-on
Deep technical knowledge of platforms such as Datadog, New Relic, Grafana, or Splunk, covering dashboard development, alerting strategies, and telemetry pipeline architecture
Solid understanding of OpenTelemetry, distributed tracing, and structured logging
Consistent track record of designing and implementing automation that materially reduces operational toil
Grasp of Azure and/or AWS, including container orchestration, serverless architectures, and managed services
Proven ability to run post-mortem processes and translate findings into preventative engineering, alongside experience defining platform handover criteria
Capability to provide precise technical direction on system instrumentation, alert triggers, and automation interventions while developing high-performing teams
Nice to Have
Application of AI/ML to operational challenges
Experience operating platforms serving AI/ML workloads such as LLM inference, model serving, and data pipelines
Familiarity with regulated pharmaceutical environments or the AstraZeneca technology estate
ITIL, SRE, or operational excellence certifications
What You'll Do.
Lead and evolve operations model
Enforce operational readiness gates
Establish instrumentation and alerting standards
Guide architecture of AI-augmented tooling
Mandate incident yields runbook or automation
Define acceptable toil thresholds
Build talent pipeline
Partner with product engineering
Communicate platform health
How You'll Work.
Team & Collaboration
Partner with product engineering to co-own post-mortems; Negotiate handovers with product engineering; Embed operational requirements into architecture decisions; Communicate platform health, risks, and investments clearly to senior leadership
Communication Scope
Communicate platform health, risks, and investments clearly to senior leadership using data-driven narratives
Full Job Description
## **Introduction to the Role**

Transform AI into a true force multiplier for enterprise operations! This role advises how machine learning and artificial intelligence platforms are run, automated, and improved across Azure and AWS to support critical scientific and business outcomes. The focus is on defining system architecture, monitoring, and automation, ensuring operations itself becomes a benchmark for AI adoption.

Managing a layered operations framework that spans L1 runbook operators, L2 site reliability engineers (SREs), and an L3 product engineering interface, this position establishes continuous improvement. The goal is to eliminate manual toil, increase first-line resolution, and free engineering time for higher-value work. Every incident must become a detailed procedure, an automated process, or a permanent fix. This is not standard support; it is a leadership role for an engineering-minded operator setting precise technical direction!

## **Accountabilities**

* **Operational Model Ownership**: Lead and evolve the three-tier operations model for the AI/ML platform estate. Enforce operational readiness gates and run monthly reviews using clear metrics: L1 resolution rate, repeat incident rate, automation coverage, and toil budget compliance.
* **Technical Direction**: Establish instrumentation and alerting standards for a centralised observability layer (Datadog, New Relic, Grafana, Splunk). Guide the architecture of AI-augmented operations tooling, including conversational runbooks, and direct L2 SRE patch contributions to resolve root causes.
* **Automation Strategy**: Mandate that every incident yields a runbook, an automation, or a patch. Define acceptable toil thresholds and prioritise automation investments by incident frequency, resolution time, and blast radius.
* **Team Leadership and Development**: Direct the L2 SRE team aligned to cloud domains. Build a robust talent pipeline from L1 to L2, fostering a culture where SREs operate as engineers de
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.