RBC

Financial Services

Senior AI/ML Engineer - Site Reliability Engineering

Toronto, Ontario, Canada · FULL TIME

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior AI/ML Engineer - Site Reliability Engineering at RBC. Skills: Agentic AI platform development, Software reliability, Resiliency, Intelligent automation systems, Machine Learning. Design and implement end-to-end Agentic AI solutions. Develop intelligent automation frameworks”

What You'll Achieve.

Autonomously prevent incidents; Accelerate response times; Transform how we maintain resilience across enterprise systems; Measurably reduce toil; Reduce MTTR (Mean Time to Resolve), MTTD (Mean Time to Detect), and MTTI (Mean Time to Identify); Improve system reliability

Industry & Context.

Financial Services

Problems you'll solve

Autonomously detect anomalies; Identify root causes; Resolve incidents with minimal human intervention; Continuously improve response strategies; Reduce toil

What They're Looking For.

Must Have

ML engineering background with hands-on experience designing, training, and deploying machine learning models in production environments

Proven expertise in Agentic AI frameworks and tools (LangChain, LangGraph, AutoGen, CrewAI, or similar) and building autonomous, multi-agent systems

Deep understanding of Model Context Protocol (MCP) for enabling AI agents to interact with external systems and data sources

Experience building AI agents with tool-calling capabilities, memory management, and reasoning chains

Proficiency in Python and experience with ML libraries (scikit-learn, TensorFlow, PyTorch, or similar)

Working knowledge of containerization (Docker), orchestration (Kubernetes/OpenShift), and infrastructure-as-code principles (Ansible, Terraform)

Demonstrated ability to translate complex technical concepts into business value and collaborate effectively with cross-functional teams

Nice to Have

Prior experience in Site Reliability Engineering, DevOps, or infrastructure monitoring roles

Familiarity with observability tools (Prometheus, Grafana, ELK stack) and incident management platforms (PagerDuty, ServiceNow)

Experience with LLMs, prompt engineering, and retrieval-augmented generation (RAG) architectures

Background in financial services or other highly regulated industries with strict reliability requirements

What You'll Do.

Design and implement end-to-end Agentic AI solutions

Develop intelligent automation frameworks

Build ML-powered monitoring and alerting systems

Architect scalable, production-grade solutions on OpenShift and Kubernetes

Implement infrastructure-as-code using Ansible and containerization (Docker)

Partner with incident management and operations teams to translate operational pain points into AI-driven automation opportunities

Establish and track KPIs focused on reducing MTTR, MTTD, and MTTI while improving system reliability

Lead technical design discussions and contribute to architectural decisions

How You'll Work.

Team & Collaboration

Collaborate effectively with cross-functional teams; Partner with incident management and operations teams

Full Job Description

**_Job Description_**

**WHAT IS THE OPPORTUNITY?**

Join RBC's Site Reliability Engineering team as a founding member building the bank's **first-ever Agentic AI platform for Software reliability** and **resiliency**. You'll pioneer intelligent automation systems that autonomously prevent incidents, accelerate response times, and transform how we maintain resilience across enterprise systems.

This is a rare opportunity to shape the future of AI-driven reliability at scale. Your innovations will protect millions of daily customer transactions and sign-ins. With a clear technical leadership trajectory, you'll architect cutting-edge solutions at the intersection of AI and infrastructure, setting the standard for autonomous operations in financial services.

**WHAT WILL YOU DO?**

* **Design and implement end-to-end Agentic AI solutions** that autonomously detect anomalies, identify root causes, and resolve incidents with minimal human intervention
* **Develop intelligent automation frameworks** using **LangChain** and **LangGraph** to create context-aware agents that learn from incident patterns and continuously improve response strategies
* Build ML-powered monitoring and alerting systems that distinguish signal from noise, dramatically reducing false positives and improving MTTD (Mean Time to Detect) and MTTI (Mean Time to Identify)
* Architect scalable, production-grade solutions on OpenShift and Kubernetes that process real-time system metrics and telemetry data at enterprise scale
* Implement infrastructure-as-code using Ansible and containerization (Docker) to ensure reproducibility, consistency, and rapid deployment across environments
* Partner with incident management and operations teams to translate operational pain points into AI-driven automation opportunities that measurably reduce toil
* Establish and track KPIs focused on reducing MTTR (Mean Time to Resolve), MTTD, and MTTI while improving system reliability
* Lead technical design discussions and contribut

SKILL SIGNAL 44 detected · ranked by frequency

Designing machine learning models ×3

Training machine learning models ×3

Deploying machine learning models ×3

Building autonomous, multi-agent systems ×3

Model Context Protocol (MCP) ×3

Tool-calling capabilities ×3

Memory management ×3

Reasoning chains ×3

Containerization ×3

Orchestration ×3

Infrastructure-as-code ×3

Agentic AI platform development ×2

Software reliability ×2

Resiliency ×2

Intelligent automation systems ×2

Machine Learning ×2

LangChain ×2

LangGraph ×2

scikit-learn ×2

TensorFlow ×2

PyTorch ×2

Docker ×2

Kubernetes ×2

OpenShift ×2

Ansible ×2

Terraform ×2

Agentic AI

Python

LLMs

RAG

Translate complex technical concepts into business value

Site Reliability Engineering

BEHAVIOURAL

Collaborate effectively with cross-functional teams

Role Details

Seniority: Senior

Experience: 5–10 yrs

Level: Senior

Type: FULL TIME

AI-Extracted Insights

Domain Areas

software-reliability, resiliency, financial-services, highly-regulated-industries

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
