Company

Technology

Site Reliability Engineer - AI Agents

CA$135–195k (AI estimate) · Canada · Full Time · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Site Reliability Engineer - AI Agents. Skills: Site Reliability Engineering, Cloud infrastructure, AI agent systems, Kubernetes. Design infrastructure layer. Operate infrastructure layer”

Industry & Context.

Technology

Eligibility Requirements

On-call operational ownership

What They're Looking For.

Must Have

5+ years of experience in SRE, Hands-on experience supporting ML infrastructure, Experience building developer platforms, Proficiency with Infrastructure as Code tools, Experience with Kubernetes, Solid cloud infrastructure experience, Scripting and programming skills, Experience designing and operating observability systems, Experience with incident response processes, On-call operational ownership

Nice to Have

Familiarity with AI agent systems, Familiarity with LLM-based applications, Familiarity with orchestration frameworks

What You'll Do.

Design infrastructure layer

Operate infrastructure layer

Scale infrastructure layer

Design cloud infrastructure

Build cloud infrastructure

Operate cloud infrastructure

Develop platform services

Develop self-service tooling

Manage compute infrastructure

Manage orchestration infrastructure

Manage deployment infrastructure

Build CI/CD pipelines

Maintain CI/CD pipelines

Implement Infrastructure as Code

Provision AWS environments

Manage AWS environments

Design monitoring systems

Operate monitoring systems

Design logging systems

Operate logging systems

Design alerting systems

Operate alerting systems

Design incident response systems

Operate incident response systems

Define reliability patterns

Define failure recovery mechanisms

Collaborate with AI teams

Collaborate with Data Engineering teams

Manage Kubernetes environments

Implement security controls

Implement access management

Implement infrastructure best practices

Document architecture

Document operational procedures

How You'll Work.

Team & Collaboration

AI teams; Data Engineering teams; Engineering teams

Full Job Description

## Accountabilities

You will be responsible for designing, operating, and scaling the infrastructure layer that powers AI agent systems in production, ensuring reliability, observability, and developer usability across the platform.

Design, build, and operate scalable cloud infrastructure supporting AI agent execution, orchestration, and model serving in production
Ensure reliability, performance, and observability of distributed agentic systems across internal and external products
Develop platform services, APIs, SDKs, and self-service tooling to enable efficient consumption of AI infrastructure
Manage compute, orchestration, and deployment infrastructure supporting AI and ML workloads at scale
Build and maintain CI/CD pipelines for reliable, automated deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS environments
Design and operate monitoring, logging, alerting, and incident response systems tailored to AI/ML workloads
Define reliability patterns, guardrails, and failure recovery mechanisms for LLM and agent-based systems
Collaborate with AI and Data Engineering teams to evolve experimental prototypes into production-grade systems
Manage Kubernetes-based container orchestration environments for scalable deployment of services
Implement security controls, access management, and infrastructure best practices across systems
Document architecture, runbooks, and operational procedures to support platform adoption and reliability

## Requirements

The ideal candidate is a strong SRE or platform engineer with experience in cloud-native systems, production infrastructure, and exposure to ML or AI-driven workloads.

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles in production environments
Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production
Experience building developer pla

SKILL SIGNAL 25 detected · ranked by frequency

Cloud infrastructure ×5

Kubernetes ×4

Containerized systems ×3

CI/CD pipelines ×3

Monitoring systems ×3

Logging systems ×3

Alerting systems ×3

Site Reliability Engineering ×2

AI agent systems ×2

Terraform ×2

Docker ×2

AWS

Python

Bash

Shell

Platform engineering

Developer experience design

Infrastructure management

Orchestration

Model serving

Incident response

Failure recovery

Security controls

Access management

Best practices

Role Details

Seniority Senior

Experience 5–10 yrs

Level Senior

Work Mode Remote

Type FULL TIME

Category software

Salary Band 100k-150k

AI-Extracted Insights

Domain Areas

ai-agent-systems, llm-based-systems, cloud-native-systems, ml-infrastructure, model-serving, mlops-pipelines, platform-engineering-principles, distributed-agentic-systems

How to Apply on Lever

Lever uses a streamlined one-page form — apply in under 5 minutes.
LinkedIn import works well; review parsed data before submitting.
The cover letter field is optional but visible to reviewers — use it to differentiate.
Referral codes from employees can significantly boost visibility of your application.
