Company
Technology
Site Reliability Engineer - AI Agents
Neural analysis suggests this role is optimal for Senior candidates.
“Site Reliability Engineer - AI Agents. Skills: Site Reliability Engineering, Cloud infrastructure, AI agent systems, Kubernetes. Design infrastructure layer. Operate infrastructure layer”
Industry & Context.
On-call operational ownership
What They're Looking For.
Must Have
5+ years of experience in SRE
Hands-on experience supporting ML infrastructure
Experience building developer platforms
Proficiency with Infrastructure as Code tools
Experience with Kubernetes
Solid cloud infrastructure experience
Scripting and programming skills
Experience designing and operating observability systems
Experience with incident response processes
On-call operational ownership
Nice to Have
Familiarity with AI agent systems
Familiarity with LLM-based applications
Familiarity with orchestration frameworks
What You'll Do.
Design infrastructure layer
Operate infrastructure layer
Scale infrastructure layer
Design cloud infrastructure
Build cloud infrastructure
Operate cloud infrastructure
Develop platform services
Develop self-service tooling
Manage compute infrastructure
Manage orchestration infrastructure
Manage deployment infrastructure
Build CI/CD pipelines
Maintain CI/CD pipelines
Implement Infrastructure as Code
Provision AWS environments
Manage AWS environments
Design monitoring systems
Operate monitoring systems
Design logging systems
Operate logging systems
Design alerting systems
Operate alerting systems
Design incident response systems
Operate incident response systems
Define reliability patterns
Define failure recovery mechanisms
Collaborate with AI teams
Collaborate with Data Engineering teams
Manage Kubernetes environments
Implement security controls
Implement access management
Implement infrastructure best practices
Document architecture
Document operational procedures
How You'll Work.
Team & Collaboration
AI teams; Data Engineering teams; Engineering teams
Full Job Description
## Accountabilities

You will be responsible for designing, operating, and scaling the infrastructure layer that powers AI agent systems in production, ensuring reliability, observability, and developer usability across the platform.

- Design, build, and operate scalable cloud infrastructure supporting AI agent execution, orchestration, and model serving in production
- Ensure reliability, performance, and observability of distributed agentic systems across internal and external products
- Develop platform services, APIs, SDKs, and self-service tooling to enable efficient consumption of AI infrastructure
- Manage compute, orchestration, and deployment infrastructure supporting AI and ML workloads at scale
- Build and maintain CI/CD pipelines for reliable, automated deployment of AI services and agent workflows
- Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS environments
- Design and operate monitoring, logging, alerting, and incident response systems tailored to AI/ML workloads
- Define reliability patterns, guardrails, and failure recovery mechanisms for LLM and agent-based systems
- Collaborate with AI and Data Engineering teams to evolve experimental prototypes into production-grade systems
- Manage Kubernetes-based container orchestration environments for scalable deployment of services
- Implement security controls, access management, and infrastructure best practices across systems
- Document architecture, runbooks, and operational procedures to support platform adoption and reliability

## Requirements

The ideal candidate is a strong SRE or platform engineer with experience in cloud-native systems, production infrastructure, and exposure to ML or AI-driven workloads.

- 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles in production environments
- Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production
- Experience building developer pla
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.