Company

Technology

Site Reliability Engineer - AI Agents

€85–130k (AI estimate) · Bulgaria · Full time · Remote friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Site Reliability Engineer - AI Agents. Skills: Site Reliability Engineering, cloud-native infrastructure, AI/ML infrastructure, agentic systems. Design and operate resilient infrastructure systems.”

Industry & Context.

Technology

Eligibility Requirements

On-call rotations

What They're Looking For.

Must Have

5+ years SRE experience, 5+ years Platform Engineering, 5+ years Infrastructure Engineering, Hands-on ML systems experience, Hands-on model serving experience, Hands-on MLOps pipelines experience, Experience building developer platforms, Experience building internal tools, Experience building APIs, Experience building SDKs, Proficiency with Terraform, Advanced Kubernetes experience, Solid AWS infrastructure experience, Python programming skills, Bash/shell proficiency, Experience designing observability systems, Experience designing logging systems, Experience designing monitoring systems, Experience designing alerting systems, Proven incident response experience, Proven on-call rotation experience, Proven production reliability ownership, Cross-functional collaboration skills

Nice to Have

AI/ML infrastructure exposure, Agent-based systems exposure, Familiarity with AI/agent systems, Familiarity with orchestration frameworks, Familiarity with LLM-based applications

What You'll Do.

Design resilient infrastructure systems

Operate resilient infrastructure systems

Scale resilient infrastructure systems

Design cloud-native infrastructure

Build cloud-native infrastructure

Operate cloud-native infrastructure

Ensure reliability of systems

Ensure observability of systems

Ensure performance of systems

Develop platform services

Develop self-service tooling

Manage compute layers

Optimize compute layers

Manage orchestration layers

Optimize orchestration layers

Manage serving layers

Optimize serving layers

Build CI/CD pipelines

Maintain CI/CD pipelines

Implement Infrastructure as Code

Provision AWS infrastructure

Manage AWS infrastructure

Design monitoring systems

Design alerting systems

Design observability systems

Define reliability patterns

Enforce reliability patterns

Define failure recovery mechanisms

Enforce failure recovery mechanisms

Transform prototypes into production systems

Manage Kubernetes environments

Ensure scalable workload deployment

Ensure efficient workload deployment

Implement security best practices

Implement access controls

Document system architecture

Document operational procedures

How You'll Work.

Team & Collaboration

AI teams; Data Engineering teams; Product teams; Engineering teams

Full Job Description

## Accountabilities

You will be responsible for designing, operating, and scaling resilient infrastructure systems that support AI agent workloads in production, ensuring reliability, scalability, and developer usability across the platform.

  • Design, build, and operate cloud-native infrastructure supporting AI agent execution, orchestration, and model serving at scale
  • Ensure reliability, observability, and performance of distributed agentic systems across internal and external-facing products
  • Develop platform services, APIs, SDKs, and self-service tooling to enable teams to efficiently consume AI infrastructure capabilities
  • Manage and optimize compute, orchestration, and serving layers for AI and ML workloads in production environments
  • Build and maintain CI/CD pipelines to enable safe, fast, and reliable deployment of AI services and agent workflows
  • Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based infrastructure
  • Design monitoring, alerting, and observability systems tailored to AI/ML and agent-based workloads
  • Define and enforce reliability patterns, guardrails, and failure recovery mechanisms for LLM and agentic systems
  • Collaborate with AI, Data Engineering, and Product teams to transform experimental prototypes into production-ready systems
  • Manage Kubernetes-based container orchestration environments, ensuring scalable and efficient workload deployment
  • Implement security best practices and access controls across infrastructure and platform services
  • Document system architecture, operational procedures, and runbooks to support team knowledge sharing and reliability

## Requirements

The ideal candidate is a strong platform-minded engineer with deep SRE experience, a solid understanding of cloud-native systems, and exposure to AI/ML infrastructure or agent-based systems.

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar production-focused roles
  • Hands-on experie

How to Apply on Lever

  • Lever uses a streamlined one-page form — apply in under 5 minutes.
  • LinkedIn import works well; review parsed data before submitting.
  • The cover letter field is optional but visible to reviewers — use it to differentiate.
  • Referral codes from employees can significantly boost visibility of your application.
