Company

Technology

Senior MLOps Engineer - SRE | DevOps

$200–350k (AI estimate) · Brazil · Full time · Remote friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior MLOps Engineer - SRE | DevOps. Skills: MLOps, SRE, Kubernetes, Infrastructure-as-Code. Design ML infrastructure. Build ML infrastructure”

Industry & Context.

Technology

Problems you'll solve

Troubleshooting

What They're Looking For.

Must Have

5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps; Hands-on experience deploying and managing ML/AI workloads; Deep SRE expertise; Advanced experience with Terraform; GitOps experience; Deep expertise in Kubernetes; AWS knowledge; Experience building CI/CD pipelines; Automation mindset

Nice to Have

Experience with GPU/accelerator scheduling; Experience operating LLM inference systems; Experience with ML orchestration tools; Familiarity with ML observability tools; Background in FinOps; Experience with multi-tenant infrastructure; Exposure to feature stores; Experience scaling ML platforms

What You'll Do.

Design ML infrastructure

Build ML infrastructure

Operate ML infrastructure

Support real-time workloads

Support batch workloads

Own ML deployment lifecycle

Manage model registry

Manage rollout strategies

Manage safe rollback mechanisms

Operate LLM workloads

Manage inference providers

Manage fallback strategies

Maintain ML pipelines

Implement Infrastructure-as-Code

Ensure multi-account cloud architectures

Manage GitOps workflows

Ensure reliable deployments

Ensure consistent deployments

Operate Kubernetes infrastructure

Manage GPU scheduling

Manage workload isolation

Manage cost-aware scaling

Define SRE best practices

Enforce SRE best practices

Manage incident response

Manage performance monitoring

Drive cost optimization

Optimize ML workloads

Improve infrastructure utilization

Use agentic coding tools

How You'll Work.

Team & Collaboration

Cross-functional teams; Global time zones

Communication Scope

Articulate technical decisions; Articulate trade-offs; Articulate incident analysis

Process & Methodology

Roadmap planning

Full Job Description

## Accountabilities

Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies (canary, A/B, shadow), and safe rollback mechanisms.
Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
Improve automation across the ML lifecycle using modern engineering and agentic coding tools.

## Requirements

5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
Strong hands-on experience deploying and managing ML/AI workloads in production environments.
Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
Strong GitOps experience with declarative infrastructure and deployment workflows.
Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
Strong AWS knowledge, including networking, IAM, compute, storage, and distribut

SKILL SIGNAL 70 detected · ranked by frequency

Kubernetes ×4

MLOps ×3

SRE ×3

Infrastructure-as-Code ×3

Model registry ×3

Versioning ×3

Rollout strategies ×3

Canary deployments ×3

A/B testing ×3

Shadow deployments ×3

Safe rollback ×3

Inference providers ×3

Throttling ×3

Quotas ×3

Fallback strategies ×3

Training ×3

Evaluation ×3

Lineage ×3

Automation ×3

Multi-account architectures ×3

Deployment workflows ×3

Production operations ×3

Failure-mode troubleshooting ×3

Networking ×3

IAM ×3

Compute ×3

Storage ×3

Distributed architectures ×3

Cost attribution ×3

Node lifecycle management ×3

Caching ×3

Guardrails ×3

BEHAVIOURAL

Communication skills

Role Details

Seniority: Senior

Experience: 5–10 yrs

Level: Senior

Work Mode: Remote

Type: Full time

Category: Software

Salary Band: 200k+

AI-Extracted Insights

Domain Areas

ml-infrastructure, inference-infrastructure, ai-workloads, llm-workloads, cloud-architectures, multi-tenant-infrastructure, distributed-systems

How to Apply on Lever

Lever uses a streamlined one-page form — apply in under 5 minutes.
LinkedIn import works well; review parsed data before submitting.
The cover letter field is optional but visible to reviewers — use it to differentiate.
Referral codes from employees can significantly boost visibility of your application.
