Company
Technology
Senior MLOps Engineer - SRE | DevOps
“Senior MLOps Engineer - SRE | DevOps. Skills: MLOps, SRE, Kubernetes, Infrastructure-as-Code. Design ML infrastructure. Build ML infrastructure”
Industry & Context.
Troubleshooting
What They're Looking For.
Must Have
5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps
Hands-on experience deploying and managing ML/AI workloads
Deep SRE expertise
Advanced experience with Terraform
GitOps experience
Deep expertise in Kubernetes
AWS knowledge
Experience building CI/CD pipelines
Automation mindset
Nice to Have
Experience with GPU/accelerator scheduling
Experience operating LLM inference systems
Experience with ML orchestration tools
Familiarity with ML observability tools
Background in FinOps
Experience with multi-tenant infrastructure
Exposure to feature stores
Experience scaling ML platforms
What You'll Do.
Design ML infrastructure
Build ML infrastructure
Operate ML infrastructure
Support real-time workloads
Support batch workloads
Own ML deployment lifecycle
Manage model registry
Manage rollout strategies
Manage safe rollback mechanisms
Operate LLM workloads
Manage inference providers
Manage fallback strategies
Maintain ML pipelines
Implement Infrastructure-as-Code
Ensure multi-account cloud architectures
Manage GitOps workflows
Ensure reliable deployments
Ensure consistent deployments
Operate Kubernetes infrastructure
Manage GPU scheduling
Manage workload isolation
Manage cost-aware scaling
Define SRE best practices
Enforce SRE best practices
Manage incident response
Manage performance monitoring
Drive cost optimization
Optimize ML workloads
Improve infrastructure utilization
Use agentic coding tools
How You'll Work.
Team & Collaboration
Cross-functional teams; Global time zones
Communication Scope
Articulate technical decisions; Articulate trade-offs; Articulate incident analysis
Process & Methodology
Roadmap planning
Full Job Description
## Accountabilities

- Design, build, and operate scalable ML and inference infrastructure supporting real-time and batch workloads across multiple tenants.
- Own the end-to-end ML deployment lifecycle, including model registry, versioning, rollout strategies (canary, A/B, shadow), and safe rollback mechanisms.
- Operate and optimize production-grade AI and LLM workloads, managing inference providers, throttling, quotas, and fallback strategies under load.
- Develop and maintain reproducible ML pipelines for training, evaluation, and deployment with full lineage and automation.
- Implement Infrastructure-as-Code practices using Terraform, ensuring scalable multi-account cloud architectures.
- Manage GitOps workflows using tools such as ArgoCD to ensure reliable and consistent deployments across environments.
- Operate Kubernetes-based infrastructure (AWS EKS), including GPU scheduling, workload isolation, and cost-aware scaling strategies.
- Define and enforce SRE best practices, including SLOs, observability, incident response, and performance monitoring for ML systems.
- Drive cost optimization initiatives across ML workloads, including resource right-sizing and efficient infrastructure utilization.
- Improve automation across the ML lifecycle using modern engineering and agentic coding tools.

## Requirements

- 5+ years of experience in Platform Engineering, SRE, DevOps, or MLOps roles, operating production systems at scale.
- Strong hands-on experience deploying and managing ML/AI workloads in production environments.
- Deep SRE expertise, including SLO definition, incident response, postmortems, and reliability engineering practices.
- Advanced experience with Infrastructure-as-Code using Terraform in complex, multi-account environments.
- Strong GitOps experience with declarative infrastructure and deployment workflows.
- Deep expertise in Kubernetes, including production operations and failure-mode troubleshooting.
- Strong AWS knowledge, including networking, IAM, compute, storage, and distribut
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.