Company
Technology
Staff Machine Learning Systems Engineer (MLOps)
“Staff Machine Learning Systems Engineer (MLOps). Skills: MLOps, ML systems engineering, Kubernetes, infrastructure-as-code. Lead the design and evolution of the ML infrastructure platform.”
What You'll Achieve.
Improve performance; Improve cost efficiency; Improve deployment speed
Industry & Context.
Root cause analysis
What They're Looking For.
Must Have
8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles
Hands-on ML/AI systems experience
Kubernetes (preferably EKS) expertise
Proficiency in infrastructure-as-code tools such as Terraform
Solid programming skills in Python
Experience building infrastructure tooling and automation systems
Experience operating LLM or ML inference systems in production
Hands-on experience with observability stacks
Understanding of CI/CD systems
Understanding of GitOps workflows
Understanding of developer platform engineering
Experience designing IAM, OIDC, and secrets management systems
Systems-thinking mindset
Ability to collaborate across engineering, ML, security, and product teams
Nice to Have
Experience in regulated or high-compliance environments (healthcare, fintech, or similar)
What You'll Do.
Lead design of ML infrastructure platform
Lead evolution of ML infrastructure platform
Lead operation of ML infrastructure platform
Support AI workloads across production systems
Ensure scalability across environments
Ensure reliability across environments
Ensure security across environments
Own Kubernetes-based infrastructure
Optimize Kubernetes-based infrastructure
Manage autoscaling for ML systems
Manage workload orchestration for ML systems
Manage cluster lifecycle for ML systems
Build GitOps-based CI/CD pipelines
Maintain GitOps-based CI/CD pipelines
Enable efficient deployment of AI services
Design model serving infrastructure
Implement model serving infrastructure
Design inference infrastructure
Implement inference infrastructure
Support multi-provider integrations
Develop observability systems for AI workloads
Develop tracing systems for AI workloads
Develop monitoring systems for AI workloads
Define SLOs for ML systems
Enforce SLOs for ML systems
Define incident response processes
Enforce incident response processes
Define reliability standards for ML systems
Enforce reliability standards for ML systems
Own infrastructure-as-code
Improve developer velocity
Drive security architecture
Drive IAM architecture
Drive secrets management architecture
Ensure least-privilege access
Ensure data protection standards
Translate research into production-ready systems
Translate prototypes into production-ready systems
Identify platform bottlenecks
Lead initiatives to improve performance
Lead initiatives to improve cost efficiency
Lead initiatives to improve deployment speed
Provide technical leadership
Provide architectural guidance
How You'll Work.
Team & Collaboration
Collaborate with ML teams; Collaborate with product teams; Collaborate with data teams; Collaborate with engineering teams; Collaborate with security teams
Process & Methodology
Roadmap planning
Full Job Description
## Accountabilities

- Lead the design, evolution, and operation of the core ML infrastructure platform supporting AI workloads across production systems, ensuring scalability, reliability, and security across environments
- Own and optimize Kubernetes-based infrastructure (e.g., EKS), including autoscaling, workload orchestration, and cluster lifecycle management for ML and AI systems
- Build and maintain GitOps-based CI/CD pipelines enabling safe, repeatable, and efficient deployment of AI services across environments
- Design and implement model serving and inference infrastructure, including LLM routing, API gateways, and multi-provider integrations
- Develop observability, tracing, and monitoring systems for AI workloads using tools such as OpenTelemetry, Datadog, and LLM tracing platforms
- Define and enforce SLOs, incident response processes, and reliability standards for ML systems in production
- Own infrastructure-as-code and platform tooling (Terraform, CLIs, internal frameworks) to improve developer velocity and consistency
- Drive security, IAM, and secrets management architecture ensuring compliance, least-privilege access, and data protection standards
- Collaborate with ML, product, and data teams to translate research and prototypes into production-ready systems
- Identify platform bottlenecks and lead initiatives to improve performance, cost efficiency, and deployment speed
- Provide technical leadership, mentorship, and architectural guidance across ML systems engineering initiatives

## Requirements

This role requires deep expertise in cloud infrastructure, ML systems, and production-grade platform engineering, with a strong focus on reliability, scalability, and security.

- 8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles, including hands-on ML/AI systems experience
- Strong expertise with Kubernetes (preferably EKS), including cluster operations, autoscaling, and workload orchestration
- Proficiency in infrastructure-as-code tools such as
How to Apply on Lever
- Lever uses a streamlined one-page form — apply in under 5 minutes.
- LinkedIn import works well; review parsed data before submitting.
- The cover letter field is optional but visible to reviewers — use it to differentiate.
- Referral codes from employees can significantly boost visibility of your application.