Mastercard
Lead Site Reliability Engineer
“Lead Site Reliability Engineer at Mastercard. Skills: Site Reliability Engineering, Cloud Engineering, Kubernetes, Hybrid Cloud. Define and implement SLIs, SLOs, and error budgets.”
What You'll Achieve.
Improved cross-cloud resiliency; Improved DR posture; Reduced hybrid networking incidents; Improved SLO compliance; Reduced MTTR; Increased automation coverage; Reduced change failure rate
Industry & Context.
Troubleshooting; Root cause analysis
Rotational on-call shifts; Weekends and off-hours support
What They're Looking For.
Must Have
7–10+ years in SRE/DevOps/Cloud Engineering; Deep hands-on experience in AWS and Azure; Deep hands-on experience in hybrid networking; Deep hands-on experience in Kubernetes (cloud & on-prem); Knowledge of Linux internals; Knowledge of TCP/IP, DNS, Load Balancing; Knowledge of TLS/PKI and certificate lifecycle; Knowledge of Distributed systems architecture; Scripting/programming skills (Python preferred); Experience designing cross-cloud DR and failover models; Experience with infrastructure as code; Experience with GitOps
Nice to Have
AWS Solutions Architect certification, Azure Architect certification, Azure DevOps Engineer certification, Certified Kubernetes Administrator (CKA)
What You'll Do.
Architect high-availability designs
Eliminate single points of failure
Conduct resilience validation
Perform chaos testing
Model failure scenarios
Engineer and operate workloads across AWS
Engineer and operate workloads across Azure
Design cross-cloud networking
Implement workload portability
Implement cloud-agnostic deployment strategies
Optimize cost across providers
Optimize performance across providers
Optimize reliability across providers
Design cloud-native autoscaling
Design load balancing strategies
Design traffic routing strategies
Integrate on-prem infrastructure with cloud
Integrate Active Directory / IAM federation
Integrate hybrid DNS architecture
Integrate secure certificate lifecycle management
Troubleshoot hybrid connectivity issues
Manage hybrid Kubernetes deployments
Manage private registry integrations
Support legacy-to-cloud modernization
Architect Kubernetes clusters
Operate Kubernetes clusters
Optimize cluster autoscaling
Optimize resource allocation
Optimize cluster performance
Implement cluster security hardening
Implement RBAC governance
Troubleshoot CNI issues
Troubleshoot ingress controller issues
Troubleshoot service mesh issues
Troubleshoot pod networking issues
Implement GitOps-driven deployments
Build unified observability
Implement centralized logging
Design distributed tracing
Engineer proactive alerting
Engineer resilient CI/CD pipelines
Implement infrastructure as code
Automate certificate rotation
Automate autoscaling policies
Automate patch orchestration
Automate drift detection
Improve deployment reliability
Lead technical investigation of DNS failures
Lead technical investigation of TLS/PKI failures
Lead technical investigation of network latency
Lead technical investigation of memory leaks
Lead technical investigation of kernel-level issues
Lead technical investigation of thread contention
Lead technical investigation of CPU throttling
Perform packet-level debugging
Analyze distributed system failures
How You'll Work.
Team & Collaboration
Cross-cloud failover patterns; Follow-the-sun model
Process & Methodology
GitOps
Full Job Description
**Our Purpose**

_Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we’re helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships and networks combine to deliver a unique set of products and services that help people, businesses and governments realize their greatest potential._

**Title and Summary**

### Lead Site Reliability Engineer

### Role Overview

We are seeking a highly technical Lead Site Reliability Engineer (SRE) to architect, engineer, and operate highly reliable, scalable, and secure platforms across multi-cloud (AWS, Azure) and hybrid (on-prem + cloud) environments.

This is a deeply hands-on engineering role requiring expertise in distributed systems, Kubernetes, hybrid networking, automation, CI/CD, observability, and production incident leadership. The Lead SRE will serve as the technical authority for reliability across interconnected cloud and datacenter ecosystems.

Core Responsibilities

1\. Reliability Engineering Across Hybrid & Multi-Cloud

• Define and implement SLIs, SLOs, and error budgets across cloud-native and on-prem workloads.
• Architect high-availability designs spanning:
  o AWS and Azure regions
  o On-prem datacenters
  o Cross-cloud failover patterns
• Design DR strategies (RTO/RPO driven) across hybrid environments.
• Eliminate single points of failure across network, compute, storage, and DNS layers.
• Conduct resilience validation, chaos testing, and failure scenario modeling.

2\. Multi-Cloud Architecture & Engineering

• Engineer and operate workloads across:
  o Amazon Web Services
  o Microsoft Azure
• Design cross-cloud networking (VPN, ExpressRoute, Direct Connect, Transit Gateway).
• Implement workload portability and cloud-agnostic deployment strategies.
• Optimize cost, performance, and reliability acros
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.