Mastercard
Financial Services
Lead Engineer, Site Reliability Engineering
Neural analysis suggests this role is optimal for Lead candidates.
“Lead Engineer, Site Reliability Engineering at Mastercard. Skills: Site Reliability Engineering, infrastructure troubleshooting, observability, monitoring, automation, Infrastructure as Code, incident management, root cause analysis. Lead continuous assessments of the application infrastructure supporting critical Mastercard applications, focusing on health, performance, monitoring and alerting, and capacity analysis. Collaborate with Product and Development teams to forecast growth requirements”
What You'll Achieve.
Ensure the availability, scalability, and resilience of our network; reduce Mean Time to Detect (MTTD) and Mean Time to Mitigate (MTTM); certify environment readiness before customer traffic is routed to it; strengthen multi-disciplinary SRE team capabilities
Industry & Context.
Excellent infrastructure troubleshooting and analytical problem-solving skills; proven ability to triage and investigate complex issues; demonstrated ability to troubleshoot complex production issues, perform root cause analysis, and drive long-term corrective actions; effective incident management skills with a structured, analytical approach to problem solving
periodic on-call responsibilities
What They're Looking For.
Must Have
- 5–10 years of experience in an SRE or SRE-related operations role
- 3+ years supporting e-commerce, financial services, or large-scale SaaS platforms
- Excellent infrastructure troubleshooting and analytical problem-solving skills
- Hands-on experience with observability and monitoring tools such as Splunk, Dynatrace, or equivalent, with a proven ability to triage and investigate complex issues
- Familiarity with network telemetry tools such as SolarWinds and NetScout
- Proficiency in packet-level debugging, including capturing traffic with tools like tcpdump and analyzing packets using Wireshark
- Broad understanding of end-to-end infrastructure supporting payment platforms, spanning platform services, networking, databases, and storage
- Experience with automation and Infrastructure as Code tools such as Chef, Ansible, and Terraform, as well as structured data formats (JSON/YAML)
- Excellent communication skills with the ability to coordinate cross-functional troubleshooting efforts and lead RCA processes to closure
- Demonstrated ability to troubleshoot complex production issues, perform root cause analysis, and drive long-term corrective actions
- Experience partnering with development teams to shape architecture, define SLIs/SLOs, and embed reliability into services from design through operation
- Understanding of monitoring and observability ecosystems, including Prometheus, Grafana, ELK/EFK, Splunk, and OpenTelemetry
- Effective incident management skills with a structured, analytical approach to problem solving
Nice to Have
Kubernetes a plus
What You'll Do.
Lead continuous assessments of the application infrastructure supporting critical Mastercard applications, focusing on health, performance, monitoring and alerting, and capacity analysis
Collaborate with Product and Development teams to forecast growth requirements and ensure scalability and resiliency
Champion observability as a core principle for infrastructure services by assessing environments and technologies to uncover gaps in monitoring and alerting
Design and implement strategies to close these gaps, ensuring all infrastructure telemetry is integrated into a unified, single-pane-of-glass view
Build custom dashboards to investigate and perform root cause analysis on complex issues
Lead regular incident reviews with internal support teams to ensure root causes are identified
Develop and implement strategies to remediate or mitigate risks when patterns of failure or compatibility issues between software and infrastructure emerge
Leverage automation and AI technologies to enhance proactive issue detection and enable self-healing capabilities, reducing Mean Time to Detect (MTTD) and Mean Time to Mitigate (MTTM)
Develop testing and validation plans for new environment builds, disaster recovery exercises, and post-maintenance activities to certify environment readiness before customer traffic is routed to it
Champion continuous learning and knowledge sharing across networking and other infrastructure disciplines to strengthen multi-disciplinary SRE team capabilities
Lead training initiatives for team members and Product and Development on networking aspects of the platforms
Evaluate vendor hardware and software upgrade roadmaps, and conduct proof-of-concept (POC) testing to identify potential risks and opportunities for improvement in upcoming releases
How You'll Work.
Team & Collaboration
Collaborate with Product and Development teams; coordinate cross-functional troubleshooting efforts; partner with development teams to shape architecture, define SLIs/SLOs, and embed reliability into services from design through operation; share knowledge across networking and other infrastructure disciplines; lead training initiatives for team members and Product and Development
Communication Scope
Excellent communication skills; ability to coordinate cross-functional troubleshooting efforts; lead RCA processes to closure
Process & Methodology
Lead continuous assessments, Lead regular incident reviews, Lead training initiatives, Develop testing and validation plans, conduct proof-of-concept (POC) testing
Full Job Description
**Our Purpose**
_Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we’re helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships and networks combine to deliver a unique set of products and services that help people, businesses and governments realize their greatest potential._
**Title and Summary**
### Lead Engineer, Site Reliability Engineering
Our Purpose: Mastercard powers economies and empowers people across more than 200 countries and territories worldwide. We are committed to building an inclusive, digital economy that benefits everyone, everywhere, by making transactions safe, simple, smart, and accessible. Through secure data, trusted networks, strong partnerships, and relentless innovation, we help individuals, financial institutions, governments, and businesses unlock their greatest potential.
About the Role: Mastercard’s Program-aligned Site Reliability Engineering (SRE) teams are dedicated to delivering a seamless experience for our customers. We achieve this by maintaining every aspect of our Programs' infrastructure and technology ecosystem to the highest standards, ensuring compliance with rigorous security requirements. Within Mastercard, SRE focuses on the reliability and performance of core infrastructure, networks, and foundational services that power our applications. Our mission is to ensure these components operate with excellence, enabling applications to deliver an outstanding customer experience. In this role, you will join our Payments Network SRE team and take ownership of continuously assessing and elevating the end-to-end service quality of our platform. You will leverage data to drive root cause analysis and deliver strategic insights to key stakeholders on r
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.