Company

DevOps

Site Reliability Engineer (SRE)

$1200–1800k (approx., AI-estimated) · Taipei City, Taiwan · Full-time
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for mid-level candidates.

The Brief

“Site Reliability Engineer (SRE). Skills: Site Reliability Engineering, DevOps, Cloud Infrastructure, Observability. Design and maintain monitoring systems.”

Industry & Context.

DevOps
Problems you'll solve

Troubleshooting; Root-cause analysis; Anomaly detection

Eligibility Requirements

On-call rotations

What They're Looking For.

Must Have

3+ years of experience, Hands-on experience with AWS, Solid Linux administration, Experience with Docker, Experience with Kubernetes, Experience with Terraform, Proficiency in Python, Proficiency in Bash, Understanding of networking fundamentals, Understanding of infrastructure security, Experience supporting production systems, Participating in incident response

Nice to Have

Experience operating edge computing, Experience operating IoT deployments, Familiarity with zero-trust access, Experience in security operations, Experience in threat detection, Experience in infrastructure security, Exposure to AI infrastructure, Exposure to LLM-based applications, Exposure to workflow automation, Knowledge of AI-Ops, Knowledge of anomaly detection, Knowledge of intelligent monitoring, Familiarity with ISO 27001

What You'll Do.

Design monitoring systems

Maintain monitoring systems

Design alerting systems

Maintain alerting systems

Design dashboarding systems

Maintain dashboarding systems

Build visibility into metrics

Build visibility into logs

Build visibility into traces

Build visibility into analytics

Define reliability targets

Manage reliability targets

Develop proactive monitoring

Develop anomaly detection

Deploy containerized workloads

Manage containerized workloads

Optimize containerized workloads

Maintain cloud infrastructure

Improve system performance

Improve system availability

Improve operational efficiency

Support infrastructure provisioning

Implement access controls

Implement audit mechanisms

Monitor cybersecurity threats

Monitor unauthorized access

Monitor service disruptions

Develop alerting procedures

Develop response procedures

Contribute to best practices

Automate operational tasks

Support CI/CD workflows

Support deployment automation

Promote documentation

Promote operational standards

Promote continuous improvement

Participate in on-call

Participate in incident management

Lead troubleshooting efforts

Conduct root-cause analysis

Conduct post-mortem reviews

Drive long-term improvements

Work closely with software teams

Work closely with AI teams

Work closely with ML teams

Work closely with hardware teams

Work closely with product teams

Ensure services are production-ready

Support cloud environments

Support edge computing environments

How You'll Work.

Team & Collaboration

Cross-functional teams; Software teams; AI teams; Machine learning teams; Hardware teams; Product teams

Full Job Description

## What you will do

Reliability & Observability

  • Design and maintain monitoring, alerting, and dashboarding systems across cloud and edge environments.
  • Build visibility into system health through metrics, logs, traces, and performance analytics.
  • Define and manage SLIs, SLOs, and service reliability targets.
  • Develop proactive monitoring and anomaly detection capabilities to identify issues before they impact users.

Cloud Infrastructure & Platform Operations

  • Deploy, manage, and optimize containerized workloads running on Kubernetes.
  • Maintain scalable cloud infrastructure across production environments.
  • Improve system performance, availability, and operational efficiency.
  • Support infrastructure provisioning through Infrastructure-as-Code practices.

Security & Access Management

  • Implement secure access controls and audit mechanisms across infrastructure environments.
  • Monitor for cybersecurity threats, unauthorized access attempts, and service disruptions.
  • Develop alerting and response procedures for security-related incidents.
  • Contribute to operational security best practices and governance initiatives.

Automation & Engineering Excellence

  • Automate repetitive operational tasks to reduce manual effort and improve reliability.
  • Build tooling and scripts to streamline infrastructure operations.
  • Support CI/CD workflows and deployment automation.
  • Promote documentation, operational standards, and continuous improvement.

Incident Response & Reliability Engineering

  • Participate in on-call rotations and incident management.
  • Lead troubleshooting efforts during production incidents.
  • Conduct root-cause analysis and post-mortem reviews.
  • Drive long-term improvements that enhance system resilience.

Cross-Functional Collaboration

  • Work closely with software, AI, machine learning, hardware, and product teams.
  • Ensure new services are production-ready with appropriate monitoring, security, and reliability measures.
  • Support the operational needs of both cloud-based and distributed edge computing


How to Apply on Lever

  • Lever uses a streamlined one-page form — apply in under 5 minutes.
  • LinkedIn import works well; review parsed data before submitting.
  • The cover letter field is optional but visible to reviewers — use it to differentiate.
  • Referral codes from employees can significantly boost visibility of your application.
