Nscale

Technology

Site Reliability Engineer

$100–170k United States; Canada Remote Friendly
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for Mid+ candidates.

The Brief

“Site Reliability Engineer at Nscale. Skills: Site Reliability Engineering, Automation, AI workloads. Build and improve automation.”

Industry & Context.

Technology
Problems you'll solve

Troubleshooting; Incident response; Root cause analysis

What They're Looking For.

Must Have

2-5 years SRE/Systems/Software Engineering, 2+ years programming skills, Working knowledge of Linux, Working knowledge of networking, Working knowledge of distributed systems, Experience troubleshooting production issues

Nice to Have

Exposure to cloud platforms, Exposure to Kubernetes, Exposure to virtualized/bare-metal environments, Experience in AI workloads, Experience in GPU workloads, Experience in HPC, Basic understanding of high-performance networking, Exposure to production monitoring/alerting

What You'll Do.

Improve infrastructure

Support development of operational systems

Support development of platform services

Maintain monitoring dashboards

Participate in incident response

Participate in troubleshooting

Participate in post-incident reviews

Investigate performance issues

Resolve performance issues

Investigate reliability issues

Resolve reliability issues

Improve system stability

Contribute to availability

Contribute to scalability

Contribute to operational efficiency

Learn from senior engineers

How You'll Work.

Team & Collaboration

Collaborate with Engineering; Collaborate with Networking; Collaborate with Infrastructure

Full Job Description

About Nscale

Nscale is the GPU cloud engineered for AI, purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute.

Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

What You’ll Be Doing

  • Help build and improve automation, tooling, and infrastructure that supports AI workloads
  • Support the development of operational systems and platform services
  • Assist in defining and maintaining basic SLOs/SLIs and monitoring dashboards
  • Participate in incident response, troubleshooting, and post-incident reviews
  • Investigate and help resolve performance and reliability issues across systems
  • Collaborate with Engineering, Networking, and Infrastructure teams to improve system stability
  • Contribute to improving availability, scalability, and operational efficiency
  • Learn from senior engineers and grow your expertise in reliability engineering

What You Bring

  • 2–5 years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering in a data center environment
  • 2+ years programming skills (e.g., Python, Go, or similar) with interest in automation and tooling
  • Working knowledge of Linux systems, networking concepts, and distributed systems
  • Experience troubleshooting system or application issues in production environments
  • Familiarity with monitoring or observability tools (e.g., logs, metrics, dashboards)
  • Strong willingness to learn and improve reliability and operational practices
  • Ability to work in fast-paced environments and collaborate across teams

Preferred Experience

  • Exposure to cloud platforms, Kubernetes, or virtualized/bare-metal environments
  • Experience in AI, GPU workloads, or h

How to Apply on Greenhouse

  • Create a Greenhouse profile before applying — it saves time across multiple applications.
  • Upload your resume as a PDF; the parser handles it better than Word.
  • Answer all knockout questions carefully — wrong answers auto-reject before a human sees you.
  • Enable email notifications to track application status in real time.
