xAI

Technology

Sr. Software Engineer (Data Center Automation)

$175–250k (approximate, AI-estimated)
Memphis, Tennessee, United States
Market Sentiment
HIGH DEMAND

Neural analysis suggests this role is optimal for senior candidates.

The Brief

“Sr. Software Engineer (Data Center Automation) at xAI. Skills: Data Center Automation, Reliability Engineering, Observability, Automation. Design, develop, and deploy scalable code and services. Automate reliability workflows.”

What You'll Achieve.

Reduce mean time to recovery (MTTR) by up to 50%; Mitigate downtime; Minimize impact on end users; Keep infrastructure resilient and scalable; Accelerate incident resolution

Industry & Context.

Technology
Problems you'll solve

Troubleshooting complex issues; Root cause analysis; Versatile problem-solving

Eligibility Requirements

Availability for on-call rotations

What They're Looking For.

Must Have

Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field (or equivalent professional experience); 3+ years of hands-on experience in site reliability engineering, infrastructure engineering, DevOps, or systems engineering; 3+ years of experience with Python; Solid experience with Linux systems administration; Practical knowledge of containerization and orchestration technologies; Experience implementing observability solutions; Familiarity with troubleshooting complex issues in distributed systems; Understanding of networking fundamentals; Experience participating in on-call rotations; Ability to collaborate effectively with cross-functional teams

Nice to Have

5+ years of experience in SRE or infrastructure roles; Hands-on experience operating or scaling Kubernetes clusters; Proficiency in Rust; Direct experience integrating software reliability tools with physical data center infrastructure; Exposure to advanced or innovative observability stacks; Experience building automated remediation systems; Background in optimizing Linux-based systems for AI workloads; Demonstrated success in reducing downtime; Prior work with bare-metal provisioning; Mentoring experience; Documentation skills; Comfort adapting quickly to new technologies

What You'll Do.

Design, develop, and deploy scalable code and services

Automate reliability workflows

Implement and maintain observability tools and practices that provide real-time insights into system health

Collaborate with cross-functional teams

Identify reliability bottlenecks

Automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation

Troubleshoot and resolve complex issues

Optimize Linux-based systems for performance, security, and reliability

Apply a working understanding of network topologies and concepts

Integrate observability into data center interconnects and facility-level controls

Participate in on-call rotations, post-incident reviews, and continuous improvement initiatives

Mentor junior team members

How You'll Work.

Team & Collaboration

Cross-functional teams; Software development teams; Network engineering teams; Site operations teams; Facility operations teams; Mechanical/electrical teams

Communication Scope

Share knowledge concisely and accurately with teammates

Process & Methodology

Error budgets; SLAs

Full Job Description

ABOUT xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

ABOUT THE ROLE

We are seeking a highly skilled Sr. Software Engineer to join our team in managing and enhancing reliability across a multi-data-center environment. This role focuses on automating processes, building and implementing robust observability solutions, and ensuring seamless operations for mission-critical AI infrastructure. The ideal candidate will combine strong coding abilities with hands-on data center experience to build scalable reliability services, optimize system performance, and minimize downtime, including through close partnership with facility operations to address physical infrastructure impacts. If you thrive in lightning-fast, distributed environments and are passionate about leveraging automation to drive efficiency, this is an opportunity to make a significant impact on our infrastructure's resilience and scalability.

In an era where AI workloads demand near-zero downtime, this position plays a pivotal role in bridging software engineering principles with physical data center realities. By prioritizing automation and observability, team members in this role can reduce mean time to recovery (MTTR) by up to 50% through proactive monitoring and automated remediation, based on industry benchmarks from high-scal…


How to Apply on Greenhouse

  • Create a Greenhouse profile before applying; it saves time across multiple applications.
  • Upload your resume as a PDF, which the parser typically handles more reliably than a Word document.
  • Answer all knockout questions carefully: a wrong answer can trigger automatic rejection before a human reviews your application.
  • Enable email notifications to track your application status in real time.
