NVIDIA

Senior Solutions Architect - AI Factory Deployment

$184–357k · United States · FULL TIME · Remote Friendly

Market Sentiment

HIGH DEMAND

Neural analysis suggests this role is optimal for Senior candidates.

The Brief

“Senior Solutions Architect - AI Factory Deployment at NVIDIA. Skills: AI Factory Deployment, Linux-based GPU clusters, NCCL, AllReduce, AllToAll, Python, Shell, observability, automation, benchmarking. You will develop, deploy, and validate AI factories end to end, running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters.”

What You'll Achieve.

Improve performance and scalability; improve benchmarks and validation; improve throughput, latency, and scaling efficiency; prepare AI factories for customer use

Industry & Context.

Problems you'll solve

Serve as the expert when workloads or benchmarks do not perform flawlessly; investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform; troubleshoot and optimize complex distributed workloads

What They're Looking For.

Must Have

Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field; 6+ years of experience managing Linux-based systems in HPC, distributed systems, or extensive AI/ML settings; hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL; solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training; familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow; proficiency with Python and Shell/Bash for scripting, automation, and tooling; experience with benchmarking (crafting, executing, and interpreting performance benchmarks); comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads; communication skills and the ability to work effectively with cross-functional teams

Nice to Have

Experience with AI factory or large-scale AI infrastructure build, deployment, or operations; background in HPC performance engineering, SRE, or systems performance analysis for GPU-accelerated environments; familiarity with observability stacks (e.g., metrics/monitoring, logging, tracing systems) used for large distributed systems; experience building automation and CI-style pipelines for running and validating benchmarks at scale; demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions

What You'll Do.

Develop, deploy, and validate AI factories end to end

Run and debug AI/LLM workloads and benchmarks on Linux-based GPU clusters

Use NCCL and collectives like AllReduce and AllToAll to improve performance and scalability

Bring to bear observability and automation to improve benchmarks and validation

Serve as the expert when workloads or benchmarks do not perform flawlessly

Collaborate across NVIDIA to ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments

Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters

Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks

Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis

Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform

Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health

Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks

Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll

Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency

Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use

Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams

How You'll Work.

Team & Collaboration

Collaborate across NVIDIA; work closely with hardware, software, networking, datacenter, and product teams; work effectively with cross-functional teams

Communication Scope

communication skills; ability to work effectively with cross-functional teams

Full Job Description

We are seeking an ambitious Senior Solutions Architect - AI Factory Deployment to join our NVIDIA Infrastructure Specialists team in Santa Clara! This role is uniquely positioned to develop, deploy, and validate AI factories end to end. You will focus on running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters, using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability. As part of our world-class team, you will bring to bear observability and automation to improve benchmarks and validation. You will serve as the expert when workloads or benchmarks do not perform flawlessly. You will collaborate across NVIDIA to ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments.

**What You Will be Doing:**

* Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
* Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
* Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
* Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
* Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
* Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
* Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
* Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
* Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
* Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

SKILL SIGNAL 46 detected · ranked by frequency

Python ×4

Shell ×4

AI Factory Deployment ×3

NCCL ×3

AllReduce ×3

AllToAll ×3

observability ×3

automation ×3

benchmarking ×3

running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters ×3

using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability ×3

observability and automation to improve benchmarks and validation ×3

Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters ×3

Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks ×3

Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis ×3

Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform ×3

Build and improve observability for AI factories (metrics, logs, traces, dashboards) ×3

Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks ×3

Examine communication patterns and NCCL usage for AI/LLM workloads ×3

Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency ×3

Work closely with hardware, software, networking, datacenter, and product teams ×3

Contribute to documentation, guidelines, and readiness collateral ×3

managing Linux-based systems ×3

running AI/ML workloads on multi-GPU and/or multi-node clusters ×3

collective communication patterns ×3

scripting ×3

tooling ×3

crafting, executing, and interpreting performance benchmarks ×3

troubleshoot and optimize complex distributed workloads ×3

building automation and CI-style pipelines ×3

Linux-based GPU clusters ×2

PyTorch ×2

BEHAVIOURAL

ambitious; work effectively with cross-functional teams

Role Details

Seniority senior

Experience 6–10 yrs

Level Senior

Type FULL TIME

Education Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field

Salary Band 150k-200k

AI-Extracted Insights

Domain Areas

ai-factory-deployment, hpc, distributed-systems, ai-ml-settings, llm-training, llm-inference, collective-communication-patterns, gpu-accelerated-environments

How to Apply on Workday

Workday has a multi-step form — save your progress after every section.
"Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
Job requisition numbers are useful when following up with HR by email.
