NVIDIA
Senior Solutions Architect - AI Factory Deployment
“Senior Solutions Architect - AI Factory Deployment at NVIDIA. Skills: AI Factory Deployment, Linux-based GPU clusters, NCCL, AllReduce, AllToAll, Python, Shell, observability, automation, benchmarking. Responsibilities: develop, deploy, and validate AI factories end to end; run and debug AI/LLM workloads and benchmarks on Linux-based GPU clusters.”
What You'll Achieve.
Improve performance and scalability; improve benchmarks and validation; improve throughput, latency, and scaling efficiency; prepare AI factories for customer use
Industry & Context.
Serve as the expert when workloads or benchmarks do not perform flawlessly; investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform; troubleshoot and optimize complex distributed workloads
What They're Looking For.
Must Have
- Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field
- 6+ years of experience managing Linux-based systems in HPC, distributed systems, or extensive AI/ML settings
- Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL
- Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training
- Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow
- Proficiency with Python and Shell for scripting, automation, and tooling
- Experience with benchmarking (crafting, executing, and interpreting performance benchmarks)
- Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads
- Communication skills and the ability to work effectively with cross-functional teams
Nice to Have
- Experience with AI factory or large-scale AI infrastructure build, deployment, or operations
- Background in HPC performance engineering, SRE, or systems performance analysis for GPU-accelerated environments
- Familiarity with observability stacks (e.g., metrics/monitoring, logging, tracing systems) used for large distributed systems
- Experience building automation and CI-style pipelines for running and validating benchmarks at scale
- Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions
What You'll Do.
- Develop, deploy, and validate AI factories end to end.
- Run and debug AI/LLM workloads and benchmarks on Linux-based GPU clusters, using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability.
- Bring to bear observability and automation to improve benchmarks and validation.
- Serve as the expert when workloads or benchmarks do not perform flawlessly.
- Ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments.
- Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
- Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
- Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
- Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
- Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
- Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
- Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
- Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
- Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
- Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.
How You'll Work.
Team & Collaboration
Collaborate across NVIDIA; work closely with hardware, software, networking, datacenter, and product teams
Communication Scope
Communication skills; ability to work effectively with cross-functional teams
Full Job Description
We are seeking an ambitious Senior Solutions Architect - AI Factory Deployment to join our NVIDIA Infrastructure Specialists team in Santa Clara! This role is uniquely positioned to develop, deploy, and validate AI factories end to end. You will focus on running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters, using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability. As part of our world-class team, you will bring to bear observability and automation to improve benchmarks and validation. You will serve as the expert when workloads or benchmarks do not perform flawlessly. You will collaborate across NVIDIA to ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments.

**What You Will be Doing:**

* Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
* Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
* Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
* Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
* Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.
* Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.
* Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.
* Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
* Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
* Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.
How to Apply on Workday
- Workday has a multi-step form — save your progress after every section.
- "Apply With LinkedIn" can fail or lose data; manual entry is more reliable.
- Watch for the "Submit for Review" final step — hitting "Save" alone does not submit.
- Job requisition numbers are useful when following up with HR by email.