Shield AI

deep-tech

Principal Engineer, AI Infrastructure

$320–490k · San Francisco, California, United States · Full time
The Brief

“Principal Engineer, AI Infrastructure at Shield AI. Skills: AI Infrastructure, ML Infrastructure at Scale, Data Platform, Compute Strategy, MLOps, Model Deployment. Define and operate the core AI and data platform across training, simulation, data management, evaluation, and deployment. Own where and how workloads run across on-premise, cloud, and hybrid environments.”

What You'll Achieve.

  • Faster iteration from idea to trained model to evaluated result
  • High utilization of compute resources with clear visibility into usage and cost
  • Simulation capacity that supports large-scale training without bottlenecks
  • Consistent end-to-end lifecycle: development, evaluation, deployment, monitoring, and retraining
  • Repeatable data loop: telemetry, scenario extraction, retraining, and redeployment
  • Reliable deployment of optimized models to edge systems
  • Broad platform adoption across autonomy programs
  • Repeatable approach for deploying AI infrastructure in customer environments
  • Representative performance targets: training iteration cycles measured in days, not weeks; sustained high utilization of GPU resources under production workloads

Industry & Context.

deep tech
Problems you'll solve

Ability to debug and resolve system issues when needed

Eligibility Requirements

classified systems, air-gapped systems, SCIFs

What They're Looking For.

Must Have

  • Experience building and operating ML infrastructure at scale (100+ GPU clusters, distributed systems)
  • Experience defining compute strategy, including on-premise vs cloud tradeoffs, capacity planning, and cost management
  • Understanding of ML workloads, including foundation models, RL/MARL, simulation-based training, and fine-tuning
  • Experience building data platforms with dataset versioning, lineage, and cataloging
  • Ability to debug and resolve system issues when needed

Nice to Have

  • Experience in defense or classified environments (e.g., air-gapped systems, SCIFs)
  • Experience with simulation-heavy ML systems (robotics, autonomy, or similar domains)
  • Experience deploying and optimizing models for edge hardware
  • Familiarity with HPC systems (schedulers, parallel storage, high-speed networking)

What You'll Do.

  • Define and operate the core AI and data platform across training, simulation, data management, evaluation, and deployment
  • Own where and how workloads run across on-premise, cloud, and hybrid environments
  • Drive capacity planning, …, and cost-per-compute decisions, including support for classified and air-gapped systems
  • Build infrastructure for distributed training (supervised learning, …, foundation models) and large-scale, multi-fidelity simulation
  • Ensure training and simulation systems operate together without bottlenecks
  • Ingest and manage multi-modal sensor data (EO, …)
  • Establish dataset versioning, …, and classification-aware storage and access controls
  • Establish a consistent workflow for experiment tracking, …, and automated validation
  • Implement evaluation and V&V gates so models meet defined standards before deployment
  • Own the pipeline from training to deployment, including model optimization (e.g. …), deployment to edge systems, …, and retraining triggers
  • Define how AI infrastructure is deployed in customer environments across on-premise, …, and sovereign settings
  • Establish a consistent approach that avoids one-off solutions while adapting to operational constraints
  • … and workflows across teams
  • Reduce duplication while maintaining flexibility where needed
  • Work directly with Hivemind and other autonomy teams to ensure the platform supports real workloads and evolves with program needs

How You'll Work.

Team & Collaboration

Cross-Team Partnership

How to Apply on Lever

  • Lever uses a streamlined one-page form — apply in under 5 minutes.
  • LinkedIn import works well; review parsed data before submitting.
  • The cover letter field is optional but visible to reviewers — use it to differentiate.
  • Referral codes from employees can significantly boost visibility of your application.
