Shield AI

deep-tech

Principal Engineer, AI Infrastructure

$320–490k · San Francisco, California, United States · Full Time

The Brief

“Principal Engineer, AI Infrastructure at Shield AI. Skills: AI Infrastructure, ML Infrastructure at Scale, Data Platform, Compute Strategy, MLOps, Model Deployment. Define and operate the core AI and data platform across training, simulation, data management, evaluation, and deployment. Own where and how workloads run across on-premise, cloud, and hybrid environments”

What You'll Achieve.

Faster iteration from idea to trained model to evaluated result

High utilization of compute resources with clear visibility into usage and cost

Simulation capacity that supports large-scale training without bottlenecks

Consistent end-to-end lifecycle: development, evaluation, deployment, monitoring, and retraining

Repeatable data loop: telemetry, scenario extraction, retraining, and redeployment

Reliable deployment of optimized models to edge systems

Broad platform adoption across autonomy programs

Repeatable approach for deploying AI infrastructure in customer environments

Representative performance targets: training iteration cycles measured in days, not weeks; sustained high utilization of GPU resources under production workloads

Industry & Context.

deep tech

Problems you'll solve

Ability to debug and resolve system issues when needed

Eligibility Requirements

classified systems, air-gapped systems, SCIFs

What They're Looking For.

Must Have

Experience building and operating ML infrastructure at scale (100+ GPU clusters, distributed systems)

Experience defining compute strategy, including on-premise vs cloud tradeoffs, capacity planning, and cost management

Understanding of ML workloads, including foundation models, RL/MARL, simulation-based training, and fine-tuning

Experience building data platforms with dataset versioning, lineage, and cataloging

Ability to debug and resolve system issues when needed

Nice to Have

Experience in defense or classified environments (e.g., air-gapped systems, SCIFs)

Experience with simulation-heavy ML systems (robotics, autonomy, or similar domains)

Experience deploying and optimizing models for edge hardware

Familiarity with HPC systems (schedulers, parallel storage, high-speed networking)

What You'll Do.

Define and operate the core AI and data platform across training, simulation, data management, evaluation, and deployment

Own where and how workloads run across on-premise, cloud, and hybrid environments

Drive capacity planning, …, and cost-per-compute decisions, including support for classified and air-gapped systems

Build infrastructure for distributed training (supervised learning, …, foundation models) and large-scale, multi-fidelity simulation

Ensure training and simulation systems operate together without bottlenecks

Ingest and manage multi-modal sensor data (EO, …)

Establish dataset versioning, …, and classification-aware storage and access controls

Establish a consistent workflow for experiment tracking, …, and automated validation

Implement evaluation and V&V gates so models meet defined standards before deployment

Own the pipeline from training to deployment, including model optimization (e.g. …), deployment to edge systems, and retraining triggers

Define how AI infrastructure is deployed in customer environments across on-premise, …, and sovereign settings

Establish a consistent approach that avoids one-off solutions while adapting to operational constraints

… and workflows across teams

Reduce duplication while maintaining flexibility where needed

Work directly with Hivemind and other autonomy teams to ensure the platform supports real workloads and evolves with program needs

How You'll Work.

Team & Collaboration

Cross-Team Partnership

Skill Signal 43 detected

Core

ML Infrastructure at Scale ×5

Compute Strategy ×5

model optimization ×4

HPC systems ×4

Required

AI Infrastructure ×3

Data Platform ×3

MLOps ×3

capacity planning ×3

cost management ×3

ML workloads ×3

data platforms ×3

dataset versioning ×3

lineage ×3

cataloging ×3

Nice to have

distributed systems

foundation models

RL/MARL

supervised learning

simulation

distillation

quantization

pruning

edge systems

schedulers

Behavioural

Cross-Team Partnership

Role Details

Work Mode

Type

FULL TIME

Experience

8–15 yrs

Salary Band

200k+

AI-Extracted Insights

Domain Areas

defense-applications

autonomy-systems

air-maritime-and-space-platforms

complex-and-contested-environments

simulation-driven-development

multi-modal-sensor-data
