Machine Learning Engineer - Training & Infrastructure

P-1 AI

9 months ago

San Francisco, California, United States

Hybrid

Full-time

Junior Level (1-3 years)

Job Description

Position Overview

P-1 AI is on a mission to build an engineering AGI that helps mankind conquer and shape the physical world. Our flagship product, Archie, is an AI engineer capable of quantitative and spatial reasoning over physical product domains, performing at the level of an entry-level design engineer. Backed by a $23 million seed round led by Radical Ventures—with participation from luminaries in AI and industry—we are determined to put an Archie on every engineering team across industrial companies.

We are seeking an experienced engineer to take ownership of LLM training operations within our applied research team. In this role, your focus will be on ensuring that large-scale GPU training runs are reliable, efficient, and fast on our dedicated mid-size GPU cluster and potentially on cloud platforms. You will collaborate closely with researchers and ML engineers to scale experiments across multi-node GPU clusters, with work ranging from debugging NCCL deadlocks to optimizing FSDP configurations.

Key Responsibilities

Own the training pipeline for large-scale LLM fine-tuning and post-training workflows
Configure, launch, monitor, and debug multi-node distributed training jobs using FSDP, DeepSpeed, or custom wrappers
Contribute to upstream and internal forks of training frameworks like TorchTune, TRL, and Hugging Face Transformers
Tune training parameters, memory footprints, and sharding strategies for optimal throughput
Collaborate with infrastructure and systems teams to maintain the health and utilization of our GPU clusters and their supporting stack (e.g., InfiniBand, NCCL, Slurm, Kubernetes)
Implement features or fixes to unblock novel use cases in our LLM training stack

Required Qualifications

3+ years working with large-scale ML systems or training pipelines
Deep familiarity with PyTorch, particularly distributed training via FSDP, DeepSpeed, or DDP
Proficiency with training libraries such as TorchTune, Accelerate, or Trainer APIs
Hands-on experience with multi-node GPU training, including profiling, debugging, and optimization
Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
A passion for bridging research with engineering to turn concepts into robust, high-performance systems

Preferred Qualifications

Experience maintaining Slurm, Ray, or Kubernetes clusters
Contributions to open-source ML training frameworks
Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed-precision training
Familiarity with on-policy reinforcement learning setups that run inference for policy rollouts, such as GRPO, PPO, or A2C
Startup experience

Required Skills

Slurm

NCCL

GPU Clusters

Distributed Training

DeepSpeed

PyTorch

CUDA

Model Partitioning

Troubleshooting

FSDP

Kubernetes

Optimization