Machine Learning Engineer - Training & Infrastructure

P-1 AI
9 months ago
San Francisco, California, United States
Hybrid
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

P-1 AI is on a mission to build an engineering AGI that helps mankind conquer and shape the physical world. Our flagship product, Archie, is an AI engineer capable of quantitative and spatial reasoning over physical product domains, performing at the level of an entry-level design engineer. Backed by a $23 million seed round led by Radical Ventures—with participation from luminaries in AI and industry—we are determined to put an Archie on every engineering team across industrial companies.

We are seeking an experienced engineer to take ownership of LLM training operations within our applied research team. In this role, you will focus on making large-scale GPU training runs reliable, efficient, and fast on our dedicated mid-size GPU cluster and potentially on cloud platforms. You will work closely with researchers and ML engineers to scale experiments across multi-node GPU clusters, with work ranging from debugging NCCL deadlocks to optimizing FSDP configurations.

Key Responsibilities

  • Own the training pipeline for large-scale LLM fine-tuning and post-training workflows
  • Configure, launch, monitor, and debug multi-node distributed training jobs using FSDP, DeepSpeed, or custom wrappers
  • Contribute to upstream and internal forks of training frameworks like TorchTune, TRL, and Hugging Face Transformers
  • Tune training parameters, memory footprints, and sharding strategies for optimal throughput
  • Collaborate with infrastructure and systems teams to maintain the health and utilization of our GPU clusters (e.g., InfiniBand, NCCL, Slurm, Kubernetes)
  • Implement features or fixes to unblock novel use cases in our LLM training stack

Required Qualifications

  • 3+ years working with large-scale ML systems or training pipelines
  • Deep familiarity with PyTorch, particularly distributed training via FSDP, DeepSpeed, or DDP
  • Proficiency with training libraries such as TorchTune, Accelerate, or Trainer APIs
  • Hands-on experience with multi-node GPU training, including profiling, debugging, and optimization
  • Understanding of low-level components like NCCL, InfiniBand, CUDA memory, and model partitioning strategies
  • A passion for bridging research with engineering to turn concepts into robust, high-performance systems

Preferred Qualifications

  • Experience maintaining Slurm, Ray, or Kubernetes clusters
  • Contributions to open-source ML training frameworks
  • Exposure to model scaling laws, checkpointing formats (e.g., HF sharded safetensors vs. distcp), or mixed-precision training
  • Familiarity with on-policy reinforcement learning setups that combine training with inference (policy rollouts), such as GRPO, PPO, or A2C
  • Startup experience

Required Skills

Slurm
NCCL
GPU Clusters
Distributed Training
DeepSpeed
PyTorch
CUDA
Model Partitioning
Troubleshooting
FSDP
Kubernetes
Optimization