Engineering Leader – AI & Machine Learning Operations (AIOps)

CloudBees Inc
7 months ago
Los Angeles, California, United States
Remote
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

CloudBees is the leading software delivery platform for modern enterprises, enabling companies to continuously innovate at scale. As a startup in the DevOps space, we empower developers and teams to build, test, and deploy software faster and more reliably. We are seeking a visionary, hands-on Engineering Leader to drive our Agentic & AI Operations (AIOps) strategy, with a focus on platform reliability, scalability, and maintainability, and to lead a growing team building robust, scalable AI & ML infrastructure and pipelines.

Key Responsibilities

  • Lead and scale a team responsible for AIOps, including model deployment, monitoring, and lifecycle management.
  • Architect and implement AI/ML pipelines that are scalable, observable, and reproducible.
  • Collaborate with cross-functional teams (data science, DevOps, product) to integrate AI/ML systems into our SaaS platform.
  • Establish best practices for AI/ML experimentation, CI/CD for models, data versioning, and model governance.
  • Own the full stack of AIOps infrastructure, from data ingestion to real-time inference systems.
  • Drive technical vision and roadmap for ML platform development.
  • Act as a mentor and coach, helping engineers grow in a fast-paced, startup environment.
  • Manage a team of 5+ and launch new platforms from 0 to 1, driving adoption internally and externally.

Required Qualifications

  • 7+ years of engineering experience in platform engineering, system development, or related roles, including at least 3 years in leadership.
  • 3 years of experience with large-scale systems emphasizing reliability, scalability, and maintainability, plus 1 year with AI/ML systems.
  • Strong hands-on experience with MLOps tools (e.g., MLflow, Kubeflow, SageMaker, Airflow, Metaflow).
  • Proven track record of building ML pipelines in production environments.
  • Experience with cloud infrastructure (AWS, GCP, or Azure) and container orchestration (Kubernetes).
  • Deep knowledge of CI/CD practices as they relate to the ML lifecycle.
  • Prior experience in a startup or fast-paced SaaS environment.
  • Strong collaboration and communication skills.
  • Experience deploying and managing managed LLM services such as Amazon Bedrock or Vertex AI.

Preferred Qualifications

  • Experience integrating ML capabilities into developer-centric tools or platforms.
  • Familiarity with data observability and ML monitoring tools (e.g., Evidently AI, Prometheus/Grafana for models).
  • Knowledge of data privacy, compliance, and security in ML systems.

Benefits & Perks

  • Work at the forefront of DevOps innovation and shape how ML supports developer productivity.
  • Join a high-impact, mission-driven startup backed by top investors.
  • Enjoy a flexible remote work culture with global teammates.
  • Competitive compensation, stock options, and benefits.

Required Skills

System Architecture
DevOps
AI/ML Operations
MLOps Tools (MLflow, Kubeflow, SageMaker, etc.)
Team Leadership
Container Orchestration (Kubernetes)
CI/CD for ML
ML Pipeline Development
Cloud Infrastructure