Senior GPU Infrastructure Engineer II

DigitalOcean

7 months ago

San Francisco, California, United States

Remote

Full-time

Junior Level (1-3 years)

Job Description

Position Overview

Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you'll find your place here.

As a Bare Metal GPU Infrastructure Engineer at DigitalOcean, you will join a dynamic team dedicated to revolutionizing cloud computing. You'll be part of a highly productive team that runs and supports the GPU Bare Metal service across multiple regions and reports directly to the Sr Manager of GPU Compute within the IaaS organization.

Key Responsibilities

Contribute to a rapidly growing Bare Metal GPU product by applying security and operational best practices across a fleet of infrastructure servers in multiple regions.
Help design and implement further self-service capabilities for customers by providing reliable and predictable API features for upstack service teams.
Engage in support escalations when necessary, capture trends, and lead internal projects to improve the overall product experience.
Continuously test hardware platforms to identify performance regressions related to firmware, software, or hardware issues.

Required Qualifications

Proven ability to orchestrate bare metal Linux systems at scale, including building automation for firmware updates and BIOS configuration management, and configuring PXE environments.
Deep Linux systems experience with low-level troubleshooting, configuration management, security best practices, and monitoring and alerting.
Strong automation mindset with expert knowledge in one or more orchestration tools such as MaaS, Salt, Chef, Ansible, or Puppet.
Excellent communication skills with the ability to write detailed documentation or lead knowledge sharing sessions with operations teams.

Preferred Qualifications

Hands-on experience in High Performance Computing (HPC) clustered environments from Nvidia or AMD, including performing automated wide-scale testing on NCCL or similar frameworks.
Network engineering experience with VyOS platforms.

Benefits & Perks

Innovative Purpose: Be a part of a cutting-edge technology company that simplifies cloud and AI, empowering builders to change the world.
Career Development: Collaborate with some of the smartest minds in the industry and take advantage of reimbursement for relevant conferences, training, and education, plus access to over 10,000 LinkedIn Learning courses.
Wellness & Flexibility: Enjoy a competitive array of benefits, including a one-time work-from-home stipend, a wellness allowance, and a flexible time-off policy.
Compensation & Rewards: Competitive base salary of $178,000.00 - $225,000.00, potential bonuses based on company and individual performance, plus equity compensation options.
Diversity & Inclusion: Join an organization that values diverse perspectives and fosters an inclusive environment for all.
Remote Role: This position is fully remote, offering the flexibility to work from anywhere.

Required Skills

Firmware and BIOS configuration
Automation using MaaS, Salt, Chef, Ansible, or Puppet
PXE environment configuration
High Performance Computing (HPC) with Nvidia/AMD
Security best practices
Hardware performance testing
Bare metal Linux systems management
Linux troubleshooting and low-level system analysis
Wide-scale automated testing (e.g., NCCL)
Network engineering with VyOS
API design for self-service capabilities
Configuration management