HPC Engineer

Sabre Systems Inc.
Posted 2 months ago
Arlington, VA, United States
On-site
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

Sabre is seeking an HPC Data Storage Engineer to support a mission-critical Department of Defense (DoD) program dedicated to high-performance computing operations. As an HPC Engineer, you will design, optimize, and maintain advanced high-performance computing environments that power large-scale data processing, simulation, and research operations. Your contributions will directly enable data-intensive research efforts that are essential to national defense.

Location: Arlington, VA

Compensation: $90,000.00 - $200,000.00

Key Responsibilities

  • Apply comprehensive knowledge of High Performance Computing (HPC) systems comprising high-speed, multi-petabyte Lustre file systems, Red Hat Enterprise Linux (RHEL) servers, CPU/GPU compute nodes, and high-performance storage arrays, connected via Ethernet, fiber, Omni-Path, and InfiniBand interconnects.
  • Provide functional and technical expertise in support of user-developed software, and offer technical advice and leadership to other technical staff.
  • Utilize a wide variety of skills in system and network monitoring; large-scale systems administration; scripting and automation; security compliance; network distributed services; storage and backups; and hardware and software problem diagnosis and resolution.
  • Diagnose and troubleshoot technical problems, often of a complex nature, associated with computer hardware and software interrelationships and dependencies.
  • Conduct needs analysis, and plan and schedule the installation of a wide variety of new or modified hardware and software.
  • Develop functional and technical IT system requirements and specifications. Configure and optimize system tools and applications, including job schedulers (Slurm and PBSPro) and system resources (GitLab, Lua/Tcl modules, and system support applications).
  • Create and deliver technical presentations to technical and non-technical stakeholders. Maintain detailed documentation of system configurations, procedures, and troubleshooting guides. Develop user-facing documentation.

Required Qualifications

  • Education: Bachelor's degree in Computer Engineering, Computer Science, or a related field, and ten or more years of job-related experience.
  • Thorough knowledge of complex concepts, practices, and troubleshooting associated with HPC cluster systems design, installation, and maintenance.
  • Advanced knowledge of distributed computing theory, parallel processing, applications, and associated infrastructure.
  • Extensive experience with Linux/Unix systems including installation, configuration, networking, backups, updates and patching, data archiving, and system security.
  • Functional knowledge of HPC middleware and platform managers such as Bright Cluster Manager; of job schedulers such as PBS, Slurm, and Torque; and of job queue optimization.
  • Experience with HPC or large-scale distributed computing environments and technologies such as high-speed, low-latency interconnects (e.g., InfiniBand), parallel file systems (e.g., Lustre), and virtualization environments and tools (e.g., VMware).
  • Experience developing Python, Bash, and Perl scripts and employing automation frameworks such as Ansible.
  • General knowledge of Docker containers and the Kubernetes ecosystem.
  • Working knowledge of one or more programming languages (e.g., C/C++, Fortran).
  • Requirements: Active Top Secret DoD security clearance (U.S. Citizenship Required).

Benefits & Perks

  • Benefits: Employee Referral Bonus eligibility (a Level 2 bonus applies for qualifying referrals) and comprehensive benefits supporting health, well-being, and professional growth.

Required Skills

Red Hat Enterprise Linux (RHEL)
Lustre file systems
System optimization and troubleshooting
CPU/GPU compute nodes
Docker and Kubernetes
Parallel processing
Job schedulers (Slurm, PBSPro)
Cluster management
High Performance Computing (HPC) systems
Automation (Ansible)
Scripting (Python, Bash, Perl)
Linux/Unix systems administration