Site Reliability Engineer (SRE) / DevOps Engineer

Improving Corporate Services
4 months ago
Raleigh, NC, United States
Hybrid
Full-time
Junior Level (1–3 years)

Job Description

Position Overview

This position joins a modern infrastructure and platform team responsible for building, operating, and continuously improving a cloud-based engineering platform that supports an AI-enabled product ecosystem. The mission goes beyond system availability: the team designs resilient, scalable foundations that enable rapid product delivery while maintaining operational excellence and performance.

You’ll partner closely with software engineering teams on release strategies, automation, observability, reliability engineering, and performance optimization. This role carries real ownership of platform architecture and tooling decisions that directly influence scalability, developer productivity, and system stability. The environment emphasizes automation-first practices and AI-accelerated development workflows, and infrastructure capabilities evolve to support high deployment velocity, intelligent tooling, and long-term platform reliability.

Key Responsibilities

  • Design, build, and operate scalable cloud infrastructure in AWS or Azure with reliability, security, and automation as core principles
  • Implement and maintain infrastructure-as-code using Terraform, managing environments as versioned, testable systems
  • Build and optimize CI/CD pipelines to enable safe, fast, and repeatable deployments
  • Establish observability practices including monitoring, alerting, and distributed tracing that enable rapid incident detection and response
  • Optimize PostgreSQL performance through schema design, query tuning, indexing strategies, and capacity forecasting
  • Operate containerized workloads using Kubernetes or AWS ECS, supporting deployment automation and runtime stability
  • Collaborate with product engineering teams on resiliency patterns, release strategies, incident response, and post-incident learning
  • Promote a culture where operational reliability and developer velocity reinforce one another

Required Qualifications

  • 2–3 years of hands-on experience in Site Reliability Engineering, DevOps, platform engineering, or infrastructure roles supporting production environments
  • Strong working experience with cloud platforms such as AWS or Azure
  • Experience building and managing infrastructure-as-code, preferably using Terraform
  • Working knowledge of container orchestration platforms such as Kubernetes or AWS ECS
  • Solid database fundamentals including SQL development, schema design, performance tuning, and query optimization
  • Programming experience in at least one language (Python, Go, or TypeScript preferred; depth of skill matters more than the specific language)

Preferred Qualifications

  • Experience with observability tooling such as OpenTelemetry, Prometheus, Datadog, or similar platforms
  • Experience optimizing systems under constraints (cost efficiency, latency, scalability, or resource utilization)
  • Contributions to open-source infrastructure, automation tooling, or reliability engineering projects
  • Background supporting environments with high deployment frequency and fast iteration

Required Skills

Resiliency Engineering
TypeScript
CI/CD Pipelines
Container Orchestration (Kubernetes, AWS ECS)
Site Reliability Engineering
Go
Cloud Infrastructure (AWS/Azure)
PostgreSQL Performance Tuning
Incident Response
DevOps
Infrastructure as Code (Terraform)
Observability (Monitoring, Alerting, Distributed Tracing)
Automation
Python