Site Reliability Engineer (SRE) / DevOps Engineer

Improving Corporate Services

4 months ago

Raleigh, NC, United States

Hybrid

Full-time

Junior Level (1-3 years)

Job Description

Position Overview

This position joins a modern infrastructure and platform team responsible for building, operating, and continuously improving a cloud-based engineering platform that supports an AI-enabled product ecosystem. The mission goes beyond system availability: the team designs resilient, scalable foundations that enable rapid product delivery while maintaining operational excellence and performance.

You’ll partner closely with software engineering teams on release strategies, automation, observability, reliability engineering, and performance optimization. This role carries real ownership of platform architecture and tooling decisions that directly influence scalability, developer productivity, and system stability. The environment emphasizes automation-first practices and AI-accelerated development workflows, and the team continually evolves its infrastructure capabilities to support high deployment velocity, intelligent tooling, and long-term platform reliability.

Key Responsibilities

Design, build, and operate scalable cloud infrastructure in AWS or Azure with reliability, security, and automation as core principles
Implement and maintain infrastructure-as-code using Terraform, managing environments as versioned, testable systems
Build and optimize CI/CD pipelines to enable safe, fast, and repeatable deployments
Establish observability practices including monitoring, alerting, and distributed tracing that enable rapid incident detection and response
Optimize PostgreSQL performance through schema design, query tuning, indexing strategies, and capacity forecasting
Operate containerized workloads using Kubernetes or AWS ECS, supporting deployment automation and runtime stability
Collaborate with product engineering teams on resiliency patterns, release strategies, incident response, and post-incident learning
Promote a culture where operational reliability and developer velocity reinforce one another

Required Qualifications

2–3 years of hands-on experience in Site Reliability Engineering, DevOps, platform engineering, or infrastructure roles supporting production environments
Strong working experience with cloud platforms such as AWS or Azure
Experience building and managing infrastructure-as-code, preferably using Terraform
Working knowledge of container orchestration platforms such as Kubernetes or AWS ECS
Solid database fundamentals including SQL development, schema design, performance tuning, and query optimization
Programming experience in at least one language (Python, Go, or TypeScript preferred; depth of skill matters more than the specific language)

Preferred Qualifications

Experience with observability tooling such as OpenTelemetry, Prometheus, Datadog, or similar platforms
Experience optimizing systems under constraints (cost efficiency, latency, scalability, or resource utilization)
Contributions to open-source infrastructure, automation tooling, or reliability engineering projects
Background supporting environments with high deployment frequency and fast iteration

Required Skills

Resiliency Engineering

TypeScript

CI/CD Pipelines

Container Orchestration (Kubernetes, AWS ECS)

Site Reliability Engineering

Cloud Infrastructure (AWS/Azure)

PostgreSQL Performance Tuning

Incident Response

DevOps

Infrastructure as Code (Terraform)

Observability (Monitoring, Alerting, Distributed Tracing)

Automation

Python