Cloud Infrastructure SRE

Alibaba Cloud

9 months ago

Sunnyvale, California, United States

Remote

Full-time

Junior Level (1-3 years)

Job Description

Position Overview

The Alibaba Cloud Native Observability Team is responsible for observability products including Alibaba Cloud Log Service (SLS), Application Real-Time Monitoring Service (ARMS), and Cloud Monitoring Service (CMS). The team is committed to creating a real‑time, intelligent, and large‑scale observation and analysis platform that drives intelligent operations (AIOps), big data security, and business monitoring to accelerate digital innovation.

Key Responsibilities

Focus on the Alibaba Cloud observability platforms (SLS/CMS/ARMS) in multinational cloud environments, improving system reliability and engineering delivery efficiency through infrastructure automation and scalable operations.
Build automated operations systems by designing a reliability engineering framework that covers change management, capacity planning, and self‑healing mechanisms, implemented via Infrastructure as Code (IaC).
Lead the design of a standardized delivery framework for the observability platform by establishing risk assessment models and error budget mechanisms, and by optimizing quality control with automated toolchains.
Develop an SRE‑based metrics system that continuously optimizes service health assessment models and automates tracking of SLOs/SLIs, so that decisions are driven by observability data.

Required Qualifications

Over 3 years of experience in distributed systems reliability engineering, with a strong understanding of high‑availability architecture design and proficiency in at least one of Python, Go, or Java.
Ability to translate operations experience into automated solutions, and familiarity with a range of observability software and systems.

Preferred Qualifications

Familiarity with core SRE practices including incident review, error budgeting, and chaos engineering, with experience in building automated risk control systems.

Benefits & Perks

Compensation: The pay range for this position is expected to be between $104,400 and $171,000 per year. Base pay may vary based on market location, individual experience, and job‑related skills.
At‑Will Employment: The position is at‑will, and the Company reserves the right to modify base salary or any other discretionary payment or compensation program at any time, based on performance and market factors.

Required Skills

Incident Review

Distributed Systems Reliability Engineering

Observability Platforms (SLS/CMS/ARMS)

Infrastructure as Code (IaC)

Python/Go/Java Proficiency

Error Budgeting

SRE Practices

High-Availability Architecture Design

Chaos Engineering

Infrastructure Automation