Cloud Infrastructure SRE
Alibaba Cloud
7 months ago
Sunnyvale, California, United States
Remote
Full-time
Junior Level (1-3 years)
Job Description
Position Overview
The Alibaba Cloud Native Observability Team is responsible for observability products including Alibaba Cloud Log Service (SLS), Application Real-Time Monitoring Service (ARMS), and Cloud Monitoring Service (CMS). The team is committed to building a real-time, intelligent, large-scale observability and analysis platform that supports intelligent operations (AIOps), big data security, and business monitoring to accelerate digital innovation.
Key Responsibilities
- Improve the reliability and engineering delivery efficiency of Alibaba Cloud observability platforms (SLS/CMS/ARMS) across multinational cloud environments through infrastructure automation and scalable operations.
- Build automated operations systems by designing a reliability engineering framework that covers change management, capacity planning, and self-healing mechanisms, implemented via Infrastructure as Code (IaC).
- Lead the design of a standardized delivery framework for the observability platforms by establishing risk assessment models and error budget mechanisms, and by optimizing quality control with automated toolchains.
- Develop an SRE-based metrics system that continuously refines service health assessment models and automates SLO/SLI tracking, so that decisions are driven by observability data.
Required Qualifications
- Over 3 years of experience in distributed systems reliability engineering, with a strong understanding of high‑availability architecture design and proficiency in at least one of Python, Go, or Java.
- Ability to translate operations experience into automated solutions, and familiarity with a range of observability software and systems.
Preferred Qualifications
- Familiarity with core SRE practices including incident review, error budgeting, and chaos engineering, with experience in building automated risk control systems.
Benefits & Perks
- Compensation: The pay range for this position is expected to be between $104,400 and $171,000 per year. Base pay may vary based on market location, individual experience, and job‑related skills.
- At‑Will Employment: The position is at‑will, and the Company reserves the right to modify base salary or any other discretionary payment or compensation program at any time, based on performance and market factors.
Required Skills
Incident Review
Distributed Systems Reliability Engineering
Observability Platforms (SLS/CMS/ARMS)
Infrastructure as Code (IaC)
Python/Go/Java Proficiency
Error Budgeting
SRE Practices
High-Availability Architecture Design
Chaos Engineering
Infrastructure Automation