Senior Site Reliability Engineer

Oracle
Posted 3 months ago
Oklahoma City, OK, United States
Remote
Full-time
Junior Level (1-3 years)

Job Description

Position Overview

As a Senior Site Reliability Engineer (SRE), you will play a key role in ensuring the reliability, performance, and scalability of modern cloud-based AI applications for OCI Operations. This position involves close collaboration with development, operations, and security teams to automate processes, develop SRE standards, monitor system health, and maintain optimal uptime for critical AI applications. You will leverage your technical expertise to design, automate, and maintain AI services supporting mission-critical AI and ML initiatives.

Salary: $74,900 to $158,200 per annum. May be eligible for bonus and equity.

Key Responsibilities

  • Design, implement, and maintain scalable, secure cloud infrastructure for AI applications on OCI.
  • Collaborate with Engineering teams to build robust automation for building, deploying, and scaling resilient systems.
  • Implement site reliability engineering best practices including SLO/SLI definition, error budgeting, automated monitoring, data integrity validation, and incident response for services.
  • Identify opportunities and take ownership of automation and continuous improvement initiatives to run highly scalable, reliable systems.
  • Design and optimize highly available services that are resilient to failures or impacts.
  • Automate infrastructure provisioning and CI/CD deployments using tools like Terraform, Ansible, or other IaC frameworks.
  • Instrument and monitor systems for performance, availability, resource consumption, and latency using observability tools (e.g., Grafana, Prometheus).
  • Troubleshoot and resolve complex issues, conducting root cause analyses and post-incident reviews.
  • Solve complex problems related to infrastructure cloud services and automate common tasks to ensure continuous availability with minimal human intervention.
  • Utilize deep understanding of cloud computing design patterns and dependencies to mitigate major incidents.
  • Advocate for and implement security, governance, and compliance best practices.
  • Mentor team members and promote knowledge sharing around SRE practices and standards.

Required Qualifications

  • Bachelor's or Master's in Computer Science, Engineering, or a related field.
  • 6+ years of experience in cloud engineering, SRE, or DevOps roles, including at least 4 years supporting mission-critical systems and/or applications.
  • Experience building high-performance, resilient, scalable, and well-engineered systems.
  • Practical experience designing and operating large-scale cloud-based distributed applications.
  • Strong hands-on skills with infrastructure-as-code (e.g., Terraform), automation (Python/Scala), and containerization (Kubernetes, Docker).
  • Familiarity with AI capabilities including LLMs, RAG, and AI agents.
  • Working knowledge of distributed storage, data formats (Parquet, Avro), and modern analytics platforms.
  • Solid understanding of networking, cloud security, and compliance.
  • Strong analytical, troubleshooting, and communication skills.
  • Experience with disaster recovery, redundancy, and operational uptime planning.
  • Experience with agile software development methodologies.

Preferred Qualifications

  • Preferred certifications: SRE, Cloud Architect/Engineer (OCI, AWS, Azure, GCP), DevOps.
  • Resourcefulness in the face of unique constraints.
  • A habit of iterating on ways to be more productive and effective.
  • Ability to capture and prioritize automation of toil tasks.
  • General problem-solving skills, critical thinking, and attention to detail.
  • Eagerness to learn and to teach.

Benefits & Perks

  • Medical, dental, and vision insurance, including expert medical opinion.
  • Short-term and long-term disability coverage.
  • Life insurance and AD&D.
  • Supplemental life insurance (Employee/Spouse/Child).
  • Health care and dependent care Flexible Spending Accounts.
  • Pre-tax commuter and parking benefits.
  • 401(k) Savings and Investment Plan with company match.
  • Paid time off with flexible vacation policies and accrual based on hours worked.
  • 11 paid holidays.
  • 72 hours of paid sick leave annually, with a rollover cap.
  • Paid parental leave and adoption assistance.
  • Employee Stock Purchase Plan.
  • Financial planning and group legal services.
  • Voluntary benefits including auto, homeowner, and pet insurance.

Required Skills

Python
Terraform
Ansible
Docker
Security and Compliance
Scala
Automation
Kubernetes
Cloud Architecture
CI/CD
Prometheus
Grafana
Incident Response