Principal Site Reliability Engineer

Oracle
3 months ago
Boise, ID, United States
Remote
Full-time
Junior Level (1-3 years)

Job Description

About Oracle Cloud:

Oracle Cloud is a comprehensive suite of cloud services (infrastructure, platform, and applications) designed to help organizations build, deploy, and manage workloads securely at scale. At Oracle, we are building the most intelligent future of cloud computing. Our team is composed of talented, motivated, and diverse individuals committed to empowering our customers to accomplish their most important missions using Oracle Cloud Fusion Applications. We center our work around our customers' needs, striving to continuously enhance our cloud capabilities based on their challenges.

About the Team:

Join the Fusion Site Reliability Engineering Middleware (FSRE-MW) team, a critical group dedicated to maintaining the high availability of Oracle's Cloud Fusion Applications. We minimize the frequency and duration of customer-impacting events through large-scale incident management and automation. As a team, we combine the agility of a start-up with the scale and customer focus of a leading enterprise software company.

As a Principal Site Reliability Engineer, you will be a key member of a high-impact team focused on the availability, performance, and operational excellence of Fusion SRE Middleware. You will take ownership of production environments, including systems and the Fusion Middleware stack, and support mission-critical business operations for Cloud Fusion Applications. Your role will emphasize automating and optimizing operations across multiple production environments and recommending AI-driven solutions to enhance availability, performance, and supportability. You will harness AI-based tools and predictive analytics to proactively identify issues, automate incident responses, and continuously improve system resilience. Additionally, you will provide escalation support for complex production problems, guide junior engineers, participate in major incident bridges, and help build and refine processes and procedures using AI-powered insights to drive smarter, data-driven decisions.

Our team is front and center in reducing event duration, drawing on operational experience, best practices, and tool development to automate incident management and drive continual improvement.

About the Role:

We seek a Principal SRE to join our globally distributed team, responsible for detecting, triaging, and mitigating service-impacting events rapidly and effectively through automation and AI-powered insights.

Required Skills

Performance Optimization
AI Solutions
Predictive Analytics
Process Improvement
Site Reliability Engineering
Middleware
Automation
Cloud Computing
Incident Management