WazifaME - Best Jobs in MENA Region

The Site Reliability Engineer proactively supports products to keep their performance high, with a continuous focus on improvement. The role involves identifying and resolving the root causes of operational incidents, implementing solutions that enhance stability, and preventing recurrence.

The Site Reliability Engineer creates and maintains the event catalog used to trigger events, and develops both manual remediation approaches and automated workflows to address alerts. They also oversee the deployment of IT services and solutions, ensuring seamless integration with minimal disruption.

WHAT YOU’LL DO

Design, build, and maintain support systems to ensure high availability, scalability, and performance of critical infrastructure.
Lead incident response and root cause analysis for system failures, including problem investigations and coordination with relevant teams.
Implement and manage automation for system provisioning, deployment, self-healing, and performance monitoring to increase operational efficiency.
Establish and monitor service level indicators and objectives (SLIs/SLOs), proactively identify performance issues, and drive continuous improvements in service reliability.
Collaborate with development and operations teams to embed reliability best practices and evolve toward zero-downtime architecture.
Manage and optimize the event catalog, including event definitions, thresholds, remediation actions, and relevance across products.
Develop event response protocols, provide training, and ensure efficient handling of incidents across teams.

Site Reliability Engineer

Job Summary
