Job Description
Essential Responsibilities:
Actively monitor and analyze system metrics to ensure the availability, performance, and reliability of digital platforms and applications.
Diagnose and resolve complex system issues, perform root cause analysis, and implement long-term fixes to prevent recurrence.
Create and maintain automation scripts, tools, and processes to streamline operations, reduce manual effort, and enhance reliability.
Configure and improve monitoring and alerting tools to provide actionable insights into system health and performance.
Analyze system usage trends and forecast future resource requirements to ensure scalability and prevent capacity-related issues.
Work with development teams to design and implement reliable, fault-tolerant systems, incorporating best practices for high availability.
Ensure smooth and reliable software releases by managing continuous integration and co...