VP / AVP, Senior SRE, Group Consumer Banking and Big Data Analytics Technology, Technology & Operations
Business Function
Group Technology and Operations (T&O) enables and empowers the bank with an efficient, nimble and resilient infrastructure through a strategic focus on productivity, quality & control, technology, people capability and innovation. In Group T&O, we manage the majority of the Bank's operational processes and aspire to delight our business partners through our multiple banking delivery channels.
Responsibilities
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. This position is for a Site Reliability Engineer responsible for developing and implementing the processes needed to improve application and system reliability, along with operational support. The position places approximately equal focus on software development and operations. The engineer will also develop software to automate operational processes and write code for the shared engineering backlog deliverables.
- Build and maintain enterprise observability infrastructure
- Establish SLIs and SLOs for enterprise applications, and calculate error budgets, MTTD, and MTTR. Educate the Dev community on observability, foster an observability culture, and assist teams in identifying golden signals.
- Take responsibility for the availability, performance, change management, monitoring, and capacity management of assigned services.
- Manage and troubleshoot business-critical incidents, conduct postmortems, and ensure permanent closure of incidents.
- Analyze patterns of production incidents, develop permanent remediation plans, and implement automation to prevent future incidents from occurring through software engineering
- Implement and integrate microservice applications with monitoring/logging tools such as ELK, Grafana, AppDynamics, Alog, etc.
- Engage with both the development and support teams throughout the life cycle to help build for reliability. Collaborate closely with them to maintain and improve the service against established Service Level Objectives by applying software engineering principles.
- Contribute to the design and architecture of a highly resilient microservice application based on an open-source stack. Enhance, optimize and migrate to new solutions if required.
- Manage the split of effort between manual operational work and engineering work.
- Work with partner organizations and vendors to provide solutions to current business issues.
- Participate in a shift model covering 24x7x365 support.
We offer a competitive salary and benefits package and the professional advantages of a dynamic environment that supports your development and recognises your achievements.
- At least a degree in Computing / Computer Science / Engineering from a reputable university
- Minimum 3 years of experience in an SRE role, with a good track record in a leadership role and a culture of collaboration and teamwork
- Good knowledge of Linux, Kubernetes, and Python programming
- Experience with ELK, Kafka & Grafana
- Experience in all aspects of technology, such as business applications, middleware, database technology, best practices, quality improvements and productivity improvements
- Experience in designing and architecting a highly resilient microservice application, based on an open-source stack, in Kubernetes or public cloud
- Working experience in production support and improvement, incident management, and automation is a must.
- Experience in identifying golden signals, defining SLIs and SLOs for enterprise applications, and calculating error budgets, MTTD, and MTTR.
- Experience with CI/CD pipelines and tool sets like Bitbucket, Jenkins, SonarQube, JIRA, Nexus, etc.; and with blue/green, feature toggling, ACL, and other deployment methods to mitigate change risk and address special needs.
- Strong problem-solving skills, with the ability to solve unstructured problems and challenge the status quo
- Must be comfortable working in an extremely fast-paced environment, with the ability to prioritise accordingly to meet deadlines
- Strong communication and interpersonal skills. Self-driven, committed, and reliable team player. Ability to contribute to discussions on design and strategy