AVP, Site Reliability Engineer (Core Banking), Group Consumer Banking and Big Data Analytics Technology, Technology & Operations
Business Function
Group Technology and Operations (T&O) enables and empowers the bank with an efficient, nimble and resilient infrastructure through a strategic focus on productivity, quality & control, technology, people capability and innovation. In Group T&O, we manage the majority of the Bank's operational processes and aspire to delight our business partners through our multiple banking delivery channels.
Key Accountabilities
- Build and maintain Production monitoring and automation solutions
- Automation of manual tasks in a CORE Banking ecosystem
- Implement Site Reliability Engineering principles with regard to performance, reliability, monitoring, alerting and maintenance in the Production environment
- Capacity monitoring & Observability of production Infrastructure, automated alerting, performance monitoring and reporting tools
- Build and implement Service improvements and Machine Learning models
- Manage identified Production applications: identify, measure and periodically report performance trends and KPIs (SLI, SLO and SLA measures), and improve systems performance and the associated performance KPIs
- Production systems performance and KPIs monitoring
- Deployment automation and allied improvements
- Conceptualize, design, develop and maintain production monitoring and Machine Learning based predictive automation solutions/applications in a CORE Banking Production environment.
- Production automation. Automation of manual activities/processes for Production teams.
- SRE. Implement Site Reliability Engineering principles regarding performance, reliability, monitoring and alerting in the Production environment
- Capacity monitoring & Observability of production Infrastructure, automated alerting, performance monitoring and reporting tools. Conduct periodic review of system performance for capacity planning and identification of system improvements
- Build, monitor and maintain Machine Learning models from scratch.
- Develop auto-healing solutions in the production environment to enable efficient and timely service restoration of critical processes through auto-escalation of incidents and non-performant KPIs, together with the underlying remedial actions
- Data handling - ingestion, cleansing, storage, visualization, monitoring & alerting and analytics
- Data analysis: use tools to find patterns in data and come up with optimum solutions that are predictive and provide insights
- Build and implement Service improvements. Identify, measure and report performance trends - SLIs/ SLOs/ SLAs periodically and improve systems performance and associated performance KPIs
- Production batch and incidents trending and measuring systems performance against KPIs
- Automation of system health checks and monitoring of production system SLIs and SLOs to ensure SLAs are met
- Provide continuous monitoring and improvement of systems - job automation, performance tuning, capacity planning.
- Identify persistent or recurring problems and recommend creative solutions.
- Communicate proactively and provide regular updates to stakeholders. Proven ability to communicate with peers and mentor junior developers.
- Ensure preventive and detective measures for the applications are identified and implemented.
- 6 - 12 years of total IT experience, including SRE and Production automation experience in a Banking and Financial Services environment. Experience gained in an SRE team, with a good understanding of SRE concepts and principles regarding performance, reliability, monitoring and alerting.
- 3+ years of experience in a professional production environment as a developer in Python and allied libraries such as Pandas/ Matplotlib/ Seaborn/ Scikit-learn.
- Proven experience of conceptualising, developing and implementing 2 end-to-end predictive Machine Learning models in a production environment, using algorithms such as Regression, Decision Trees/ Random Forest, Bagging and Boosting algorithms, Unsupervised learning algorithms, Time-series etc.
- Proven experience of conceptualizing, implementing or maintaining an ELK-based application (or an equivalent central logging/ monitoring/ predictive application) in a production environment would be an added advantage.
- Production automation. Automation of manual activities/processes for Production teams. (Automation experience required)
- Good experience in running automation and service improvements
- Capacity monitoring & Observability. Good command of production infrastructure, performance monitoring and reporting tools
- Hands-on Engineering/ Development experience working on production systems automation in Banking systems - architecture design, development, integration, customization & implementation.
- Ability to write clear and concise documentation (such as requirements, design and testing procedures)
- Strong technical/ programming skills. Knowledge of additional languages and technologies (NoSQL, Java, Python) is an added advantage.
- Data handling tools
- Software version control tools (Git)
- Experience using and optimizing monitoring and trending systems (Prometheus, Grafana), log aggregation systems (ELK, Splunk) and their agents
- Expert-level experience in conceptualization, design, development, testing, implementation and maintenance of Elasticsearch, Logstash, Grafana/ Kibana, NoSQL and Java applications in a production environment
- Familiar with: MariaDB, application servers such as JBoss, any cloud platform, shell scripting, SQL
- Good to have: working knowledge of ELK and Java development practices, Java, .NET, Oracle, Tivoli, WebSphere MQ, web services, XML, AIX, Linux
- Familiar with applications such as Xcelerate/TBMS systems, StreamServe, WAS, Oracle PL/SQL, MS SQL, Java, Apache Tomcat, AIX, Linux.
- Ability to work with stakeholders to stretch the role in depth and breadth
- Present facts and recommendations effectively in oral and written form
- Good knowledge of development practices and ability to write clear and concise documentation for requirements, design and testing procedures
- Proactive, independent and resourceful; a strong team player who communicates effectively internationally and is used to working closely with remote teams and peers.
- High attention to detail, with a focus on understanding issues and finding solutions
- Possess excellent verbal and written communication skills
- Demonstrate ownership and responsibility in all assignments
We offer a competitive salary and benefits package and the professional advantages of a dynamic environment that supports your development and recognises your achievements.