Site Reliability Engineer
- Define critical performance KPIs, set alert rules, and roll out monitoring dashboards for production, with timely reporting to stakeholders.
- Assist the Prod-Ops team in investigating critical production incidents, produce root cause analyses, and ensure permanent closure of the incidents.
- Analyse patterns in production incidents and set up appropriate alerting/monitoring mechanisms in the system to catch issues beforehand.
- Work closely with solution architects and the application development team to ensure adherence to best practices in design and coding with respect to SRE principles.
- Assist the development team and other relevant teams in tuning applications/configurations for critical systems to comply with the NFRs (non-functional requirements) before going live in production, and ensure the performance recommendations are part of the change request process.
- Participate in and contribute to resiliency validation exercises, with proper reporting.
- Improve application stability and operational efficiency by developing scripts to automate tasks.
- Bachelor's degree in Computer Science with 8 years of equivalent work experience.
- Minimum 2 years of hands-on experience with container technologies such as Red Hat OpenShift, Docker, and Kubernetes, and DevOps tools such as Jenkins, Bitbucket, and JIRA.
- Minimum 2 years of hands-on experience with application monitoring tools such as CA Wily, Grafana, Kibana, and Prometheus.
- 3 years of experience in production support and issue management is a plus.
- Strong analytical and problem-solving skills.
- Strong interpersonal and communication skills.
- Positive attitude towards continuous learning.