Site Reliability Engineer/Vice President, Engineering
Drive the convergence of workflow platforms across the firm to promote process consistency and allow for the gathering and analysis of metrics.
The Workflow Engineering team builds world-class technology solutions for automating all kinds of critical business processes across Goldman Sachs. Our platforms manage millions of tasks and business decisions, and run tens of thousands of workflows daily in order to guarantee that vital business operations run on time.
RESPONSIBILITIES AND QUALIFICATIONS
HOW YOU WILL FULFILL YOUR POTENTIAL
- Own the runtime environment of a large-scale, globally distributed platform (1800+ machines)
- Develop a forward strategy to migrate to a hybrid cloud runtime
- Balance feature development velocity and reliability with well-defined SLOs
- Exercise autonomy to prioritize and escalate in order to achieve stated site reliability outcomes
- Create sustainable systems, services and development practices to keep the estate scalable, resilient and available
- Proactively engage and guide development teams to improve the lifecycle of developing and managing highly available systems by applying SRE principles
- Bring a passion for managing operational risk and debugging intricate problems across a distributed stack
SKILLS AND EXPERIENCE WE ARE LOOKING FOR
- 7+ years of experience developing distributed services, deployed across a small to medium runtime estate (200+ machines)
- BS/MS degree in Computer Science or a related technical field involving coding and/or systems engineering
- Proficiency in one or more of the following: Java, C++, Python
- Deep understanding of Java threading models, JVM performance and tuning
- Hands-on experience with Apache Geode or another distributed, high-availability data caching technology
- Hands-on experience developing, debugging, and optimizing code, as well as with automation
- Advanced troubleshooting and debugging skills with JVM thread dumps, heap dumps, etc.
- Prior experience in an SRE role
- Understanding of distributed databases like MongoDB, Cassandra, or Elasticsearch
- Understanding of containers and container orchestration, e.g. Docker, Kubernetes
- Experience with open-source messaging systems like Kafka, RabbitMQ, etc.
- Understanding of Linux kernel sub-systems
- Working knowledge of AWS solutions and the AWS control plane
- Strong interpersonal skills, drive, and ownership
- Ability to solve novel problems from first principles
- Experience with UI frameworks like Angular