DevOps Engineer (Observability Products)
About the Company
Morgan Stanley is a leading global financial services Firm providing a wide range of investment banking, securities, investment management and wealth management services. The Firm’s employees serve clients worldwide, including corporations, governments and individuals, from more than 747 offices in 42 countries. As a market leader, the talent and passion of our people are critical to our success. Together, we share a common set of values rooted in integrity, excellence and a strong team ethic. Morgan Stanley can provide a superior foundation for building a professional career - a place for people to learn, to achieve and to grow. A philosophy that balances personal lifestyles, perspectives and needs is an important part of our culture.
The ESM (Enterprise System Management) department is responsible for all aspects of managing and maintaining a set of enterprise-hosted observability products that are used throughout Morgan Stanley's global technology environment.
These include open source tools, third-party vendor tools and proprietary software, which together provide options for metrics, logs and traces, with alerting and visualization interfaces. ESM uses a blend of on-prem Morgan Stanley infrastructure and externally hosted cloud environments. Our services are critical to the stable running of Morgan Stanley's large technology footprint across Linux, Windows, web stack, containerized and public cloud platforms, and more.
The DevOps Site Reliability Engineering (SRE) team for ESM drives the reliability, recoverability and operational efficiency of this product portfolio. Reporting to the SRE Lead, key features of this role include defining how the observability toolchain itself can be monitored (monitor-the-monitor), troubleshooting complex systems, task automation, and technical debt management. The federated metric and logging plants are of considerable size and require innovative ways of maintaining them with the lowest possible effort. Trading systems and other front office functions have a real-time dependency on their observability and require high levels of reliability, with fast incident response when issues arise.
Candidates will have the technical skills required to support observability products, predominantly on a Linux platform. Prior task automation experience in at least one programming language is expected. Hands-on experience with at least one pillar of observability is required, ideally including experience in defining system monitoring and tuning thresholds, not just reacting to alerts. Prior experience of high severity/priority incidents is a distinct advantage.
- Building and maintaining knowledge front to back of Morgan Stanley's observability suite of products, and then specializing in two or three of them
- Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with product delivery engineers
- Reduction of the cost of support (hours of effort) through the elimination of operational issues, optimization and automation of tasks, development of operational tools and driving client self-service to minimize constraints
- Identification and prioritization of technical debt that risks instability or creates wasteful operational toil
- Providing Asia time zone coverage for the two or three most critical observability products prior to India start of day
- Assisting the broader SRE and application development community at Morgan Stanley on the optimal use of available observability solutions (how to choose, how to use)
- Being operationally responsive, including sharing on-call rotation with the rest of a large, global team (with a time-off in lieu system)
Qualifications:
Required Qualifications / Skills
- Strong Linux troubleshooting skills
- Task automation experience in any programming language
- Practical experience of at least one pillar of observability (metrics, logs or traces)
- Working knowledge of at least ONE of the following areas:
o REST services (API)
o Load balancing and networking
o Performance troubleshooting and resolution
o Security/authentication layers such as Kerberos, SiteMinder or SPNEGO and SSL/TLS
o Web app hosting technology, such as Apache and Tomcat
- Confident collaboration skills
Preferred Skills
- Python development for task automation
- Experience with site reliability engineering practices, like service level objectives (SLOs), error budgets, blameless postmortems, toil reduction
- Knowledge of any of these technologies:
o Docker, Kubernetes, Helm
o Ansible, Terraform
Equal Opportunity Statement
“At Morgan Stanley, diversity is an opportunity - for clients, employees and firm. By valuing diverse perspectives, we can better serve our clients, while we help employees achieve their professional objectives. A corporate culture that is open and inclusive is fundamental to our role as a global leader constantly striving for excellence in all that we do.” - James P. Gorman
It is the policy of the Firm to ensure equal employment opportunity without discrimination or harassment on the basis of race, color, religion, creed, age, sex, gender, gender identity or expression, sexual orientation, national origin, citizenship, disability, marital and civil partnership/union status, pregnancy (including unlawful discrimination on the basis of a legally protected pregnancy/maternity leave), veteran status, genetic information, or any other characteristic protected by law.