Site Reliability Engineer
Vanguard Australia has been helping investors achieve their long-term financial goals for over 20 years. Serving institutional and individual clients, and financial advisers, we offer investment solutions that are low-cost, diversified and robust through time.
With more than AUD $6 trillion in assets under management, Vanguard is one of the world's largest global investment management companies. In Australia we partner with institutional clients, financial advisers and individual investors to offer low-cost investment solutions. Our comprehensive range of managed funds, exchange traded funds (ETFs) and tailored investment solutions is built to support long-term investment success for our clients.
The Opportunity
As a Site Reliability Engineer at Vanguard, you'll have the opportunity to put your operational savvy and engineering skills to work! On the job you'll be ensuring the "-ilities" (Availability, Reliability, Scalability, Usability, etc.) of our private and public cloud platforms in both test and production environments. You'll respond to incidents, apply upgrades to the platform and leverage a strategic mindset to "automate all the things".
Additionally, you can anticipate working with real-time monitoring and diagnostic data, analysing trends, and planning for future infrastructure needs. As a caretaker of these systems, you'll be collaborating and planning activities with our internal development teams to ensure that application service level objectives are met. As the name might suggest, a passion for reliability is a must!
On the job you'll be...
- Maintaining, upgrading, and patching key systems in test and production environments.
- Managing communications and coordinating change events with key stakeholders.
- Identifying and resolving reliability issues and implementing long-term mitigation strategies - ideally through automation.
- Responding to production incidents and availability needs.
- Facilitating and documenting platform post-mortems.
- Training and mentoring junior staff members on reliability practices, processes and technologies.
- Participating in an off-hours on-call rotation.
What we are looking for
- Ensures reliable operation of production and test environments.
- Diagnoses and troubleshoots availability interruptions and other production issues.
- Plans and coordinates enterprise-wide infrastructure and reliability projects with other IT and client teams.
- Communicates with teams to keep them apprised of status and issues. Contacts vendors to resolve technical issues.
- Tests, installs, and migrates software, patches, upgrades, applications, and/or hardware.
- Develops technical standards. Tests and evaluates IT vendor products.
- Writes documentation, including project plans, installation procedures, and troubleshooting tips. Creates diagrams, including technical topology.
- Maintains, monitors, and tunes production system and application performance. Debugs source code and performance problems and/or provides debugging assistance to developers.
- Identifies opportunities to improve system and application performance (e.g. automating manual system tasks).
- Trains and mentors staff. Resolves complex issues elevated from staff with less experience.
- Adds, updates, and closes IT Problem Management database records. Researches and resolves complex issues, and reviews related technology records to mitigate impact on assigned systems.
- Reviews numerous IT knowledge repositories to update technical knowledge.
- Learns and understands client area business functions and requirements. Has the ability to determine the appropriate technical tool to address the client's business needs.
- Thoroughly understands and complies with IT & information security policies and procedures, especially those for quality and productivity standards that enable the team to meet established client service levels.
- Bachelor's degree preferred, or equivalent technical experience
- An understanding of, and practical experience with, containerisation frameworks (Pivotal Cloud Foundry, ECS/Fargate, Heroku, Kubernetes, Docker)
- Experience being part of, or leading, agile development teams
- Experience working with Concourse, Jenkins, and/or Bamboo CI/CD pipelines
- Understanding of monitoring/telemetry solutions (Splunk, ELK, AppDynamics, etc.), including data ingestion and analysis
- Knowledge of Linux/Unix systems
- Passion for problem solving and strategic thinking, and a desire to own and execute
- Experience dealing with production issues
- Understanding and application of at least one scripting language (Shell, PHP, Python, etc.) in pursuit of automation
- Experience with configuration automation (Chef, Ansible, Puppet)
- Experience implementing and maintaining distributed applications and systems (Microservices, 12-factor app)
- A flexible schedule: some activities you'll be performing may require off-hours or weekend support
Vanguard's continued commitment to diversity and inclusion is firmly rooted in our culture. Every decision we make to best serve our clients, crew (internally employees are referred to as crew), and communities is guided by one simple statement: "Do the right thing."
We believe that a critical aspect of doing the right thing requires building diverse, inclusive, and highly effective teams of individuals who are as unique as the clients they serve. We empower our crew to contribute their distinct strengths to achieving Vanguard's core purpose through our values.
When all crew members feel valued and included, our ability to collaborate and innovate is amplified, and we are united in delivering on Vanguard's core purpose.
Our core purpose: To take a stand for all investors, to treat them fairly, and to give them the best chance for investment success.