We are hiring for an exciting opportunity as a Lead Site Reliability Engineer! Support the hardware/software/network technologies environment by proactively monitoring and quickly responding to hardware/software/network incidents for one or more technologies within the technical area of expertise. Frequently collaborate with vendor/contractor partners to develop and implement detailed design, configuration, and engineering strategies/solutions to resolve issues/incidents while remaining focused on security, up-time and performance. Provide troubleshooting and resolution to routine/semi-complex problems. Come and apply today!
Responsibilities
- Develop useful telemetry, alerts, and responses to reduce Mean Time To Repair (MTTR)
- Collaborate and provide technical excellence within and across teams
- Consult on standard processes and develop tools to enable smooth adoption of good service reliability practices and methods
- Identify areas of improvement in reliability, efficiency, and operations
- Build tools to help your SRE team quickly pinpoint, isolate, and resolve issues related to infrastructure, platform services, and applications
- Continuously refine monitoring processes, configurations, and thresholds
- Develop runbooks and tools to streamline processes and shorten problem resolution time
- Write code that improves scalability, performance, maintainability, and security
- Add, tune, and maintain alert configurations and documentation as needed
- Cultivate full-team participation in high-quality, thoughtful software
- Develop and improve CI/CD processes to improve release cadence and success
- Use Chaos Engineering principles and methodologies to test what you build under real-world conditions
- Mentor SREs in technical and non-technical SRE responsibilities
- Implement monitoring and instrumentation for on-prem and cloud-based applications using various monitoring tools (Splunk / Sumologic, Prometheus and Grafana)
- Excellent communication skills, both verbal and written
- Passionate and curious about ways to leverage technology while continually learning
- Ability to identify root-cause sources of instability in high-traffic, large-scale distributed systems
- Experience in designing, building, and operating large-scale production systems
- Proficient with containers in enterprise production environments (e.g., Docker, Kubernetes, LXC, AWS ECS and EKS)
- Ensure the uptime and response time SLAs/OLAs for services are met or exceeded. Proactively monitor the stability and performance of various technologies within area of expertise and take appropriate corrective action prior to an incident or problem occurring. Ensure patching and regular maintenance is performed as required. Actively collaborate with fellow members of the team and contractors/vendors on bridge calls to prevent or resolve incidents/problems in an expeditious manner.
- Bachelor's degree or equivalent in Computer Science, MIS, or related field.
- 5-7 years of relevant experience required.
- 2+ years of experience providing day-to-day oversight/supervision to a team of technical employees and/or vendor partner resources.
- 3-5+ years of broad technical experience with proven expertise in several of the following areas: servers, networks, hardware, operating systems (Windows, UNIX, Solaris, Linux, AIX), virtualization software, middleware and related base build infrastructure and software.
- Experience and subject matter expertise in the web and distributed computing environment, as well as mainframe experience.
- 2+ years of experience and proven success identifying and implementing opportunities for improvement to configurations, procedures, and processes to enable greater availability, capability, and efficiency.
- Certifications preferred: ITIL Foundations
- Programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, Ruby.
- Expertise in creating a DevOps strategy by implementing CI/CD of code with tools such as version control, Jenkins, and Maven, and configuration management tools such as Chef, Ansible, and Puppet.
- Ability to work with microservices architectures, building applications and their components to be independently scalable, versionable, and deployable.
- Experience in ServiceNow
- Experience with service-oriented architectures
- Advanced knowledge of TCP/IP networking, architecture, and core technologies (such as DNS, DHCP, HTTP, Routing, VPN)
About Our Company
The Ameriprise Financial Technology team mission is to create innovative technology solutions and engaging digital experiences for our clients, advisors, and employees. We embrace an inclusive and collaborative culture that allows us to partner across the business and lend our expertise in the areas of corporate computing, network infrastructure and security. We celebrate the unique qualities and reward the contributions of our talented, passionate employees. If you're motivated and want to work for a strong, ethical company that cares about you and your community, take the next step with Ameriprise Technology.
Ameriprise Financial is an equal opportunity employer. We consider all qualified applicants without regard to race, color, religion, sex, national origin, genetic information, age, sexual orientation, citizenship, gender identity, disability, veteran status, marital status, family status or any other basis prohibited by law.