Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. We’re looking for a candidate who knows how to apply engineering principles to operations. You have demonstrable experience managing or developing multi-tenanted & multi-cloud solutions. You are well versed in a large number of technologies and welcome new tools and techniques. You work in conjunction with fellow developers and operations members to come to the best possible solution. You are always looking for patterns and ways to increase efficiency, eliminate downtime, optimize costs, and maintain performance at scale.
What you're good at
- Evangelize SRE mindset and solve problems through systematization.
- Develop predictive analytics, dashboards and monitoring.
- Develop high-quality, industry-standard CI/CD orchestration systems to reduce friction in software delivery.
- Provide real-time troubleshooting of application workflows and incorporate feedback to development.
- Ensure a high degree of availability across all of our service offerings.
- Build exhaustive proactive monitoring capabilities across all channels that alert before a customer escalates.
- Work with the development teams to design scalable, robust systems using cloud architecture.
- Build automation using industry tools (like Bamboo, Jenkins, Puppet, Ansible, etc.) to deploy applications and services.
- Manage, support and build applications within PCF.
- Develop applications and tools using Java, C#, Python, Angular and others.
- Identify bottlenecks and problems throughout the infrastructure.
- Prefer building automation for repetitive tasks over handling toil manually.
- Enjoy pushing scalability to the limit with high throughput services.
- Design solutions with failure in mind to ensure reliability.
- Like looking through metrics and logs as if on a treasure hunt.
- Avoid logging into servers directly and prefer using automation and aggregation to manage them.
- Strive to be a responsible enabler rather than a "gate."
What you have
- Extensive CI/CD experience with Bitbucket, Bamboo, SonarQube, Veracode and Nexus.
- Linux & Windows 2016 system administration, troubleshooting and tuning.
- Demonstrated understanding of full stack infrastructure environments: Networking, Storage, Compute and Cloud.
- GCP, Azure, AWS or equivalent cloud platform experience.
- Extensive experience supporting Internet/Web delivered application platforms.
- Full stack feature/tool development experience at a senior level.
- Understanding of software theory and algorithms, computational theory and practice, software engineering lifecycle/design, database design and data storage, network security and architecture, operating systems and ethical programming.
- Demonstrated experience with modern versions of PowerShell.
- Expert-level experience in complex PCF environments.
- Experience with containers and HA clusters; experience with Docker and Kubernetes a plus.
- Experience writing SQL queries, T-SQL stored procedures and views.
- Knowledge of IP networking including TCP, UDP, DNS, DHCP, firewalls, IP routing, etc.
- Knowledge of distributed systems and messaging architectures.
- Sound automation platform experience with one or more tools: Puppet, Chef, Ansible, SaltStack.
- Comfortable writing extensive documentation and architectural diagrams.