Partner with product owners and business SMEs to analyze business needs and improve the supportability, scalability, and recoverability of the engineered solution.
Ensure that the overall technical solution is aligned with the business needs and the operational teams' methodologies.
Drive improvements in service availability, using automation to reduce mean time to recovery.
Develop methods for autonomous recovery and self-repairing systems.
Ensure the solution is consistent with RFPIO architecture, design, and development standards.
Coordinate and plan system releases and hotfixes.
Develop methods that simplify triage through checklists, runbooks, and standard operating procedures.
Adopt new methodologies that provide the business with increased flexibility and agility.
Support software development by providing operational improvements to non-functional requirements.
Develop enhancements to improve service levels by leveraging key performance indicators consisting of monitoring, non-functional testing and availability reports.
Provide a service-focused approach leveraging continuous process improvement.
Participate in chaos testing to improve system resiliency.
Mentor other engineers.
Provide overall technical leadership to smaller working teams as needed.
Stay current with the latest development tools, technology ideas, patterns, and methodologies; share knowledge by clearly articulating results and ideas to key stakeholders.
Experience:
3 to 5 years of experience in a Site Reliability Engineering, DevOps, or infrastructure-focused role
Experience supporting internet-facing production services and distributed systems
Ability to implement and coordinate telemetry using monitoring and observability tools such as Splunk, Grafana, or Prometheus
Coding experience in a high-level programming language such as Java or Python
Automation advocate - you truly believe in removing operational load via software
A strong sense of ownership.
Experience managing, scaling, and troubleshooting Java applications
Familiarity with cloud infrastructure concepts (zones, regions, VPCs, etc.)
An understanding of a variety of packaging formats, deployment strategies, and tooling for software services
Working understanding of common authentication schemes, certificates, and securely managing secrets
Capable of designing and implementing automated configuration management processes for repeatable and consistent service deployment
Education:
BS or MS in Computer Science or equivalent industry experience
Knowledge, Skills & Ability:
Prior experience as an SRE, software engineer, DevOps engineer, or system administrator
Experience in system automation technology, such as Ansible