Saas Talent

Staff Site Reliability Engineer

Job Description

Get to know Okta

Okta is The World’s Identity Company. We free everyone to safely use any technology—anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transforms how people move through the digital world, putting Identity at the heart of business security and growth.

At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box; we’re looking for lifelong learners and people who can make us better with their unique experiences.

Join our team! We’re building a world where Identity belongs to you.

As a Staff Site Reliability Engineer, you will champion all things pertaining to reliability at Okta for Auth0. Working closely with the Product Engineers, Quality Engineers, Platform Engineers and Architecture teams, you will focus primarily on keeping production systems operational at all times, while continually setting and achieving long-term performance, reliability and scalability goals on a platform with an exponential growth plan for the coming years.

With Okta’s increased dedication to ensuring customer availability expectations are exceeded in every way, you will play a key role as we evolve our system architecture to meet the demands of enormous growth and support the hundreds of millions of users who rely on us to provide uninterrupted access to business-critical enterprise and consumer applications.

Skills

Exceptional communication skills, including technical writing in the English language
Systematic problem-solving approach, coupled with a strong sense of ownership and drive
Understanding of microservices, cloud infrastructure (AWS, Azure), databases (SQL, NoSQL, key/value), containers (Docker, Kubernetes), web technologies (WebSockets, HTTP) and networking (SSL, routing, VPN)
Deep familiarity with SLIs, SLOs, error budgets and SLAs
Strong belief in automating everything and reducing toil for yourself and teammates
Enjoyment of teamwork, with the ability to work effectively in a remote environment where tasks may be self-driven

Responsibilities

Work with other teams to run, own and improve incident response processes
Participate in regular on-call rotations to ensure 24/7 coverage of all critical systems
Use existing monitoring tools to identify problems, then resolve them or escalate to service teams
Implement changes to enable or improve infrastructure resilience, monitoring, and alerting

Experience

7+ years as a Site Reliability Engineer or in a Cloud Operations/DevOps role
6+ years using Go, shell scripting and Terraform
2+ years as a software developer in a SaaS environment
4+ years in a production environment supporting large-scale, mission-critical applications

Skills

Cloud Operations
DevOps
Site Reliability Engineering
SaaS
Software Development

Education

Master's Degree
Bachelor's Degree

Job Information

Job Posted Date

Feb 25, 2025

Experience

7 to 10 Years

Compensation (Annual in Lacs)

Best in the Industry

Work Type

Permanent

Type Of Work

8-hour shift