Saas Talent

Site Reliability Engineer II

Job Description

Are you ready to make your mark with a true industry disruptor? ZineOne, a subsidiary of Session AI, the pioneer of in-session marketing, is looking to add talented team members to help us grow into the premier revenue tool for e-commerce. We work with some of the leading brands nationwide and we innovate how brands connect with and convert customers.

This position offers a hands-on, technical opportunity as a vital member of the Site Reliability Engineering Group. Our SRE team is dedicated to ensuring that our Cloud platform operates seamlessly, efficiently, and reliably at scale. The ideal candidate will bring over five years of experience managing cloud-based Big Data solutions, with a strong commitment to resolving operational challenges through automation and sophisticated software tools.

Candidates must uphold a high standard of excellence and possess robust communication skills, both written and verbal. A strong customer focus and deep technical expertise in areas such as Linux, automation, application performance, databases, load balancers, networks, and storage systems are essential.

Key Responsibilities:

As a Session AI SRE, you will:

Design and implement solutions that enhance the availability, performance, and stability of our systems, services, and products
Develop, automate, and maintain infrastructure as code for provisioning environments in AWS, Azure, and GCP
Deploy modern automated solutions that enable automatic scaling of the core platform and features in the cloud
Apply cybersecurity best practices to safeguard our production infrastructure
Collaborate on DevOps automation, continuous integration, test automation, and continuous delivery for the Session AI platform and its new features
Manage data engineering tasks to ensure accurate and efficient data integration into our platform and outbound systems
Utilize expertise in DevOps best practices, shell scripting, Python, Java, and other programming languages, while continually exploring new technologies for automation solutions
Design and implement monitoring tools for service health, including fault detection, alerting, and recovery systems
Oversee business continuity and disaster recovery operations
Create and maintain operational documentation, focusing on reducing operational costs and enhancing procedures
Demonstrate a continuous learning attitude with a commitment to exploring emerging technologies

Preferred Skills:

Experience with cloud platforms such as AWS, Azure, and GCP, including their management consoles and CLIs
Proficiency in building and maintaining infrastructure on:
- AWS, using services such as EC2, S3, ELB, VPC, CloudFront, Glue, and Athena
- Azure, using services such as Azure VMs, Blob Storage, Azure Functions, Virtual Networks, Azure Active Directory, and Azure SQL Database
- GCP, using services such as Compute Engine, Cloud Storage, Cloud Functions, VPC, Cloud IAM, and BigQuery
Expertise in Linux system administration and performance tuning
Strong programming skills in Python, Bash, and Node.js
In-depth knowledge of container technologies like Docker and Kubernetes
Experience with real-time big data platforms built on technologies such as HDFS/HBase, ZooKeeper, and Kafka
Familiarity with centralized logging systems such as ELK (Elasticsearch, Logstash, Kibana)
Competence in implementing monitoring solutions using tools like Grafana, Telegraf, and InfluxDB

Skills

Cloud platform
Node.js
Python
Database
SRE
Linux System Administration

Education

Master's Degree
Bachelor's Degree

Job Information

Job Posted Date

Aug 09, 2024

Experience

5-10 Years

Compensation (Annual in Lacs)

₹ Market Standard

Work Type

Permanent

Type Of Work

8 hour shift