SaaS Talent

Site Reliability Engineering Manager

10 Years of Experience

Pune, Maharashtra, India

88882*****

Expected Salary

65

Current Salary

-

Notice Period

Not Available

About

With a robust 9-year tenure in IT, I lead a high-achieving SRE team dedicated to streamlining processes and ensuring comprehensive observability across cloud platforms. As the SRE Manager, I steer the enhancement of system reliability through strategic leadership. Leveraging extensive experience, my focus remains on fine-tuning complex IT infrastructures for peak efficiency and exceptional performance. Our commitment to innovation drives us to explore cutting-edge technology, empowering us to deliver impactful solutions that optimize system performance.

CERTIFICATIONS:
● HashiCorp Certified Terraform Associate 003
● AWS Certified Solutions Architect – Associate
● Gremlin Certified Chaos Engineering Practitioner
● Red Hat Certified System Administrator
● ITIL® Foundation Certificate in ITSM
● All 4 PagerDuty Certifications
● 5 Sumo Logic Certifications

TECHNICAL SKILLS:
● Cloud: Amazon Web Services (AWS)
● Programming: Python
● Infrastructure as Code: Terraform, CloudFormation
● OS: Unix/Linux
● Scripting: Shell Scripting
● Container Orchestration: Kubernetes and Docker
● CI/CD: Jenkins and Git/GitLab
● Observability Tools: Prometheus, Grafana, AppDynamics, Datadog, Sumo Logic
● Incident Management: PagerDuty
● Agile Tools: Polarion, Jira, Azure DevOps

Site Reliability Engineering Manager

Siemens Digital Industries Software, IT Services & Solutions, Information Technology & Services

Past Company 2

Siemens Digital Industries Software

Past Company 3

NICE CXone

Companies Worked:

Siemens Digital Industries Software, Siemens Digital Industries Software, NICE CXone, Amdocs, Tata Consultancy Services

Work History:

Job Title : Site Reliability Engineering Manager
Company name : Siemens Digital Industries Software
Period : October 2023 - Present
Summary : ● Defining and monitoring Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to gauge performance and reliability accurately, while managing Error Budgets to balance innovation with reliability.
● Spearheading automation initiatives to streamline operational workflows, enhance efficiency within the SRE team, and minimize manual interventions.
● Implementing robust monitoring practices (metrics, logs, traces, synthetic monitoring, RUM, APM) to establish comprehensive observability and swiftly detect anomalies, ensuring system reliability.
● Orchestrating incident management processes to promptly mitigate service disruptions, focusing on Mean Time to Acknowledge (MTTA) and Mean Time to Resolve (MTTR). Conducting postmortems and capacity planning to prevent recurrence.
● Overseeing change management processes to minimize disruptions and uphold reliability in production systems.
● Architecting scalable and reliable distributed systems, advocating immutable infrastructure practices, and promoting DevOps methodologies to improve software delivery and reliability.
● Advocating for chaos engineering practices to systematically test system resilience and enhance overall reliability.
● Developing comprehensive disaster recovery plans to ensure business continuity, including risk identification, critical system prioritization, robust backups, and regular testing.
● Collaborating closely with security teams to embed best practices into operational workflows, implementing controls, conducting audits, and ensuring compliance.
● Investing in the professional growth of SRE team members through mentorship, training, performance evaluations, and career development plans to nurture excellence and support long-term aspirations.
● Ensuring effective communication with stakeholders, providing regular updates on system metrics, gathering feedback, and transparently addressing concerns to maintain alignment with business objectives.
Location : Pune, Maharashtra, India

Job Title : Site Reliability Engineer
Company name : Siemens Digital Industries Software
Period : September 2021 - January 2024
Summary : Responsibilities:
● Lead requirement gathering for new software-as-a-service (SaaS) products and onboard them to site reliability engineering (SRE).
● Design and implement a comprehensive SRE strategy for new SaaS products.
● Ensure high availability, performance, and reliability of systems through proactive monitoring and troubleshooting.
● Set up full stack observability tooling (infrastructure, application, business transaction, synthetic monitoring).
● Automate tasks and processes to improve efficiency and reduce toil.
● Implement SRE practices, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
● Participate in incident response and postmortem analysis.
● Assume on-call responsibilities to support production systems.
● Manage and maintain infrastructure, including implementing security measures.
● Collaborate with other teams, such as development and operations, to ensure smooth system operation.
● Implement and maintain continuous integration and delivery pipelines.
● Manage and maintain documentation, including Standard Operating Procedures and Runbooks.
● Participate in capacity planning and resource optimization.
Location : Pune, Maharashtra, India

Job Title : Site Reliability Engineer
Company name : NICE CXone
Period : October 2019 - September 2021
Summary : Responsibilities:
● Implement the SRE solution for the CXone product and lead requirement gathering to identify its reliability needs.
● Automate and develop infrastructure to improve the efficiency and effectiveness of existing systems, with the aim of reducing manual labor (toil).
● Manage systems at scale and ensure the reliability and high uptime of internally critical services and externally visible systems.
● Monitor system capacity and performance constantly.
● Participate in incident response and postmortem analysis.
● Assume on-call responsibilities to support production systems.
● Collaborate with and contribute to other teams within the organization.
● Assist with the deployment of new products and services.
● Facilitate communication and collaboration between the Network Operations Center (NOC) and Research and Development (R&D) teams.
● Introduce Chaos Engineering and conduct drills to test and improve the resilience of systems.
● Implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and an Error Budget to measure and improve the reliability of systems.
Location : Pune Area, India

Job Title : Site Reliability Engineer
Company name : Amdocs
Period : February 2019 - October 2019
Summary : Responsibilities:
● Provide third-level support for the Fraud View application on Unix/Windows platforms.
● Automate repetitive tasks to streamline the application and make it more efficient.
● Troubleshoot, debug, evaluate and resolve computer-identified alarms.
● Perform deep dives into issues and carry out root cause analysis.
● Handle change management, incident management, and problem management for the Globe Telecom service.
● Coordinate with all stakeholders to ensure timely delivery of changes and resolution of production issues.
● Create and maintain documentation for new business processes, knowledge articles and operating procedures.
Location : Pune Area, India

Job Title : Production Support Engineer
Company name : Tata Consultancy Services
Period : September 2014 - February 2019
Summary : Responsibilities:
● Provide second-level support for enterprise service bus applications on Unix/Windows platforms.
● Troubleshoot, debug, evaluate and resolve computer-identified alarms.
● Perform deep dives into issues and take part in root cause analysis meetings.
● Change management, Incident management, Problem management for J

Certifications:

Title : HashiCorp Certified: Terraform Associate (003)
Period : September 2023 - September 2025
Summary : 61d986ac-04c4-4ea7-950f-4e698b8f3dcf, credly.com, https://www.credly.com/badges/61d986ac-04c4-4ea7-950f-4e698b8f3dcf
Issuing Authority : HashiCorp

Title : Sumo Logic Fundamentals Certified
Period : August 2022 - August 2024
Summary : 3s4nqhqfoxpg, skilljar.com, https://verify.skilljar.com/c/3s4nqhqfoxpg
Issuing Authority : Sumo Logic

Title : Incident Responder Certification
Period : June 2022 - Present
Summary : 78d1674c-b510-4c9b-93a3-039a4b3ed672, credly.com, https://www.credly.com/badges/78d1674c-b510-4c9b-93a3-039a4b3ed672?source=linked_in_profile
Issuing Authority : PagerDuty

Title : PagerDuty API Certification
Period : June 2022 - Present
Summary : f4e2d897-af2e-4b91-b26e-c8ebbb75fb0d, credly.com, https://www.credly.com/badges/f4e2d897-af2e-4b91-b26e-c8ebbb75fb0d?source=linked_in_profile
Issuing Authority : PagerDuty

Title : PagerDuty Customer Service Operations Certification
Period : June 2022 - Present
Summary : 5a65de2d-d2a2-4b42-ba98-dd255b6fa57e, credly.com, https://www.credly.com/badges/5a65de2d-d2a2-4b42-ba98-dd255b6fa57e?source=linked_in_profile
Issuing Authority : PagerDuty

Title : PagerDuty Foundational Practitioner Certification
Period : June 2022 - Present
Summary : a8bc205d-5b95-4d80-973c-75cb7213b57b, credly.com, https://www.credly.com/badges/a8bc205d-5b95-4d80-973c-75cb7213b57b?source=linked_in_profile
Issuing Authority : PagerDuty

Title : Prometheus | The Complete Hands-On for Monitoring & Alerting
Period : October 2021 - Present
Summary : UC-26560953-df36-4c29-baf5-fcd128755f79, ude.my, https://ude.my/UC-26560953-df36-4c29-baf5-fcd128755f79
Issuing Authority : Udemy

Title : Gremlin Certified Chaos Engineering Practitioner
Period : June 2021 - Present
Summary : 33883034, credential.net, https://www.credential.net/11608eca-d133-410d-9265-887c38bb1338#gs.4cgjbk
Issuing Authority : Gremlin

Title : Leadership Fundamentals
Period : September 2020 - Present
Summary : linkedin.com, https://www.linkedin.com/learning/certificates/f7ad576585194cedfbbae1ca9c3a9b378fd49d4d5265ca420bab8de9abd39544?trk=backfilled_certificate
Issuing Authority : LinkedIn

Title : Leadership Foundations: Leadership Styles and Models
Period : August 2020 - Present
Summary : linkedin.com, https://www.linkedin.com/learning/certificates/d21e7dc77c49351f05607045872613443a7bf9e6890bd7eea2963214dc75043c?trk=backfilled_certificate
Issuing Authority : LinkedIn

Title : Site Reliability Engineering: Service-Level Agreements and Objectives
Period : June 2020 - Present
Summary : linkedin.com, https://www.linkedin.com/learning/certificates/6157ef71c02158792ff4870fffa9a8915521cac03b8b83c25563a7b08c305b70?trk=backfilled_certificate
Issuing Authority : LinkedIn

Title : DevOps Foundations: Site Reliability Engineering
Period : May 2020 - Present
Summary : linkedin.com, https://www.linkedin.com/learning/certificates/5ca67aa84a10495929ba2fd9eddc461b46b7b5a0ff53f10447f9a779a9eb4e31?trk=backfilled_certificate
Issuing Authority : LinkedIn

Title : Leading without Formal Authority
Period : May 2020 - Present
Summary : linkedin.com, http://www.linkedin.com/learning/leading-without-formal-authority?trk=flagship-lil_details_certification
Issuing Authority : LinkedIn

Title : Red Hat System Administrator
Period : January 2018 - Present
Summary : 180-019-770, redhat.com, https://www.redhat.com/rhtapps/services/verify?certId=180-019-770
Issuing Authority : Red Hat

Title : ITIL® Foundation Certificate in IT Service Management
Period : April 2016 - Present
Summary : GR750235527SD, google.com, https://drive.google.com/file/d/0B5z_VLgwn8lHRnFTMkV4Z3ZRVG1ibHRFTEItNldHZTVKMVZz/view?usp=sharing
Issuing Authority : PeopleCert

Title : AWS Certified Solutions Architect – Associate
Period : November 2020 - November 2023
Summary : 2e0cf673-3e40-4623-a265-c599006f5da5, youracclaim.com, https://www.youracclaim.com/badges/2e0cf673-3e40-4623-a265-c599006f5da5?source=linked_in_profile
Issuing Authority : Amazon Web Services (AWS)

Title : Sumo Logic Metrics Mastery Certified
Period : September 2022 - September 2023
Summary : fc3toxyft6h4, skilljar.com, https://verify.skilljar.com/c/fc3toxyft6h4
Issuing Authority : Sumo Logic

Title : Sumo Logic Search Mastery Certified
Period : September 2022 - September 2023
Summary : 5bqutvbzwtzu, skilljar.com, https://verify.skilljar.com/c/5bqutvbzwtzu
Issuing Authority : Sumo Logic

Title : Sumo Logic Administration Certified
Period : August 2022 - August 2023
Summary : co5ibmhxix9i, skilljar.com, https://verify.skilljar.com/c/co5ibmhxix9i
Issuing Authority : Sumo Logic

Title : Sumo Logic Cloud Observability Fundamentals Certified
Period : August 2022 - August 2023
Summary : b8r5mcyaahjn, skilljar.com, https://verify.skilljar.com/c/b8r5mcyaahjn
Issuing Authority : Sumo Logic

Languages:

English, Hindi, Marathi

Skills

Leadership

Site Reliability Engineering

Amazon Web Services (AWS)

Infrastructure Automation

Datadog

Python (Programming Language)

Kubernetes

Computer Science

GitLab

Infrastructure as code (IaC)

Terraform

Continuous Integration and Continuous Delivery (CI/CD)

Application Monitoring

Containerization

Automation

Bash

Shell Scripting

ITIL Certified

Incident Management

Troubleshooting
