Job Description
Who we are
Mindtickle is the market-leading revenue productivity platform that combines on-the-job learning and deal execution to get more revenue per rep. Mindtickle is recognized as a market leader by top industry analysts and is ranked by G2 as the #1 sales onboarding and training product. We’re honoured to be recognized as a Leader in the first-ever Forrester Wave™: Revenue Enablement Platforms, Q3 2024!
Job Snapshot
As an SRE II, you will play a key role in ensuring the reliability, performance, and scalability of our mission-critical systems. You will work closely with engineering teams to design, implement, and maintain infrastructure that supports high-volume, data-intensive applications. Your expertise in monitoring, troubleshooting, and automation will drive operational excellence across our distributed environment.
What’s in it for you?
- Maintain and improve the reliability, availability, and performance of high-volume, data-intensive applications.
- Design, implement, and enhance monitoring, logging, and alerting solutions at scale.
- Collaborate with development teams to optimize system architecture and reliability.
- Manage and troubleshoot distributed systems in a Linux-based production environment.
- Leverage AWS cloud services to scale infrastructure efficiently.
- Utilize Kubernetes for container orchestration, ensuring optimal resource utilization and deployment strategies.
- Implement CI/CD pipelines using GitLab to automate deployments and operational tasks.
- Use infrastructure as code (IaC) tools such as Terraform and CloudFormation for provisioning and managing cloud resources.
- Implement observability best practices using Grafana, Prometheus, Thanos, and Loki.
- Perform root cause analysis (RCA) and proactively address performance bottlenecks and system failures.
- Ensure security best practices and compliance across all infrastructure components.
We’d love to hear from you if you:
- Have 3+ years of experience in Site Reliability Engineering or related fields.
- Possess strong Linux fundamentals with a deep understanding of system internals.
- Have expertise in troubleshooting and problem-solving in distributed environments.
- Have hands-on experience with logging and monitoring solutions at scale.
- Are proficient in at least one programming language (preferably Python).
- Have strong experience with AWS services and Kubernetes.
- Have exposure to CI/CD pipelines, preferably using GitLab CI/CD.
- Have experience with infrastructure as code (Terraform, CloudFormation).
- Are familiar with observability tools such as Grafana, Prometheus, Thanos, and Loki.
Preferred Qualifications
- Experience in performance tuning and capacity planning.
- Knowledge of incident management and post-mortem analysis processes.
- Familiarity with security best practices in cloud environments.
- Experience in automating operational tasks using scripting and configuration management tools.