Saas Talent

Infrastructure Engineer

Job Description

About Simplismart
Simplismart is an MLOps platform with three major suites:

Training suite: Assemble and train any model, including LLMs, vision, audio, tabular, and tree models.
Deployment suite: Most companies fail to make models production-ready. Our proprietary model deployment suite is 6x faster than HuggingFace’s enterprise suite and 12x faster than replicate.ai. Users can easily deploy and auto-scale models trained on Simplismart (which are further optimised), import any model from HuggingFace, or bring their own TensorFlow, PyTorch, ONNX, or JAX artefact.
Observability suite: Monitor model health, including load, latency, uptime, data drift, and concept drift.

Position Overview

As an Infrastructure Engineer, you will contribute to building a highly available, global, multi-cloud PaaS platform using open-source technologies to support Simplismart’s rapid growth. This system spans diverse environments (Kubernetes, VMs, bare-metal compute) and provides a cohesive and reliable abstraction for running AI workloads. You will work with cutting-edge technologies and solve complex problems.

To be successful in this role, you need to be deeply technical, possess strong communication and collaboration skills, and have experience in infrastructure-as-code. Proficiency with tools like Terraform and Ansible and strong software development fundamentals are essential. Additionally, you should have good systems knowledge and strong troubleshooting abilities.

Requirements

5+ years of experience in platform engineering and in writing high-performance, well-tested, production-quality code.
Proficiency in at least one backend programming language (Python desired; C++ is a plus).
Demonstrated experience with high-performance or distributed cloud microservices architectures.
Ideally, experience building and operating global systems across multiple cloud providers such as AWS, Azure, or GCP.
A good understanding of low-level operating system concepts, including multi-threading, memory management, networking, and storage, and of how they affect performance and scale.
Pragmatic, methodical, well-organized, detail-oriented, and self-starting.
Experience with Kubernetes, containerization, Terraform and Ansible.
Experience with PyTorch or TensorFlow is a plus (not required).
Knowledge of GPU programming, NCCL and CUDA is a plus.

Responsibilities

Designing the high-level architecture of the MLOps platform from the ground up.
Handling formalisation of diverse GPU-based workloads.
Developing a robust internal system for continuous deployment of various services and modules in diverse environments.
Creating frameworks for reliable, fault-tolerant systems for mission-critical workloads.

Skills And Attributes

Deep technical expertise.
Strong communication and collaboration skills.
Experience in infrastructure-as-code (Terraform, Ansible).
Strong software development fundamentals.
Good systems knowledge and troubleshooting abilities.
Ability to work independently and as part of a team.
Proactive and self-motivated.

Skills

Python
C++
Kubernetes
Cloud platform
System Administration
Software Development
Troubleshooting

Education

Master's Degree
Bachelor's Degree

Job Information

Job Posted Date

Sep 11, 2024

Experience

5-10 Years

Compensation (Annual in Lacs)

₹ Market Standard

Work Type

Permanent

Type Of Work

8 hour shift