Infrastructure Engineer

Job Description

About Simplismart
Simplismart is an MLOps platform with three major suites:

  • Training suite: Assemble and train any model, including LLMs, vision, audio, tabular, and tree models.
  • Deployment suite: Most companies fail to make models production-ready. Our proprietary model deployment suite is 6x faster than Hugging Face’s enterprise suite and 12x faster than Replicate. Users can easily deploy and auto-scale models trained on Simplismart (which are more optimised), import any model from Hugging Face, or bring their own model artefact in TensorFlow, PyTorch, ONNX, or JAX format.
  • Observability suite: Monitor model health, including load, latency, uptime, data drift, and concept drift.

Position Overview

As an Infrastructure Engineer, you will contribute to building a highly available, global, multi-cloud PaaS platform using open-source technologies to support Simplismart’s rapid growth. This system spans diverse environments (Kubernetes, VMs, bare-metal compute) and provides a cohesive and reliable abstraction for running AI workloads. You will work with cutting-edge technologies and solve complex problems.

To be successful in this role, you need to be deeply technical, possess strong communication and collaboration skills, and have experience in infrastructure-as-code. Proficiency with tools like Terraform and Ansible and strong software development fundamentals are essential. Additionally, you should have good systems knowledge and strong troubleshooting abilities.

Requirements

  • 5+ years of experience in platform engineering and in writing high-performance, well-tested, production-quality code.
  • Proficiency in at least one backend programming language (Python desired; C++ is a plus).
  • Demonstrated experience with high-performance or distributed cloud microservices architectures.
  • Ideally, experience building and operating global systems across multiple cloud providers such as AWS, Azure, or GCP.
  • A good understanding of low-level operating systems concepts, including multi-threading, memory management, networking and storage, performance, and scale.
  • Pragmatic, methodical, well-organized, detail-oriented, and self-starting.
  • Experience with Kubernetes, containerization, Terraform and Ansible.
  • Experience with PyTorch or TensorFlow is a plus (not required).
  • Knowledge of GPU programming, NCCL and CUDA is a plus.

Responsibilities

  • Designing the high-level architecture of the MLOps platform from the ground up.
  • Handling formalisation of diverse GPU-based workloads.
  • Developing a robust internal system for continuous deployment of various services and modules in diverse environments.
  • Creating frameworks for reliable, fault-tolerant systems that run mission-critical workloads.

Skills And Attributes

  • Deep technical expertise.
  • Strong communication and collaboration skills.
  • Experience in infrastructure-as-code (Terraform, Ansible).
  • Strong software development fundamentals.
  • Good systems knowledge and troubleshooting abilities.
  • Ability to work independently and as part of a team.
  • Proactive and self-motivated.

Skills

  • Python
  • C++
  • Kubernetes
  • Cloud platform
  • System Administration
  • Software Development
  • Troubleshooting

Education

  • Master's Degree
  • Bachelor's Degree

Job Information

Job Posted Date

Sep 11, 2024

Experience

5-10 Years

Compensation (Annual in Lacs)

₹ Market Standard

Work Type

Permanent

Type Of Work

8 hour shift

Category

Information Technology
