Platform Engineer (AI Labs)

Job Description

We seek an AI Cloud Platform Systems Engineer to build, scale, and optimize our LLM training, inference, and data platforms. This role spans distributed training systems, GPU/CPU compute optimization, inference framework optimization, and data platforms for training and inference. You will ensure a resilient, cost-efficient platform for both training and production inference workloads, leveraging Kubernetes-native solutions.

Key Responsibilities

Distributed Training/Inference Platform Development

  • Design and maintain scalable platforms for distributed AI/ML training and serverless inference.
  • Optimize workload distribution across GPU clusters (e.g., model parallelism, mixed-precision training) for performance and cost.
  • Integrate frameworks like PyTorch, DeepSpeed, Triton, vLLM, and NVIDIA NeMo.
  • Collaborate with AI researchers to optimize model architectures for training/inference latency and throughput.

Platform & System Optimization

  • Compute: Profile and debug bottlenecks using tools like PyTorch Profiler and NVIDIA Nsight.
  • Storage/Caching: Build high-throughput data pipelines using S3, PVC, or distributed streaming (e.g., Kafka).
  • Networking: Reduce bottlenecks via RDMA/InfiniBand, NCCL, and TCP/IP tuning.
  • GPU Utilization: Implement kernel fusion, memory optimization, and auto-scaling.
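The auto-scaling responsibility above can be sketched as a simple utilization-driven policy, similar in spirit to the Kubernetes HPA scaling rule. This is an illustrative sketch only; the function and parameter names (`decide_replicas`, `target_gpu_util`) are hypothetical, not part of the role or any specific product.

```python
# Illustrative utilization-based auto-scaling policy, modeled on the
# Kubernetes HPA formula: desired = ceil(current * observed / target),
# clamped to a [min, max] range. All names here are hypothetical.
import math

def decide_replicas(current_replicas: int,
                    observed_gpu_util: float,
                    target_gpu_util: float = 0.7,
                    min_replicas: int = 1,
                    max_replicas: int = 16) -> int:
    """Return the desired replica count given average GPU utilization."""
    if current_replicas <= 0:
        # Nothing running yet: start at the floor of the allowed range.
        return min_replicas
    desired = math.ceil(current_replicas * observed_gpu_util / target_gpu_util)
    # Clamp so scaling decisions stay within the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas at 90% average GPU utilization against a 70% target would scale out to 6 replicas; the same policy scales in when the cluster is mostly idle.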

Kubernetes-Centric Development

  • Develop Kubernetes Custom Resource Definitions (CRDs) to automate deployment, scaling, fault recovery, and monitoring of AI workloads.
  • Build operators for intelligent resource scheduling, auto-scaling (HPA/VPA), and fault tolerance for distributed training/inference jobs.
  • Build observability tools for GPU utilization, model latency, and system health.
  • Leverage tools like Kubeflow, KServe, KubeRay, or SkyPilot for workflow orchestration.
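The operator work described above follows the standard reconcile pattern: compare the desired state declared in a CRD against the observed cluster state and emit corrective actions. A minimal sketch of that loop, with hypothetical types and action names (a real operator would use a framework such as kopf or controller-runtime against the API server):

```python
# Toy reconcile loop for a hypothetical "TrainJob" custom resource.
# Field and action names are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class TrainJobSpec:       # desired state, as declared in the CRD
    replicas: int

@dataclass
class TrainJobStatus:     # observed state, as reported by the cluster
    ready_replicas: int
    failed_replicas: int

def reconcile(spec: TrainJobSpec, status: TrainJobStatus) -> list[str]:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    # Fault recovery: replace failed workers before scaling.
    actions += ["restart-worker"] * status.failed_replicas
    live = status.ready_replicas
    if live < spec.replicas:
        actions += ["create-worker"] * (spec.replicas - live)
    elif live > spec.replicas:
        actions += ["delete-worker"] * (live - spec.replicas)
    return actions
```

The key design property is idempotence: the controller can re-run `reconcile` on every watch event, and once observed state matches the spec it returns no actions.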

Preferred Qualifications

Technical Skills

  • 2+ years of experience in ML infrastructure (LLM training/inference platforms preferred).
  • Proficiency in Kubernetes (CRDs, Operators, Helm, Knative, KServe), PyTorch, and cloud-native systems (AWS/GCP/Azure).
  • Expertise in distributed training optimizations (e.g., NeMo, PyTorch, DeepSpeed) and inference frameworks (e.g., Triton, vLLM, SGLang).
  • LLM-specific optimizations (e.g., MoE architectures, speculative decoding).
  • Networking (InfiniBand, NCCL) and storage solutions (S3, Ceph/MinIO, PVC).

Education & Soft Skills

  • MS/PhD in Computer Science, AI/ML, or equivalent hands-on experience.
  • Strong collaboration skills to interface with research and engineering teams.
  • Problem-solving agility to balance performance, cost, and scalability.

Skills

  • ML
  • Kubernetes
  • Cloud platform
  • AI
  • LLM

Education

  • Master's Degree
  • Bachelor's Degree

Job Information

Job Posted Date

Apr 02, 2025

Experience

2 to 6 Years

Compensation (Annual in Lacs)

Best in the Industry

Work Type

Permanent

Type Of Work

12-hour shift

Category

Information Technology

Copyright © 2022 All Rights Reserved. Saas Talent