Data Platform Engineer

Job Description

  • Bangalore

About Onehouse

Onehouse is a mission-driven company dedicated to freeing data from data platform lock-in. We deliver the industry’s most interoperable data lakehouse through a cloud-native managed service built on Apache Hudi. Onehouse enables organizations to ingest data at scale with minute-level freshness, centrally store it, and make it available to any downstream query engine and use case (from traditional analytics to real-time AI / ML).

We are a team of self-driven, inspired, and seasoned builders who have created large-scale data systems and globally distributed platforms at the heart of some of the largest enterprises, including Uber, Snowflake, AWS, LinkedIn, Confluent, and many more. Riding on $33M in total funding and a fresh Series A backed by Greylock/Addition, we are quickly expanding and looking for rising talent to grow with us and become future leaders of the team. Come help us build the world's best fully managed and self-optimizing data lake platform!

The Community You Will Join

When you join Onehouse, you're joining a team of passionate professionals tackling the deeply technical challenges of building a two-sided engineering product. Our engineering team serves as the bridge between the worlds of open source and enterprise: contributing directly to and growing Apache Hudi (already used at scale by global enterprises like Uber, Amazon, ByteDance, etc.) and concurrently defining a new industry category - the transactional data lake.

A Typical Day:

  • Be the thought leader for all things data engineering within the company - schemas, frameworks, data models.
  • Implement new sources and connectors to seamlessly ingest data streams.
  • Build scalable job management on Kubernetes to ingest, store, manage, and optimize petabytes of data on cloud storage.
  • Optimize Spark applications to flexibly run in batch or streaming modes based on user needs, balancing latency versus throughput.
  • Tune clusters for resource efficiency and reliability, keeping costs low while still meeting SLAs.

What You Bring to the Table:

  • 3+ years of experience in building and operating data pipelines in Apache Spark.
  • 2+ years of experience with workflow orchestration tools such as Apache Airflow or Dagster.
  • Proficient in Java and in build and packaging tools such as Maven and Gradle.
  • Adept at writing efficient SQL queries and troubleshooting query plans.
  • Experience managing large-scale data on cloud storage.
  • Great problem-solving skills and an eye for detail; can debug failed jobs and queries in minutes.
  • Operational excellence in monitoring, deploying, and testing job workflows.
  • Open-minded, collaborative, self-starter, fast-mover.

Nice-to-haves (but not required):

  • Hands-on experience with Kubernetes (k8s) and its related toolchain in cloud environments.
  • Experience operating and optimizing terabyte-scale data pipelines.
  • Deep understanding of Spark, Flink, Presto, Hive, Parquet internals.
  • Hands-on experience with open source projects like Hadoop, Hive, Delta Lake, Hudi, Nifi, Drill, Pulsar, Druid, Pinot, etc.
  • Operational experience with stream processing pipelines using Apache Flink or Kafka Streams.

Skills

  • Java
  • Apache Spark
  • Maven
  • K8s
  • SQL Queries
  • Kafka Streams
  • Testing

Education

  • Master's Degree
  • Bachelor's Degree

Job Information

Job Posted Date

May 16, 2024

Experience

3 to 7 Years

Compensation (Annual in Lacs)

₹ Market Standard

Work Type

Permanent

Type Of Work

8-hour shift

Category

Information Technology
