Meet Apache Hudi: Supporting the Modern Data Lake


Hudi stands for Hadoop Upserts Deletes and Incrementals. It is a framework originally built at Uber to bring ACID guarantees to distributed data stored on Hadoop/HDFS or object stores like S3. Compared with Apache Iceberg or Delta Lake, it ships with additional features such as Merge on Read (MOR) tables and a bulk-insert write operation.

Data Lake

The data lake is the most widely used data architecture at most companies as of 2023 because, unlike a data warehouse, it offers the flexibility to store structured, semi-structured, and unstructured data. With the evolution of cloud applications, the need to process big data keeps growing, so data is stored in compressed file formats such as Parquet, ORC, and Avro to optimize storage.

But a set of challenges arises when processing data in these formats:

1) Change Data Capture (CDC): As the name suggests, this is the process of capturing changes (insert, update, and delete operations) from a source database and replicating them into a target destination.

Three commonly used CDC methods:

a) Query-based: SQL queries with a filter condition (typically on a last-modified timestamp column) that identify the changes; a minimal sketch appears after this list of challenges.

b) Log-based: Uses the database's transaction logs to replay change events to the destination.

c) Trigger-based: Database triggers that invoke a procedure to record changes whenever insert, update, or delete events occur.

2) Incremental Data Consumption: Because files in these formats are immutable and cannot be updated in place, an ETL process has to read the entire dataset, apply the changes, and rewrite it as new files.

3) Lack of support for ACID transactions: These file formats by themselves offer no functionality for record-level update, delete, or insert operations. This is where Hudi helps, giving our table ACID properties similar to an RDBMS.
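To make the query-based CDC approach concrete, here is a minimal Python sketch. The `orders` table, its `updated_at` column, and the SQLite connection are all hypothetical stand-ins for a real source database and driver; note that hard deletes are invisible to this method.

```python
import sqlite3  # stand-in for any source database driver

def pull_changes(conn, last_watermark):
    """Query-based CDC: fetch rows modified since the previous run.

    Assumes a hypothetical `orders` table whose `updated_at` column is kept
    current on every insert/update. Deleted rows simply disappear and are
    never captured, which is the main limitation of this method.
    """
    cursor = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    # Advance the watermark to the newest timestamp seen in this batch
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark

# Usage: run on a schedule, persisting the watermark between runs
conn = sqlite3.connect("source.db")
changes, watermark = pull_changes(conn, "1970-01-01 00:00:00")
print(f"captured {len(changes)} changed rows; next watermark = {watermark}")
```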

Tackling the Problem:

1) Log-based CDC provides the best results, since it captures deleted records and other database transactions/events that query-based CDC cannot see.

2) Apache Hudi applies those changes to the target tables in the data lake, managing the indexes and the metadata required to hydrate a data lakehouse at scale.

3) Apache Hudi provides ACID transactions and concurrency-control APIs so that multiple readers and writers do not see inconsistent results; a minimal upsert sketch follows this list.
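Here is a minimal PySpark sketch, under the assumption that the Hudi Spark bundle is on the classpath (for example launched with `--packages org.apache.hudi:hudi-spark3.4-bundle_2.12:<version>`). The table name, columns, and S3 path are hypothetical; it performs an ACID upsert and then an incremental read of only the newly committed records.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

base_path = "s3a://my-bucket/hudi/orders"  # hypothetical target location

# Pretend these rows came from the CDC step above
updates = spark.createDataFrame(
    [(1, "SHIPPED", "2023-05-01 10:00:00", "2023-05-01"),
     (2, "NEW",     "2023-05-01 10:05:00", "2023-05-01")],
    ["id", "status", "updated_at", "ds"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "id",            # record key
    "hoodie.datasource.write.precombine.field": "updated_at",   # latest row wins on key collisions
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.operation": "upsert",              # insert new keys, update existing ones
}

# Each commit is atomic: concurrent readers see either the old or the new snapshot
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

# Incremental consumption: read only records written after a given commit time
incremental = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "0")  # "0" = from the first commit
    .load(base_path)
)
incremental.show()
```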

Types of tables:

1) Copy on Write (COW): Data is stored in the Parquet file format (columnar storage), and each write rewrites the affected files, creating a new version in S3. This table type is best suited for read-heavy batch workloads because the latest version of the dataset is always available in columnar form.

2) Merge on Read (MOR): Data is stored as a combination of Parquet (columnar) base files and Avro (row-based) delta log files. Commits are appended as delta files, which makes this table type better suited for write-heavy streaming workloads; reading the latest data requires compaction, which merges the Parquet and Avro files into new versions of the columnar files.
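Continuing the sketch above, the table type is selected with one more write option: COPY_ON_WRITE is the default, while MERGE_ON_READ writes delta log files and relies on compaction. The second path below is a hypothetical separate location, since a table's type is fixed when it is first created.

```python
# Reusing the hypothetical `updates` DataFrame and `hudi_options` from the upsert sketch
cow_options = {
    **hudi_options,
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # rewrite Parquet files on each commit
}
updates.write.format("hudi").options(**cow_options).mode("append").save(base_path)

mor_options = {
    **hudi_options,
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # append Avro delta logs, compact later
    "hoodie.compact.inline": "true",                        # run compaction as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",         # after every 5 delta commits
}
mor_path = "s3a://my-bucket/hudi/orders_mor"  # hypothetical separate table location
updates.write.format("hudi").options(**mor_options).mode("append").save(mor_path)
```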

I hope this short article on Apache Hudi was of some help.

Not to forget, the articles below helped me understand the concepts better:
https://www.qubole.com/blog/data-lake-essentials-part-2-file-formats-compression-and-security
https://medium.com/slalom-build/data-lakehouse-building-the-next-generation-of-data-lakes-using-apache-hudi-41550f62f5f

See you in the next post with some implementation ;)