Apache Iceberg Tutorial for Beginners: Understanding Copy-on-write and Merge-on-read

Dremio
Lesson #7 of this Apache Iceberg 101 course focuses on Copy-on-Write (COW) and Merge-on-Read (MOR), two essential concepts in data lakehouse table formats. In this lesson, you will learn what Copy-on-Write and Merge-on-Read are, what a delete file is, and when to use MOR or COW.

Copy-on-Write is a strategy for row-level changes in data lakehouse table formats: when rows are updated or deleted, the data files that contain those rows are rewritten as new files with the changes applied, instead of being modified in place. The original files remain unchanged, so earlier snapshots of the table stay readable while the new snapshot points to the rewritten files. For example, if a user deletes one row from an existing table, the data file that holds that row is copied without it, and the new version of the table references the copy.
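As a rough illustration of Copy-on-Write for a row-level delete, here is a minimal in-memory sketch. It is not Iceberg's implementation: the `DataFile` class, the `cow_delete` helper, and the `.rewritten` path suffix are hypothetical names chosen for this example, and a list of files stands in for a table snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for an immutable data file (e.g. Parquet)."""
    path: str
    rows: tuple


def cow_delete(files, predicate):
    """Build a new snapshot in which rows matching `predicate` are gone.

    Files with no matching rows are reused as-is; files with matches are
    rewritten in full, which is the cost Copy-on-Write pays at write time.
    """
    new_snapshot = []
    for f in files:
        if not any(predicate(row) for row in f.rows):
            new_snapshot.append(f)  # untouched file, shared between snapshots
            continue
        kept = tuple(row for row in f.rows if not predicate(row))
        if kept:  # a file left with no rows is simply dropped
            new_snapshot.append(DataFile(f.path + ".rewritten", kept))
    return new_snapshot


snapshot_1 = [
    DataFile("a.parquet", ((1, "ann"), (2, "bob"))),
    DataFile("b.parquet", ((3, "cat"), (4, "dan"))),
]
# Delete the row with id 2: only a.parquet has to be rewritten.
snapshot_2 = cow_delete(snapshot_1, lambda row: row[0] == 2)
print([f.path for f in snapshot_2])
```

The old snapshot is still intact after the delete, which is why earlier versions of the table remain queryable.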

Merge-on-Read is the other key strategy used in data lakehouse table formats. Instead of rewriting data files, the writer records changes in separate, smaller files (in Apache Iceberg, delete files, plus new data files for updated rows), and the query engine merges them with the existing data files when they're read from storage. The result is a single view that reflects all changes made to the table. For example, when a query scans a data file that has an associated delete file, both are read and the rows marked as deleted are skipped.
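The Merge-on-Read flow can be sketched in the same simplified, in-memory style. This is an assumption-laden toy, not Iceberg's reader: `mor_delete` and `mor_read` are hypothetical helpers, a dict of path to rows stands in for the data files, and each delete file is modelled as a list of `(path, position)` pairs.

```python
def mor_delete(data_files, delete_files, predicate):
    """Record a delete without touching any data file.

    Returns a new list of delete files with one extra entry that marks the
    matching rows by (data file path, row position).
    """
    positions = [
        (path, pos)
        for path, rows in data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    ]
    return delete_files + [positions]


def mor_read(data_files, delete_files):
    """Merge data files with delete files at read time."""
    deleted = {entry for delete_file in delete_files for entry in delete_file}
    return [
        row
        for path, rows in data_files.items()
        for pos, row in enumerate(rows)
        if (path, pos) not in deleted
    ]


data_files = {
    "a.parquet": [(1, "ann"), (2, "bob")],
    "b.parquet": [(3, "cat"), (4, "dan")],
}
delete_files = mor_delete(data_files, [], lambda row: row[0] == 2)
visible_rows = mor_read(data_files, delete_files)
print(visible_rows)
```

The write only produced a small list of positions, while every read now pays for the extra lookup, which is the trade-off between the two strategies.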

In addition to Copy-on-Write and Merge-on-Read, this Apache Iceberg 101 course also covers delete files: files that contain information about rows that have been deleted from a table. Delete files are what Merge-on-Read writes; Copy-on-Write rewrites the affected data files and does not need them. They allow a table to keep track of deleted rows without rewriting entire data files each time something needs to be removed.
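Iceberg defines two kinds of delete files: position deletes, which identify a row by data file path and row position, and equality deletes, which identify rows by column values. A minimal sketch of applying both to one data file might look like the following. The `apply_deletes` helper and the sample data are hypothetical, and the sketch ignores details such as the sequence numbers that decide which data files an equality delete applies to.

```python
def apply_deletes(path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of delete.

    position_deletes: set of (data file path, row position) pairs.
    equality_deletes: list of dicts mapping column name to deleted value.
    """
    live = []
    for pos, row in enumerate(rows):
        if (path, pos) in position_deletes:
            continue  # removed by a position delete
        if any(
            all(row.get(col) == val for col, val in eq.items())
            for eq in equality_deletes
        ):
            continue  # removed by an equality delete
        live.append(row)
    return live


rows = [
    {"id": 1, "name": "ann"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "cat"},
]
position_deletes = {("a.parquet", 0)}  # first row of a.parquet
equality_deletes = [{"id": 3}]         # any row where id equals 3
live_rows = apply_deletes("a.parquet", rows, position_deletes, equality_deletes)
print(live_rows)
```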

When deciding whether to use Copy-on-Write or Merge-on-Read, it's important to consider how often data within the table is modified and how often it is read. If frequent updates or deletes need to be made, then MOR is usually more suitable, since each change only writes small delete files instead of rewriting entire data files. On the other hand, if the table is read far more often than it is changed, then COW is usually more suitable, since reads need no merge step; the price is slower writes.
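In Apache Iceberg the choice is expressed through the table properties `write.delete.mode`, `write.update.mode`, and `write.merge.mode`, each set to `copy-on-write` or `merge-on-read`. The sketch below only builds a Spark-style `ALTER TABLE ... SET TBLPROPERTIES` statement as a string; the `row_level_mode_sql` helper and the table name `db.events` are hypothetical, and applying one mode to all three operations is a simplification, since they can be set independently.

```python
MODE_PROPERTIES = ("write.delete.mode", "write.update.mode", "write.merge.mode")
VALID_MODES = ("copy-on-write", "merge-on-read")


def row_level_mode_sql(table, mode):
    """Render a statement that sets all three row-level modes to `mode`."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    props = ", ".join(f"'{prop}'='{mode}'" for prop in MODE_PROPERTIES)
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({props})"


# Hypothetical write-heavy table: favour cheap writes with Merge-on-Read.
sql = row_level_mode_sql("db.events", "merge-on-read")
print(sql)
```

Merge-on-Read row-level deletes require a format version 2 table, so the statement on its own may not be enough for an older table.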

For more great content on Data Lakehouse topics such as Apache Iceberg's Copy-on-Write (COW) and Merge-on-Read (MOR), as well as Data Warehouse, Data Lake Engine, and Data Lake topics, visit dremio.com/subsurface today!

Connect with us!

Twitter: https://bit.ly/30pcpE1
LinkedIn: https://bit.ly/2PoqsDq
Facebook: https://bit.ly/2BV881V
Community Forum: https://bit.ly/2ELXT0W
Github: https://bit.ly/3go4dcM
Blog: https://bit.ly/2DgyR9B
Questions?: https://bit.ly/30oi8tX
Website: https://bit.ly/2XmtEnN
Published 2 years ago, on 1401/06/17 (Solar Hijri calendar).
8,938 views