Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

Databricks
In data warehousing, it is common to define one or more columns of a complex type, such as map, and pack many subfields into them. This can hurt query performance dramatically because: 1) It wastes IO. The whole map column, which may contain tens of subfields, needs to be read, and Spark then traverses the whole map to get the value of the target key. 2) Vectorized reads cannot be exploited when a nested-type column is read. 3) Filter pushdown cannot be used when nested columns are read. Over the last year, we have added a series of optimizations in Apache Spark to solve the above problems for Parquet.
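
The IO cost in point 1 can be illustrated with a toy model. The following is a minimal sketch in plain Python, not Spark code: the table, the column name `events`, and the helper `read_subfield` are hypothetical and only mimic how fetching one key from a map column touches every entry of every row's map.

```python
# Toy columnar table: each column is stored as its own list.
# "events" stands in for a map-typed column with several subfields per row
# (hypothetical data, for illustration only).
table = {
    "id": [1, 2, 3],
    "events": [
        {"country": "US", "device": "ios", "duration": 12},
        {"country": "DE", "device": "android", "duration": 7},
        {"country": "US", "device": "web", "duration": 30},
    ],
}


def read_subfield(table, column, key):
    """Return the values of one subfield and the number of map entries read."""
    values, entries_read = [], 0
    for row_map in table[column]:
        entries_read += len(row_map)     # the whole map is loaded for every row
        values.append(row_map.get(key))  # only then is the target key looked up
    return values, entries_read


countries, entries_read = read_subfield(table, "events", "country")
print(countries, entries_read)
```

Only three values are wanted here, yet nine map entries are read; with tens of subfields per map the gap grows accordingly.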

These include support for vectorized reading of complex data types in Parquet and subfield pruning on struct columns in Parquet, among many others. In addition, we designed a new feature, named materialized column, to solve all of the above problems transparently for arbitrary columnar storage (not only Parquet). Materialized columns have worked well in the ByteDance data warehouse over the past year. Take a typical table as an example: its daily incremental data volume is about 200 TB. Creating 15 materialized columns on it improved query performance by more than 110% with less than 7% storage overhead. In this talk, we will take a deep dive into the internals of materialized columns in Spark SQL and describe use cases where materialized columns are useful.
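
The general idea of a materialized column can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions, not the implementation presented in the talk and not a Spark API: the names `materialize`, `rewrite`, and `events_country`, and the `events['country']` expression syntax, are all hypothetical. A frequently queried subfield is copied into its own top-level column at write time, and references to it in a query are rewritten to point at that column.

```python
import re


def materialize(table, map_column, key, new_column):
    """Copy map_column[key] into its own top-level column (done once, at write time)."""
    table[new_column] = [row_map.get(key) for row_map in table[map_column]]


def rewrite(expression, materialized):
    """Replace references such as events['country'] with a materialized column name."""
    def substitute(match):
        reference = (match.group(1), match.group(2))
        # Leave the reference untouched when no materialized column exists for it.
        return materialized.get(reference, match.group(0))
    return re.sub(r"(\w+)\['(\w+)'\]", substitute, expression)


# Hypothetical table with one map-typed column.
table = {
    "events": [
        {"country": "US", "device": "ios"},
        {"country": "DE", "device": "android"},
    ],
}
# Registry of (map column, key) -> materialized column name.
materialized = {("events", "country"): "events_country"}

materialize(table, "events", "country", "events_country")
rewritten = rewrite("events['country'] = 'US' AND events['device'] = 'ios'", materialized)
print(rewritten)
```

Because the rewritten predicate targets an ordinary flat column, a columnar reader can read just that column and apply its usual optimizations to it; the cost is the duplicated values. At the figures quoted above, less than 7% overhead on about 200 TB of daily data corresponds to under roughly 14 TB per day.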

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unifie...

See all the previous Summit sessions:

Connect with us:
Website: https://databricks.com
Facebook: databricksinc
Twitter: databricks
LinkedIn: databricks
Instagram: databricksinc

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. https://databricks.com/databricks-nam...
Published on 1403/04/17.
Viewed 1,486 times.