Storage-Partitioned Join for Apache Spark - Chao Sun, Ryan Blue


Thank you to all of our ApacheCon@Home 2021 sponsors, including:

STRATEGIC
---------------
Google

PLATINUM
--------------
Apple
Huawei
Instaclustr
Tencent Cloud

GOLD
-----------
Aiven OY
AWS
Baidu
Cerner
Didi Chuxing
Dremio
Fiter
Gradle
Red Hat
Replicated

A presentation from ApacheCon@Home 2021
https://apachecon.com/acah2021

Spark currently supports shuffle joins (shuffled hash join or sort-merge join), which require both sides to be partitioned by Spark’s internal hash function. To avoid the potentially expensive shuffle phase, users can create bucketed tables and use bucket joins: the data is pre-partitioned on disk at write time using the same hash function, so it does not need to be shuffled again in subsequent joins.
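To make the bucket-join path concrete, here is a minimal Spark (Scala) sketch; the table names, columns, and bucket count are illustrative, not from the talk.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-join-sketch").getOrCreate()

// Write both sides bucketed on the join key with the same bucket count,
// so the data is pre-partitioned on disk by Spark's hash function.
spark.range(1000000L).selectExpr("id AS user_id", "id % 100 AS region")
  .write.bucketBy(32, "user_id").sortBy("user_id")
  .mode("overwrite").saveAsTable("users")

spark.range(1000000L).selectExpr("id AS user_id", "rand() AS amount")
  .write.bucketBy(32, "user_id").sortBy("user_id")
  .mode("overwrite").saveAsTable("orders")

// Both sides share the same bucketing, so the join can be planned without
// re-shuffling either input; the physical plan should show no Exchange nodes.
spark.table("users").join(spark.table("orders"), "user_id").explain()
```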

A storage-partitioned join extends this idea beyond hash-based partitioning and allows other types of partition transforms: two tables that are partitioned by hour could be joined hour by hour, and two tables partitioned by date and a bucket column could be joined using date/bucket partitions. It also paves the way for data sources to provide their own hash functions (for example, for bucketed tables created by Hive).
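As an illustration of partition transforms beyond Spark's hash bucketing, the sketch below uses Apache Iceberg DDL from Spark SQL; the catalog name `demo`, the schemas, and the transform arguments are assumptions, not taken from the talk.

```scala
// Both tables share the same partition transforms (days(ts), bucket(16, id)),
// so a storage-partitioned join can match partitions on the two sides
// instead of shuffling rows. Assumes an Iceberg catalog named `demo`.
spark.sql("""
  CREATE TABLE demo.db.impressions (id BIGINT, ts TIMESTAMP, payload STRING)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, id))
""")

spark.sql("""
  CREATE TABLE demo.db.clicks (id BIGINT, ts TIMESTAMP, url STRING)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, id))
""")

// A join on the partition source columns can then be planned partition by partition.
spark.sql("""
  SELECT i.id, i.ts, c.url
  FROM demo.db.impressions i
  JOIN demo.db.clicks c
    ON i.id = c.id AND i.ts = c.ts
""").explain()
```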

In this talk we propose extensions to Spark that enable storage-partitioned joins for v2 data sources, using partitions produced by any transform or combination of partition values. We also propose an extension to Spark’s distribution model to avoid processing large partitions in a single task.
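On the connector side, the proposal roughly amounts to letting a v2 scan report how its input splits are grouped by partition values. The sketch below uses the SupportsReportPartitioning and KeyGroupedPartitioning interfaces as they later appeared in Spark 3.3; treat the exact classes, transforms, and field names as an illustration of the idea rather than the API proposed in this talk.

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}

// A v2 scan that tells Spark its splits are already grouped by
// (days(ts), bucket(16, id)). When both sides of a join report compatible
// key-grouped partitioning, Spark can match partition groups instead of shuffling.
trait PartitionReportingScan extends Scan with SupportsReportPartitioning {
  // Number of distinct partition-value groups produced by this scan
  // (hypothetical member; one task per group unless groups are split further).
  def numPartitionGroups: Int

  override def outputPartitioning(): Partitioning =
    new KeyGroupedPartitioning(
      Array[Expression](
        Expressions.days("ts"),        // date-level partition transform
        Expressions.bucket(16, "id")   // bucket transform on the join key
      ),
      numPartitionGroups)
}
```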
