Storage-Partitioned Join for Apache Spark - Chao Sun, Ryan Blue


Thank you to all of our ApacheCon@Home 2021 sponsors, including:

STRATEGIC
---------------
Google

PLATINUM
--------------
Apple
Huawei
Instaclustr
Tencent Cloud

GOLD
-----------
Aiven OY
AWS
Baidu
Cerner
Didi Chuxing
Dremio
Fiter
Gradle
Red Hat
Replicated

A presentation from ApacheCon@Home 2021
https://apachecon.com/acah2021

Spark currently supports shuffle joins (shuffled hash join or sort-merge join), which require both sides to be partitioned by Spark’s internal hash function. To avoid the potentially expensive shuffle phase, users can create bucketed tables and use bucket joins: the data is pre-partitioned on disk at write time using the same hash function, so it does not need to be shuffled again in subsequent joins.
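To make the bucket-join path concrete, here is a minimal Spark (Scala) sketch; the table names, columns, and bucket count are illustrative, not from the talk.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucket-join-sketch").getOrCreate()

// Write both sides bucketed on the join key with the same bucket count,
// so the data is pre-partitioned on disk by Spark's hash function.
spark.range(1000000L).selectExpr("id AS user_id", "id % 100 AS region")
  .write.bucketBy(32, "user_id").sortBy("user_id")
  .mode("overwrite").saveAsTable("users")

spark.range(1000000L).selectExpr("id AS user_id", "rand() AS amount")
  .write.bucketBy(32, "user_id").sortBy("user_id")
  .mode("overwrite").saveAsTable("orders")

// Both sides share the same bucketing, so the join can be planned without
// re-shuffling either input; the physical plan should show no Exchange nodes.
spark.table("users").join(spark.table("orders"), "user_id").explain()
```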

A storage-partitioned join extends this idea beyond hash-based partitioning and allows other types of partition transforms: two tables that are partitioned by hour could be joined hour by hour, and two tables partitioned by date and a bucket column could be joined using date/bucket partitions. It also paves the way for data sources to provide their own hash functions (for example, for bucketed tables created by Hive).
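As an illustration of partition transforms beyond Spark's hash bucketing, the sketch below uses Apache Iceberg DDL from Spark SQL; the catalog name `demo`, the schemas, and the transform arguments are assumptions, not taken from the talk.

```scala
// Both tables share the same partition transforms (days(ts), bucket(16, id)),
// so a storage-partitioned join can match partitions on the two sides
// instead of shuffling rows. Assumes an Iceberg catalog named `demo`.
spark.sql("""
  CREATE TABLE demo.db.impressions (id BIGINT, ts TIMESTAMP, payload STRING)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, id))
""")

spark.sql("""
  CREATE TABLE demo.db.clicks (id BIGINT, ts TIMESTAMP, url STRING)
  USING iceberg
  PARTITIONED BY (days(ts), bucket(16, id))
""")

// A join on the partition source columns can then be planned partition by partition.
spark.sql("""
  SELECT i.id, i.ts, c.url
  FROM demo.db.impressions i
  JOIN demo.db.clicks c
    ON i.id = c.id AND i.ts = c.ts
""").explain()
```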

In this talk we propose extensions to Spark that enable storage-partitioned joins for v2 data sources, using partitions produced by any transform or combination of partition values. We also propose an extension to Spark’s distribution model to avoid processing large partitions in a single task.
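On the connector side, the proposal roughly amounts to letting a v2 scan report how its input splits are grouped by partition values. The sketch below uses the SupportsReportPartitioning and KeyGroupedPartitioning interfaces as they later appeared in Spark 3.3; treat the exact classes, transforms, and field names as an illustration of the idea rather than the API proposed in this talk.

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.{Scan, SupportsReportPartitioning}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning}

// A v2 scan that tells Spark its splits are already grouped by
// (days(ts), bucket(16, id)). When both sides of a join report compatible
// key-grouped partitioning, Spark can match partition groups instead of shuffling.
trait PartitionReportingScan extends Scan with SupportsReportPartitioning {
  // Number of distinct partition-value groups produced by this scan
  // (hypothetical member; one task per group unless groups are split further).
  def numPartitionGroups: Int

  override def outputPartitioning(): Partitioning =
    new KeyGroupedPartitioning(
      Array[Expression](
        Expressions.days("ts"),        // date-level partition transform
        Expressions.bucket(16, "id")   // bucket transform on the join key
      ),
      numPartitionGroups)
}
```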
