Parallelization of Structured Streaming Jobs Using Delta Lake

Description:

We tackle the problem of running streaming jobs from another perspective using Databricks Delta Lake, while examining some of the issues we faced at Tubi when running regular Structured Streaming. We start with a quick overview of why we transitioned from Parquet data files to Delta and the problems that solved for our streaming jobs. After converting our datasets to Delta Lake, we explore techniques for maximizing cluster utilization by submitting multiple streaming jobs from the driver to run in parallel using Scala parallel collections. We discuss techniques for writing idempotent tasks that can be parallelized. In conclusion, we cover an advanced topic: running a parallel streaming backfill job and the nuances of handling failure and recovery. Demos using Databricks notebooks are shown throughout the presentation.
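The idea of submitting several idempotent jobs from the driver in parallel can be sketched roughly as follows. This is not the code from the talk: the talk uses Scala parallel collections against Spark and Delta Lake, while this illustration substitutes Python's standard-library thread pool and an in-memory dictionary so that it runs on its own. The names `sink`, `process_batch`, `run_parallel`, and the sample job list are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical in-memory stand-in for a table sink: (table, batch_id) -> rows.
# A real job would write to Delta Lake instead (assumption for illustration).
sink = {}
sink_lock = threading.Lock()


def process_batch(table, batch_id, rows):
    """Idempotent write: re-running the same (table, batch_id) overwrites
    the earlier result instead of appending, so a retry cannot duplicate data."""
    transformed = [r.upper() for r in rows]
    with sink_lock:
        sink[(table, batch_id)] = transformed
    return table, batch_id, len(transformed)


def run_parallel(jobs, max_workers=4):
    """Submit every job from the driver at once and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_batch, *job) for job in jobs]
        return sorted(f.result() for f in futures)


# Hypothetical job list: (table name, batch id, input rows).
jobs = [
    ("events", 0, ["play", "pause"]),
    ("events", 1, ["seek"]),
    ("users", 0, ["signup"]),
]

first = run_parallel(jobs)
second = run_parallel(jobs)  # simulated retry or backfill re-run
```

Because each task keys its output by table and batch, running the whole job list a second time leaves the sink unchanged, which is the property that makes recovery after a partial failure safe in this sketch.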

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unifie...

Connect with us:
Website: https://databricks.com
Facebook: /databricksinc
Twitter: /databricks
LinkedIn: /databricks
Instagram: /databricksinc

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-nam...
