Parallel table ingestion with a Spark Notebook (PySpark + Threading)

If we want to kick off a single Apache Spark notebook to process a list of tables, the straightforward code is easy to write. But a simple loop over the list ends up processing one table after another (sequentially). When none of the tables is very large, it is quicker to have Spark load them concurrently (in parallel) using multithreading. There are a few different ways to do this, but I am sharing the easiest approach I have found when working with a PySpark notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.
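The full code is in the written tutorial linked below; as a minimal sketch of the pattern, the parallel loop can be built on Python's standard `concurrent.futures.ThreadPoolExecutor`. The helper below is generic, and the commented-out Spark usage is hypothetical (the `load_table` function, table names, and paths are placeholders, not the tutorial's exact code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_in_parallel(fn, items, max_workers=4):
    """Run fn(item) for each item on a thread pool; return {item: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all tasks up front, then collect results as they finish.
        futures = {pool.submit(fn, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

# Hypothetical Spark usage (names and paths are placeholders):
# def load_table(table_name):
#     df = spark.read.parquet(f"/landing/{table_name}")
#     df.write.mode("overwrite").saveAsTable(table_name)
#     return df.count()
#
# run_in_parallel(load_table, ["customers", "orders", "products"], max_workers=3)
```

This works because each Spark action submitted from a separate thread becomes its own job, so the cluster scheduler can run several small table loads at once instead of leaving executors idle between sequential jobs.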

Written tutorial and links to code:
https://dustinvannoy.com/2022/05/06/p...

More from Dustin:
Website: https://dustinvannoy.com
LinkedIn: /dustinvannoy
Twitter: /dustinvannoy
Github: https://github.com/datakickstart

CHAPTERS:
0:00 Intro and Use Case
1:05 Code example single thread
4:36 Code example multithreaded
7:15 Demo run - Databricks
8:46 Demo run - Azure Synapse
11:48 Outro
