Managing ADLS gen2 using Apache Spark

Video description: Managing ADLS gen2 using Apache Spark

Managing big data stored on ADLS gen2/Databricks can be challenging. Setting up security, or moving or copying the data of Hive tables or their partitions, can be very slow, especially when dealing with hundreds of thousands of files. Procter & Gamble developed a framework (to be open-sourced before the conference) that takes the performance of these operations to the next level. By leveraging Apache Spark parallelism, low-level file system operations, and multithreading within the tasks, we managed to reduce the time needed to manage ADLS files by more than 10x. As a result, ADLS file security management can be done by any data engineer without a profound understanding of the ADLS REST API. The framework also gives Apache Spark applications new capabilities: files, folders, tables, and partitions can be moved with just a line of code. This presentation will show the problems we are solving with this framework, as well as previous solutions that did not work well. Next, we will present in detail how the problem was solved using the Spark API and what higher-level methods are available in the framework. We will walk through the available options and planned extensions to the library.
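The combination described above (work split across parallel tasks, with a thread pool inside each task) can be sketched roughly as follows. This is not the Procter & Gamble framework or its API; every function name here is hypothetical, the local file system stands in for ADLS, and a plain list split stands in for Spark partitions so the sketch runs without a cluster.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def move_one(pair):
    """Move a single file and return its destination path."""
    src, dst = pair
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)
    return dst


def move_batch(pairs, threads=8):
    """Move one batch of (src, dst) pairs concurrently with a thread pool.

    Assumption: in a Spark job, a function like this would run once per
    partition, so each task moves its share of the files in parallel threads.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(move_one, pairs))


def chunk(items, n):
    """Split items into n batches (a local stand-in for Spark partitions)."""
    return [items[i::n] for i in range(n)]


def move_all(pairs, batches=4, threads=8):
    """Process every batch; locally the batches run one after another,
    whereas on a cluster they would run as parallel tasks."""
    moved = []
    for batch in chunk(pairs, batches):
        moved.extend(move_batch(batch, threads))
    return moved
```

On a real cluster, the batch loop would presumably be replaced by something like `sc.parallelize(pairs, batches).mapPartitions(lambda it: move_batch(list(it))).collect()`, and `shutil.move` by a file system client that can reach ADLS; both substitutions are assumptions, not details taken from the talk.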

Connect with us:
Website: https://databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / databricks
Instagram: / databricksinc

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-nam...
