Next Generation Big Data Pipelines with Prefect and Dask


Saturn Cloud's Senior Data Scientist, Aaron Richter, presents on big data pipelines using Prefect and Dask:


Data pipelines are crucial to an organization’s data science efforts. They ensure data is collected and organized in a timely and accurate manner and made available for analysis and modeling. In many cases, these pipelines require parallel computing. That might be because they involve “big compute” (many tasks to execute in parallel) or “big data” (large datasets that must be processed in chunks). In this talk we’ll introduce the next-generation stack for big data pipelines built upon Prefect and Dask, and compare it to popular tools like Spark, Airflow, and the Hadoop ecosystem. We’ll discuss the pros and cons of each, then take a deep dive into Prefect and Dask.
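The “big compute” pattern described above, many independent tasks executed in parallel, can be sketched with Dask’s `delayed` interface (a minimal example; the function names here are illustrative, not from the talk):

```python
import dask

# Each call to a @dask.delayed function builds a lazy task
# in a graph rather than executing immediately.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def total(values):
    return sum(values)

# Many independent tasks ("big compute") ...
tasks = [inc(i) for i in range(10)]

# ... plus one aggregation task that depends on all of them.
# .compute() executes the whole graph, running independent
# tasks in parallel on the available workers.
result = total(tasks).compute()
print(result)
```

The same lazy-graph idea extends to “big data”: `dask.dataframe` applies it to pandas-style DataFrames split into chunks that are processed in parallel.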

Dask is a Python-native parallel computing framework that scales workloads ranging from arbitrary Python functions up to high-level DataFrame and Array objects. It also has machine learning modules optimized to take advantage of these distributed data structures. Prefect is a workflow management system created by engineers who contributed to Airflow, and it was specifically designed to address some of Airflow's shortcomings. It is built around the “negative engineering” paradigm: it takes care of all the little things that might go wrong in a data pipeline. When computations need to be distributed, Prefect integrates seamlessly with Dask clusters through its executor interface.

Slides: https://docs.google.com/presentation/...

For more information or to try this out yourself, get a free trial of Saturn Cloud Hosted here: https://www.saturncloud.io/s/tryhoste...
