Dask DataFrame: An Introduction

In this video, Matt Rocklin gives a brief introduction to Dask DataFrames.

Dask is a free and open-source library for parallel computing in Python. Dask is a community project maintained by developers and organizations.

Dask helps you scale your data science and machine learning workflows and makes it easy to work with NumPy, pandas, and scikit-learn. Dask is also a framework for building distributed applications and has been used alongside dozens of other systems, including XGBoost, PyTorch, Prefect, Airflow, RAPIDS, and more.

Dask DataFrames scale pandas workflows, enabling applications in time series analysis, business intelligence, and general data munging on big data. A Dask DataFrame is one large parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These pandas DataFrames may live on disk, for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent pandas DataFrames.
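The split-apply-combine idea above can be sketched with plain pandas, no Dask required. This is only a conceptual illustration of how one logical operation maps onto per-partition operations; Dask handles the splitting, scheduling, and combining for you.

```python
import pandas as pd

# One "large" frame that we will treat as the logical DataFrame.
df = pd.DataFrame({"x": range(10)}, index=range(10))

# Split it row-wise along the index into three "partitions",
# the way a Dask DataFrame is split into pandas DataFrames.
partitions = [df.loc[0:3], df.loc[4:6], df.loc[7:9]]

# One logical operation (a sum) becomes one operation per partition,
# and the per-partition results are combined at the end.
total = sum(part["x"].sum() for part in partitions)
print(total)  # same answer as df["x"].sum()
```

In real Dask the per-partition work runs in parallel across threads, processes, or cluster workers, but the combining pattern is the same.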

A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. Because the dask.dataframe application programming interface (API) is a subset of the pandas API, it should feel familiar to pandas users.

Share your feedback with us in the comments and let us know:

- Did you find the video helpful?
- Have you used Dask before?

Learn more at https://docs.dask.org/en/latest/dataf...

KEY MOMENTS
00:00 - Intro
00:15 - Start with Pandas
01:22 - Dask DataFrames
02:26 - Multiple files
03:14 - Dask DataFrame Partitions
04:33 - Mapping a Function Across All Partitions
06:35 - Metadata
06:46 - Parquet
