Jim Crist: Dask Parallelizing NumPy and Pandas through Task Scheduling

Описание к видео Jim Crist: Dask Parallelizing NumPy and Pandas through Task Scheduling

PyData NYC 2015

Dask is a pure python library that allows for easy parallelism through task scheduling and blocked algorithms. By leveraging the existing PyData ecosystem (NumPy, Pandas, etc...), as well as some clever algorithms, we're able to compute on arrays and dataframes that are larger than memory, while exploiting parallelism.

The PyData ecosystem is great for doing data analysis. Packages like NumPy and Pandas provide an excellent interface to doing complicated computations on datasets. With only a few lines of code one can load some data into a NumPy array, run some analysis, and plot the results. However, this workflow starts to falter when working with data that's larger than the memory on your computer.

Dask is designed to fit the space between in memory tools like NumPy/Pandas and distributed tools like Spark/Hadoop. By using blocked algorithms and the existing Python ecosystem, it's able to work efficiently on large arrays or dataframes - often in parallel.

In this talk we'll discuss both the what and the how of Dask. Starting from examples of using dask collections that mirror NumPy arrays and Pandas DataFrames, we'll then dive into how these collections are actually implemented. Along the way we'll discuss the global interpreter lock and its relevance (or lack of relevance) to parallel computation in numeric Python.

Slides available here: https://speakerdeck.com/jcrist/pandas...

Github Repo: https://github.com/jcrist/Dask_PyData... 00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...

Комментарии

Информация по комментариям в разработке