Out-of-core Computations with nanoarrow

As data growth outpaces advances in computer memory, data analysts inevitably run into issues when computations require more memory than the system can provide. Fortunately, better tooling continues to evolve that makes computing over ever-growing datasets possible.

In this video, I walk you through how to build an out-of-core summation algorithm using a combination of nanoarrow and C++. With a basic understanding of the Apache Arrow project and its data structures, you can use nanoarrow to write simple, expressive algorithms at scale.
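As a rough illustration of the idea (not the video's exact code), here is a minimal C++ sketch of the out-of-core pattern: pull one record batch at a time from an ArrowArrayStream with nanoarrow and fold it into a running sum, so only a single batch is ever resident in memory. The function name SumFirstColumn is made up for illustration, it assumes the first column is an integer type, and the nanoarrow include path may differ depending on how you build the library.

```cpp
#include <cstdint>
#include <stdexcept>

#include <nanoarrow/nanoarrow.h>

// Sum the first column of every batch produced by the stream, holding
// only one batch in memory at a time.
int64_t SumFirstColumn(struct ArrowArrayStream* stream) {
  struct ArrowError error{};
  struct ArrowSchema schema{};
  if (ArrowArrayStreamGetSchema(stream, &schema, &error) != NANOARROW_OK) {
    throw std::runtime_error(error.message);
  }

  // An ArrowArrayView over the first child (column) lets us read values
  // without hand-decoding buffers.
  struct ArrowArrayView view{};
  if (ArrowArrayViewInitFromSchema(&view, schema.children[0], &error) !=
      NANOARROW_OK) {
    ArrowSchemaRelease(&schema);
    throw std::runtime_error(error.message);
  }

  int64_t total = 0;
  while (true) {
    struct ArrowArray batch{};
    if (ArrowArrayStreamGetNext(stream, &batch, &error) != NANOARROW_OK) {
      ArrowArrayViewReset(&view);
      ArrowSchemaRelease(&schema);
      throw std::runtime_error(error.message);
    }
    // A NULL release callback signals the end of the stream.
    if (batch.release == nullptr) break;

    if (ArrowArrayViewSetArray(&view, batch.children[0], &error) !=
        NANOARROW_OK) {
      ArrowArrayRelease(&batch);
      ArrowArrayViewReset(&view);
      ArrowSchemaRelease(&schema);
      throw std::runtime_error(error.message);
    }

    for (int64_t i = 0; i < view.length; i++) {
      if (!ArrowArrayViewIsNull(&view, i)) {
        total += ArrowArrayViewGetIntUnsafe(&view, i);
      }
    }

    // Release the batch before pulling the next one.
    ArrowArrayRelease(&batch);
  }

  ArrowArrayViewReset(&view);
  ArrowSchemaRelease(&schema);
  return total;
}
```

Because each batch is released as soon as it has been folded into the running total, memory usage stays proportional to the batch size rather than the dataset size.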

As you will see in the video, the summation algorithm we implement can sum over 80 GB of data, all from the comfort of my laptop! By following the same development pattern, you can apply your own algorithms to datasets of any size, and potentially even deploy them to a distributed computing environment!
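To give a sense of how such an algorithm gets wired up for Python, here is a hedged nanobind sketch, in the spirit of the video's extension but with made-up module and function names. It assumes the Arrow PyCapsule protocol: calling __arrow_c_stream__() on a pyarrow RecordBatchReader or Table yields a PyCapsule named "arrow_array_stream" wrapping an ArrowArrayStream.

```cpp
#include <cstdint>

#include <nanobind/nanobind.h>
#include <nanoarrow/nanoarrow.h>

namespace nb = nanobind;

// Implemented in the sketch above.
int64_t SumFirstColumn(struct ArrowArrayStream* stream);

NB_MODULE(bearly_sketch, m) {
  m.def("sum_first_column", [](nb::object stream_like) {
    // Ask the producer for its C stream capsule via the Arrow PyCapsule
    // protocol (pyarrow RecordBatchReader and Table implement this).
    nb::object exporter = stream_like.attr("__arrow_c_stream__");
    nb::object capsule = exporter();

    auto* stream = static_cast<struct ArrowArrayStream*>(
        PyCapsule_GetPointer(capsule.ptr(), "arrow_array_stream"));
    if (stream == nullptr) {
      throw nb::python_error();
    }

    // The capsule keeps ownership of the stream and will release it when
    // it is destroyed, after we are done consuming it here.
    return SumFirstColumn(stream);
  });
}
```

From Python, a pyarrow RecordBatchReader streaming a large dataset from disk could then be passed straight to this function, and the full dataset never needs to fit in memory at once.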

Source code for this project is available at https://github.com/WillAyd/bearly/

00:00 - Introduction and Solution Overview
01:25 - PyArrow, Arrow, and nanoarrow Terminology
05:43 - Initial project with meson, nanoarrow, nanobind
11:29 - Starting our Python extension
13:15 - Passing Arrow streams from Python to our extension
19:45 - Extracting the schema from our Arrow stream
22:32 - Iterating the batches of our Arrow stream
24:31 - Iterating the arrays of each batch
28:08 - Summing the elements of each array
29:36 - Testing our solution
31:25 - Conclusion and next steps
