Doing More with Data: An Introduction to Arrow for R Users

Speaker: Danielle Navarro, Developer Advocate at Voltron Data

As datasets grow larger and more complex, the boundaries between data engineering and data science are blurring. Data analysis pipelines that handle larger-than-memory data are now commonplace, which creates a gap to be bridged: on one side are engineering tools designed to work with very large datasets, and on the other are data science tools that provide the analysis capabilities used in data workflows.

One way to build this bridge is with Apache Arrow, a multi-language toolbox for working with larger-than-memory tabular data. Arrow is designed to improve performance and efficiency, and places emphasis on standardization and interoperability among workflow components, programming languages, and systems.

This talk introduces the arrow package for R, a mature interface to Apache Arrow that offers an appealing solution for data scientists working with large data in R. It covers the core concepts behind Apache Arrow and the arrow package, walks through a sample data analysis on a large tabular dataset (about 1.7 billion rows), and highlights possible pain points for an R user new to the Arrow ecosystem.
