PySpark: Python API for Spark

UC Berkeley AMPLab member Josh Rosen presents PySpark, the new Python API for Spark, available in release 0.7. This presentation was given at the Spark meetup at Conviva in San Mateo, CA, on Feb 21, 2013. Download here: http://spark-project.org/downloads/

Summary:
00:33 What is Spark?
03:00 What is PySpark?
03:45 Example Word Count
04:35 Demonstration of interactive shell on AWS EC2
06:22 Tracking elapsed time: %time berkeley_pages.count()
06:37 Spark web interface
09:14 Distributing data, sc.parallelize
11:20 API documentation
11:27 Python doctest, create tests from interactive samples
11:58 Example kmeans.py, k-means clustering
12:39 Getting help: help(sc)
13:00 Example wordcount.py
13:18 PySpark Implementation details
14:15 PySpark is less than 2K lines, including comments
17:18 Pickled Objects, RDD[Array[Byte]]
17:44 Batching Pickle to reduce overhead
18:00 Consolidating operations into single pass when possible
19:27 PySpark roadmap: adding sorting support, file formats such as CSV, PyPy JIT
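
The "Batching Pickle to reduce overhead" item at 17:44 can be illustrated with a minimal sketch in plain Python. This is not PySpark's actual code: the helper names pickle_batched and unpickle_batched and the batch size of 4 are hypothetical, chosen only to show the general idea of serializing several records per pickle call instead of one record at a time.

```python
import pickle


def pickle_batched(items, batch_size):
    """Yield one pickled byte string per group of up to batch_size items.

    Hypothetical helper for illustration; not part of PySpark.
    """
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield pickle.dumps(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield pickle.dumps(batch)


def unpickle_batched(blobs):
    """Flatten pickled batches back into a stream of individual items."""
    for blob in blobs:
        yield from pickle.loads(blob)


# Sample (word, count) records, made up for the example.
records = [("word%d" % i, i) for i in range(10)]

# Unbatched: one serialized object per record.
one_by_one = [pickle.dumps(r) for r in records]

# Batched: one serialized object per group of four records.
batched = list(pickle_batched(records, batch_size=4))

# The round trip preserves both the contents and the order.
restored = list(unpickle_batched(batched))
```

With ten records and a batch size of 4, the batched form produces three serialized objects instead of ten, so any fixed per-object cost is paid fewer times.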
