Spark Streaming: Large Scale near real-time Stream Processing

Tathagata Das of the UC Berkeley AMPLab presents Spark Streaming, which was released as alpha in Spark 0.7. This presentation was given at the Spark meetup on Feb 21st, 2013, at Conviva in San Mateo, CA.

Download: http://spark-project.org/downloads/

Summary:
00:09 Motivation
01:07 Case study: Conviva, Inc.
03:26 Goals
04:04 Existing Streaming Systems
05:07 Storm and Trident
06:40 Discretized Stream Processing
Series of very small, deterministic batch jobs
07:52 State between batches in memory, immutable, fault tolerant
08:11 Minimum batch time period from 1/2 second to approximately 1 second
08:46 Visual representation of Discretized Stream Processing
16:32 Fault Recovery
17:02 Fault Recovery is computed in parallel
17:12 Programming Model and DStreams
17:53 DStream Data Sources, {HDFS, Kafka, Flume, Twitter, TCP sockets,
Akka actor, ZeroMQ}
18:34 Transformations of DStreams
RDD like operations, New window and stateful operations
19:18 Output: HDFS, console, foreach arbitrary operation on every RDD
19:53 Example: 20 most popular hashtags in the last 10 minutes of tweet stream
23:15 Smart window-based reduce
25:24 Sort transform by key on hashtags
27:09 Demo using AWS
29:39 Other Operations, Maintaining state, tracking sessions
30:45 Performance, Can process 6 GB/sec (60M records/sec)
on 100 nodes at sub-second latency, Grep, WordCount
31:32 Comparison
Spark Streaming: 670k records/sec/node
Storm: 115k records/sec/node
Apache S4: 7.5k records/sec/node
32:30 Fast Fault Recovery, recovers from faults/stragglers within 1 sec
32:53 Real Applications: Conviva real-time monitoring of video metadata
34:05 Real Applications: Mobile Millennium Project, traffic estimation
Markov chain Monte Carlo simulations on GPS observations
35:39 Failure semantics
35:53 Java API for Streaming
36:06 Contributors, 5 from UC Berkeley, 3 external contributors
36:12 Vision: one-stop shop,
stream processing + Ad-hoc queries + batch processing
37:24 Questions
38:00 Strata Conference presentations on Berkeley Data Analytics Stack (BDAS)
38:37 Conclusion
New Streaming guide
Paper describing the Spark Streaming system: http://tinyurl.com/dstreams
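
The hashtag example (19:53) and the window-based reduce (23:15) listed in the summary can be illustrated with a small sketch. This is plain Python, not the Spark Streaming API, and it is not taken from the talk: the class name WindowedHashtagCounter, its methods, and the choice to measure the window in batches are all assumptions made for illustration. It shows one plausible reading of a "smart" windowed reduce: keep running totals, add the counts of the batch entering the window, and subtract the counts of the batch leaving it, rather than recounting every batch in the window.

```python
from collections import Counter, deque


class WindowedHashtagCounter:
    """Hypothetical helper: hashtag counts over the last N batches."""

    def __init__(self, window_batches):
        self.window_batches = window_batches  # window length, in batches
        self.batches = deque()                # per-batch counts inside the window
        self.totals = Counter()               # running counts over the window

    def add_batch(self, tweets):
        # Count hashtags in the new batch only.
        new_counts = Counter(
            word.lower()
            for tweet in tweets
            for word in tweet.split()
            if word.startswith("#") and len(word) > 1
        )
        # Incremental update: add the batch entering the window ...
        self.batches.append(new_counts)
        self.totals.update(new_counts)
        # ... and subtract the batch leaving it, instead of recounting.
        if len(self.batches) > self.window_batches:
            old_counts = self.batches.popleft()
            self.totals.subtract(old_counts)
            # Unary plus keeps only hashtags with a positive count.
            self.totals = +self.totals

    def top(self, n):
        # Sort by count (descending), then by tag for a deterministic order.
        return sorted(self.totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    counter = WindowedHashtagCounter(window_batches=2)
    counter.add_batch(["#spark is fast", "try #spark and #storm"])
    counter.add_batch(["#storm talk", "#kafka feeds #spark"])
    counter.add_batch(["#kafka #kafka"])  # the first batch now leaves the window
    print(counter.top(2))
```

With a 10-minute window and a 1-second batch interval, the window would span 600 batches, so the per-batch cost of the incremental update stays proportional to two batches rather than to the whole window.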
