54 - Spark RDD - PairRDD - GroupByKey


‪@backstreetbrogrammer‬

--------------------------------------------------------------------------------
Chapter 10 - Spark RDD - PairRDD - GroupByKey
--------------------------------------------------------------------------------
While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by a key.

In Java, key-value pairs are represented using the scala.Tuple2 class from the Scala standard library. We can call new Tuple2<>(a, b) to create a tuple, and access its fields later with tuple._1() and tuple._2().
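A minimal sketch of working with Tuple2 from Java, assuming the Scala library (pulled in by the Spark dependency) is on the classpath. The class name Tuple2Sketch and the helper pair(...) are hypothetical names used only for illustration.

```java
import scala.Tuple2;

final class Tuple2Sketch {

    // Builds a typed key-value pair; the diamond operator infers <String, Integer>.
    static Tuple2<String, Integer> pair(String key, int value) {
        return new Tuple2<>(key, value);
    }

    public static void main(String[] args) {
        Tuple2<String, Integer> p = pair("spark", 3);

        // Fields are read through accessor methods; a Tuple2 is immutable.
        String key = p._1();
        Integer value = p._2();
        System.out.println(key + " -> " + value);

        // Tuple2 compares by value, so two tuples with equal fields are equal.
        System.out.println(p.equals(pair("spark", 3)));
    }
}
```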

RDDs of key-value pairs are represented by the JavaPairRDD class. We can construct a JavaPairRDD from a JavaRDD using special versions of the map operations, such as mapToPair and flatMapToPair. The resulting JavaPairRDD has both the standard RDD functions and the special key-value ones.

One big difference between a Java Map and Spark's JavaPairRDD is that a Map holds at most one value per key, whereas a JavaPairRDD can contain many pairs with the same key.
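To make the duplicate-key point concrete, here is a sketch that builds a JavaPairRDD with flatMapToPair and then applies groupByKey so that all values sharing a key end up together. It assumes spark-core is on the classpath and runs in local mode; the class name GroupByKeySketch, the method groupWordsByFirstLetter and the sample data are hypothetical. When run embedded on Java 17 or newer, Spark may also need the --add-opens JVM options that its launch scripts normally supply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

final class GroupByKeySketch {

    // Emits (first letter, word) pairs and groups the words by their first letter.
    static Map<String, List<String>> groupWordsByFirstLetter(List<String> lines) {
        SparkConf conf = new SparkConf()
                .setAppName("GroupByKeySketch")
                .setMaster("local[*]"); // local mode, for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(lines);

            // flatMapToPair: one input line may yield many pairs, and
            // several pairs may carry the same key.
            JavaPairRDD<String, String> pairs = rdd.flatMapToPair(line -> {
                List<Tuple2<String, String>> out = new ArrayList<>();
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    out.add(new Tuple2<>(word.substring(0, 1), word));
                }
                return out.iterator();
            });

            // groupByKey: shuffles all values with the same key into one Iterable.
            JavaPairRDD<String, Iterable<String>> grouped = pairs.groupByKey();

            // Copy to a sorted driver-side Map, where each key appears once.
            Map<String, List<String>> result = new TreeMap<>();
            for (Tuple2<String, Iterable<String>> entry : grouped.collect()) {
                List<String> words = new ArrayList<>();
                entry._2().forEach(words::add);
                Collections.sort(words); // order inside a group is not guaranteed
                result.put(entry._1(), words);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        System.out.println(groupWordsByFirstLetter(
                Arrays.asList("apple avocado", "banana apricot")));
    }
}
```

Because groupByKey moves every value across the shuffle, a per-key aggregate such as a count or sum is usually better expressed with reduceByKey, which combines values on each partition first.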

For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:

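// Assumes sc is a JavaSparkContext and scala.Tuple2 is imported.
final var lines = sc.textFile("data.txt");
final var pairs = lines.mapToPair(s -> new Tuple2<>(s, 1L)); // (line, 1)
final var counts = pairs.reduceByKey(Long::sum);             // sum the 1s per distinct line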

We could also use counts.sortByKey(), for example, to sort the pairs alphabetically by key, and finally counts.collect() to bring them back to the driver program as a List of Tuple2 objects.
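The whole pipeline, including sortByKey() and collect(), could look roughly like the sketch below. It assumes spark-core on the classpath and local mode, and it uses sc.parallelize on an in-memory list as a stand-in for sc.textFile("data.txt") so that it runs without an input file. The names LineCountSketch and countLines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

final class LineCountSketch {

    // Counts how often each distinct line occurs, sorted by line text.
    static List<Tuple2<String, Long>> countLines(List<String> input) {
        SparkConf conf = new SparkConf()
                .setAppName("LineCountSketch")
                .setMaster("local[*]"); // local mode, for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for sc.textFile("data.txt").
            JavaRDD<String> lines = sc.parallelize(input);

            // (line, 1) for every line, then sum the 1s per distinct line.
            JavaPairRDD<String, Long> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1L));
            JavaPairRDD<String, Long> counts = pairs.reduceByKey(Long::sum);

            // sortByKey() orders by key; collect() materializes a List on the driver.
            return counts.sortByKey().collect();
        }
    }

    public static void main(String[] args) {
        for (Tuple2<String, Long> entry : countLines(Arrays.asList("b", "a", "b"))) {
            System.out.println(entry._1() + ": " + entry._2());
        }
    }
}
```

collect() pulls every pair into driver memory, so it is only appropriate when the result is known to be small.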

When using custom objects as the key in key-value pair operations, we must be sure that a custom equals() method is accompanied by a matching hashCode() method.
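A sketch of a custom key class that honours the equals()/hashCode() contract follows. The class CityKey and its fields are hypothetical. To keep the example free of a Spark dependency, it counts keys with a plain HashMap as a stand-in for reduceByKey; both rely on hashCode() to locate a key and on equals() to decide whether two keys match. The class also implements Serializable, on the assumption that keys are shipped between executors with Spark's default Java serialization.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class CityKey implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String country;
    private final String city;

    CityKey(String country, String city) {
        this.country = country;
        this.city = city;
    }

    // Two keys are equal when both fields are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CityKey)) {
            return false;
        }
        CityKey that = (CityKey) other;
        return Objects.equals(country, that.country) && Objects.equals(city, that.city);
    }

    // Must use the same fields as equals(), so equal keys hash identically.
    @Override
    public int hashCode() {
        return Objects.hash(country, city);
    }
}

final class CustomKeySketch {

    // Plain-Java stand-in for mapToPair(k -> (k, 1L)).reduceByKey(Long::sum).
    static Map<CityKey, Long> countByKey(List<CityKey> keys) {
        Map<CityKey, Long> counts = new HashMap<>();
        for (CityKey key : keys) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<CityKey, Long> counts = countByKey(Arrays.asList(
                new CityKey("IN", "Pune"),
                new CityKey("IN", "Pune"),
                new CityKey("JP", "Tokyo")));
        System.out.println(counts.size());
    }
}
```

Without the hashCode() override, the two equal "Pune" keys would almost certainly hash differently and be counted as separate keys.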


Github: https://github.com/backstreetbrogramm...

Playlists:
Apache Spark for Java Developers
Top Java Coding Interview Problems
Java Serialization
Dynamic Programming

#java #javadevelopers #javaprogramming #apachespark #spark
