groupByKey() operates on pair RDDs (RDDs of key/value tuples) and groups together all the values that share a key. groupBy() can be used on both unpaired and paired RDDs; when the data is unpaired, the grouping key is computed by the function literal passed to the method.
Example
scala> val inputrdd = sc.parallelize(Seq(
     |   ("key1", 1),
     |   ("key2", 2),
     |   ("key1", 3)))
inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:21
//groupByKey() Example
scala> val grouped1 = inputrdd.groupByKey()
grouped1: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[13] at groupByKey at <console>:23
scala> grouped1.collect()
res6: Array[(String, Iterable[Int])] = Array((key1,CompactBuffer(1, 3)), (key2,CompactBuffer(2)))
//groupBy() Example : Find Odd & Even numbers
scala> val grouped2 = inputrdd.groupBy{ x =>
     |   if ((x._2 % 2) == 0) {
     |     "evennumbers"
     |   } else {
     |     "oddnumbers"
     |   }
     | }
grouped2: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[15] at groupBy at <console>:23
scala> grouped2.collect()
res7: Array[(String, Iterable[(String, Int)])] = Array((evennumbers,CompactBuffer((key2,2))), (oddnumbers,CompactBuffer((key1,1), (key1,3))))
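The groupBy() call above still runs on a pair RDD; on truly unpaired data the function literal alone supplies the key. As a minimal sketch of the same semantics, plain Scala collections expose a groupBy with the same contract (minus the shuffle), so the behavior can be illustrated without a SparkContext; the names here are illustrative only:

```scala
object GroupByDemo {
  def main(args: Array[String]): Unit = {
    // Unpaired data: plain Ints, no (key, value) structure.
    val nums = Seq(1, 2, 3, 4, 5)

    // The function literal decides each element's grouping key,
    // just as groupBy() does on an unpaired RDD.
    val grouped = nums.groupBy(n => if (n % 2 == 0) "evennumbers" else "oddnumbers")

    println(grouped("oddnumbers"))  // List(1, 3, 5)
    println(grouped("evennumbers")) // List(2, 4)
  }
}
```

The same function literal applied to `sc.parallelize(nums)` would produce an RDD[(String, Iterable[Int])] with the same groups.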
Note
groupByKey() always results in Hash-Partitioned RDDs
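This can be observed directly in the shell through the `partitioner` field that every RDD carries (a sketch, assuming the running Spark shell and the `inputrdd` defined in the example above):

```scala
// Sketch: requires a live SparkContext `sc`, so this is not runnable
// as a standalone program. groupByKey() installs a HashPartitioner
// on its result, visible via the RDD's `partitioner` field.
val grouped1 = inputrdd.groupByKey()
grouped1.partitioner   // Some(... HashPartitioner ...), i.e. hash-partitioned
inputrdd.partitioner   // None — the source RDD had no partitioner
```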
Reference
Learning Spark : Hash-Partition : 64