groupByKey() operates on pair RDDs and groups all the values associated with a given key. groupBy() can be used on both unpaired and paired RDDs; when used with unpaired data, the key for groupBy() is determined by the function literal passed to the method.
Example
scala> val inputrdd = sc.parallelize(Seq(
     |   ("key1", 1),
     |   ("key2", 2),
     |   ("key1", 3)))
inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[12] at parallelize at <console>:21

//groupByKey() Example
scala> val grouped1 = inputrdd.groupByKey()
grouped1: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[13] at groupByKey at <console>:23

scala> grouped1.collect()
res6: Array[(String, Iterable[Int])] = Array((key1,CompactBuffer(1, 3)), (key2,CompactBuffer(2)))

//groupBy() Example : Find Odd & Even numbers
scala> val grouped2 = inputrdd.groupBy{ x =>
     |   if((x._2 % 2) == 0) {
     |     "evennumbers"
     |   } else {
     |     "oddnumbers"
     |   }
     | }
grouped2: org.apache.spark.rdd.RDD[(String, Iterable[(String, Int)])] = ShuffledRDD[15] at groupBy at <console>:23

scala> grouped2.collect()
res7: Array[(String, Iterable[(String, Int)])] = Array((evennumbers,CompactBuffer((key2,2))), (oddnumbers,CompactBuffer((key1,1), (key1,3))))
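The same key-derivation behaviour can be seen on plain Scala collections, without Spark. This is a minimal sketch (the names `pairs` and `byParity` are illustrative, not from the example above) that mirrors the groupBy() odd/even grouping:

```scala
// Plain Scala (no Spark): groupBy derives each element's key
// from the result of the function literal.
val pairs = Seq(("key1", 1), ("key2", 2), ("key1", 3))

// Group by whether the value is even or odd, like the RDD example.
val byParity = pairs.groupBy { case (_, v) =>
  if (v % 2 == 0) "evennumbers" else "oddnumbers"
}
// byParity: Map(evennumbers -> List((key2,2)), oddnumbers -> List((key1,1), (key1,3)))
```

The difference is that a collection groupBy happens in memory on one machine, whereas the RDD version shuffles data across partitions.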
Note
groupByKey() always results in Hash-Partitioned RDDs
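This can be verified interactively. A spark-shell sketch, assuming the `inputrdd` from the example above (exact output varies by Spark version):

```scala
// Spark shell sketch: inspect the partitioner of a groupByKey() result.
// Assumes `inputrdd` is already defined, as in the example above.
val grouped = inputrdd.groupByKey()
grouped.partitioner
// res: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@...)
```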
Reference
Learning Spark : Hash-Partition : 64