05 December 2015

mapValues() Example

When we use map() with a pair RDD, we get access to both the key and the value. There are times when we are only interested in transforming the value (and not the key). In those cases, we can use mapValues() instead of map().
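A minimal sketch of the difference, assuming a spark-shell session where sc is already defined (the RDD contents and variable names here are just illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// map() exposes the whole (key, value) tuple, so we must rebuild the pair ourselves
val viaMap       = pairs.map { case (k, v) => (k, v * 10) }

// mapValues() exposes only the value; the key passes through unchanged
val viaMapValues = pairs.mapValues(v => v * 10)

Both produce the same result here, but mapValues() is the better fit when the key is left untouched.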

In this example we use mapValues() along with reduceByKey() to calculate the average mark for each subject.

scala> val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[29] at parallelize at <console>:21

scala> val mapped = inputrdd.mapValues(mark => (mark, 1))
mapped: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[30] at mapValues at <console>:23

scala> val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[31] at reduceByKey at <console>:25

scala> val average = reduced.map { x =>
     |                      val temp = x._2
     |                      val total = temp._1
     |                      val count = temp._2
     |                      (x._1, total / count)
     |                      }
average: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at map at <console>:27

scala> average.collect()
res30: Array[(String, Int)] = Array((english,65), (maths,55))
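One detail worth noting: since both total and count are Ints, the division above is integer division. It happens to be exact for this data (110 / 2 and 65 / 1), but if fractional averages are needed, a small variation on the last step (sketched here against the reduced RDD from above) would be:

val exactAverage = reduced.mapValues { case (total, count) => total.toDouble / count }
exactAverage.collect()
// e.g. Array((english,65.0), (maths,55.0))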

Note 

Operations like map() always cause the new RDD to not retain the parent's partitioning information, because map() is free to change the key. mapValues() (and flatMapValues()), on the other hand, guarantee that the keys stay the same, so the parent's partitioner is preserved.
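A quick way to see this in the same spark-shell session (HashPartitioner with 2 partitions and the variable name partitioned are just for illustration):

import org.apache.spark.HashPartitioner

val partitioned = inputrdd.partitionBy(new HashPartitioner(2))

// mapValues() keeps the keys intact, so the partitioner is retained
partitioned.mapValues(v => v + 1).partitioner        // Some(HashPartitioner)

// map() could have changed the keys, so Spark drops the partitioner
partitioned.map { case (k, v) => (k, v + 1) }.partitioner   // None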

Reference

Learning Spark, Partitioning, p. 64