When we use map() with a Pair RDD, we get access to both the key and the value. At times we may be interested only in the value (and not the key). In those cases, we can use mapValues() instead of map().
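The same distinction exists on Scala's own collections, so we can sketch it without a Spark cluster: Map.mapValues, like RDD.mapValues, passes only the value to the function, while map receives the whole (key, value) pair. This is an illustrative plain-Scala sketch, not Spark code.

```scala
val marks = Map("maths" -> 50, "english" -> 65)

// map(): the function sees the whole (key, value) pair, so we must
// deconstruct it and rebuild the pair ourselves.
val viaMap = marks.map { case (subject, mark) => (subject, mark + 5) }

// mapValues(): the function sees only the value; keys pass through untouched.
// (.toMap materializes the lazy MapView returned in Scala 2.13+.)
val viaMapValues = marks.mapValues(mark => mark + 5).toMap

println(viaMap)       // Map(maths -> 55, english -> 70)
println(viaMapValues) // Map(maths -> 55, english -> 70)
```

Both produce the same result here; mapValues is simply the more direct spelling when the keys are untouched.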
Note
Operations like map() always produce a new RDD that does not retain the parent RDD's partitioning information, since map() may change the keys. mapValues() cannot change the keys, so it preserves the parent's partitioner.
In this example we use mapValues() along with reduceByKey() to calculate the average mark for each subject.
scala> val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[29] at parallelize at <console>:21

scala> val mapped = inputrdd.mapValues(mark => (mark, 1))
mapped: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[30] at mapValues at <console>:23

scala> val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[31] at reduceByKey at <console>:25

scala> val average = reduced.map { x =>
     |   val temp = x._2
     |   val total = temp._1
     |   val count = temp._2
     |   (x._1, total / count)
     | }
average: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at map at <console>:27

scala> average.collect()
res30: Array[(String, Int)] = Array((english,65), (maths,55))
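The same (sum, count) aggregation can be checked on plain Scala collections, with groupBy standing in for the shuffle that reduceByKey performs on a cluster. This is a local sketch of the logic, not Spark code:

```scala
val input = Seq(("maths", 50), ("maths", 60), ("english", 65))

// mapValues step: pair every mark with a count of 1
val mapped = input.map { case (subject, mark) => (subject, (mark, 1)) }

// reduceByKey step: per subject, add the running sums and the running counts,
// then divide to get the average
val averages = mapped
  .groupBy(_._1)
  .map { case (subject, rows) =>
    val (total, count) = rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    (subject, total / count)
  }

println(averages) // Map(maths -> 55, english -> 65)
```

Carrying the count alongside the sum is what makes the reduction associative, which is exactly why reduceByKey can apply it safely across partitions.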