mapValues() Example

When we use map() with a pair RDD, the function receives both the key and the value. Sometimes we are only interested in the value, not the key. In those cases, we can use mapValues() instead of map().
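As a small sketch of the difference, the snippet below applies the same change with both operations. It is meant to be pasted into spark-shell; the names marks, viaMap and viaMapValues and the "+ 5" adjustment are made up for illustration, and the getOrCreate call is only there so the snippet also works outside the shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mapValues-demo"))

// Hypothetical sample data: (subject, mark) pairs.
val marks = sc.parallelize(Seq(("maths", 50), ("english", 65)))

// map() receives the whole (key, value) tuple, so the key has to be rebuilt by hand.
val viaMap = marks.map { case (subject, mark) => (subject, mark + 5) }

// mapValues() receives only the value; the key is carried over unchanged.
val viaMapValues = marks.mapValues(mark => mark + 5)
```

Both RDDs hold the same pairs; the mapValues() version simply cannot touch the key.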

In this example we use mapValues() along with reduceByKey() to calculate the average mark for each subject.

scala> val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
inputrdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[29] at parallelize at <console>:21

scala> val mapped = inputrdd.mapValues(mark => (mark, 1));
mapped: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[30] at mapValues at <console>:23

scala> val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[31] at reduceByKey at <console>:25

scala> val average = reduced.map { x =>
     |                      val temp = x._2
     |                      val total = temp._1
     |                      val count = temp._2
     |                      (x._1, total / count)
     |                      }
average: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at map at <console>:27

scala> average.collect()
res30: Array[(String, Int)] = Array((english,65), (maths,55))
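One thing to keep in mind: total and count are both Int, so total / count is integer division and any fractional part is dropped. The sample marks happen to divide evenly. The following is a minimal sketch of a fractional average, assuming a slightly different data set (61 instead of 60) and the made-up names marksRdd and exactAverage. It also uses mapValues() for the last step, since the key is not needed there.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("average-demo"))

// Hypothetical data: maths totals 111 over 2 marks, which does not divide evenly.
val marksRdd = sc.parallelize(Seq(("maths", 50), ("maths", 61), ("english", 65)))

val exactAverage = marksRdd
  .mapValues(mark => (mark, 1))                                    // (mark, count of 1)
  .reduceByKey { case ((t1, c1), (t2, c2)) => (t1 + t2, c1 + c2) } // (total, count) per subject
  .mapValues { case (total, count) => total.toDouble / count }     // convert before dividing
```

With integer division the maths average would come out as 55; converting the total to Double first gives 55.5.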

Note 

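Operations like map() always cause the new RDD to not retain the parent's partitioning information, because the function passed to map() could change the key of each element. mapValues() cannot change the keys, so its result keeps the parent's partitioner.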
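This can be checked by looking at the partitioner field of each RDD. The sketch below is for spark-shell and assumes a HashPartitioner with 2 partitions; the names partitioned, afterMap and afterMapValues are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuses the shell's context if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("partitioner-demo"))

// Give the pair RDD an explicit partitioner (2 partitions is an arbitrary choice).
val partitioned = sc
  .parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
  .partitionBy(new HashPartitioner(2))

// map() may change keys, so Spark drops the partitioner on the result.
val afterMap = partitioned.map { case (subject, mark) => (subject, mark + 1) }

// mapValues() leaves keys alone, so the partitioner is carried over.
val afterMapValues = partitioned.mapValues(_ + 1)

println(afterMap.partitioner.isDefined)       // false
println(afterMapValues.partitioner.isDefined) // true
```

Keeping the partitioner matters when a later key-based operation, such as reduceByKey() or join(), can reuse it instead of shuffling the data again.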

Reference

Learning Spark : Partitioning : 64

11 comments:

  3. The average can also be calculated with mapValues():

    val average = reduced.mapValues(x => x._1 / x._2)

    where x._1 is the total marks and x._2 is the number of marks.
  6. How does average.collect() know the keys, such as maths or english, when the last transformation performed was reduced.map? Can you clarify that part, please?