Using parallelize() method
A pair RDD can be created by passing a sequence of tuples to the parallelize() method:
scala> val inputrdd = sc.parallelize(Seq(("key1", "val1"), ("key1", "val1")))
inputrdd: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[3] at parallelize at <console>:21

scala> inputrdd.reduceByKey((x, y) => x.concat(y)).collect()
res0: Array[(String, String)] = Array((key1,val1val1))
Note:
reduceByKey() is not defined on RDD itself; it is available in the class PairRDDFunctions, which Spark applies to pair RDDs through an implicit conversion.
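The per-key reduction that reduceByKey() performs can be illustrated outside Spark: it has the same semantics as grouping the pairs by key and reducing each group's values. A minimal plain-Scala sketch (no SparkContext involved; myReduceByKey is a hypothetical helper name, not part of the Spark API):

```scala
// A plain-Scala analogue of reduceByKey: group the pairs by key, then
// reduce the values of each group with the supplied function.
// myReduceByKey is a hypothetical helper, not part of the Spark API.
def myReduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

val result = myReduceByKey(Seq(("key1", "val1"), ("key1", "val1")))((x, y) => x.concat(y))
println(result)  // Map(key1 -> val1val1)
```

This mirrors the transcript above: both values for "key1" are concatenated into "val1val1".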
Using map() method
A pair RDD can also be created by using the map() method to produce key/value pairs:
scala> val inputrdd = sc.parallelize(List("key1 1", "key1 3"))
inputrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[15] at parallelize at <console>:21

scala> val mapped = inputrdd.map { x =>
     |   val splitted = x.split(" ")
     |   (splitted(0), splitted(1).toInt)
     | }
mapped: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:23

scala> val reduced = mapped.reduceByKey((x, y) => x + y)
reduced: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:25

scala> reduced.collect()
res24: Array[(String, Int)] = Array((key1,4))
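As a cross-check of the arithmetic in the transcript, the same parse-and-sum pipeline can be expressed with plain Scala collections (again with no SparkContext; this only mirrors the transcript's semantics, not Spark's distributed execution):

```scala
// Parse "key value" lines into (String, Int) pairs, then sum per key,
// mirroring the map + reduceByKey pipeline in the transcript above.
val lines = List("key1 1", "key1 3")

val mapped = lines.map { x =>
  val splitted = x.split(" ")
  (splitted(0), splitted(1).toInt)
}

val reduced = mapped.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
println(reduced)  // Map(key1 -> 4)
```

Both values for "key1" (1 and 3) are summed, matching the Array((key1,4)) result from collect().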