Using parallelize() method
A Pair RDD can be created by passing a sequence of tuples to the parallelize() method:
scala> val inputrdd = sc.parallelize(Seq(("key1", "val1"), ("key1", "val1")))
inputrdd: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[3] at parallelize at <console>:21
scala> inputrdd.reduceByKey((x, y) => x.concat(y)).collect()
res0: Array[(String, String)] = Array((key1,val1val1))
Note :
reduceByKey() is not defined on RDD itself; it becomes available through an implicit conversion to PairRDDFunctions when the RDD contains key/value pairs.
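The per-key merging that reduceByKey() performs can be illustrated with plain Scala collections. This is only a sketch of the semantics (Spark performs the same reduction distributed across partitions, with a shuffle), not Spark's actual implementation:

```scala
// Sketch of reduceByKey semantics using plain Scala collections (no Spark).
val pairs = Seq(("key1", "val1"), ("key1", "val1"))

// Group values by key, then reduce each group's values with the supplied function.
val reduced = pairs
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).reduce(_ concat _)) }

println(reduced) // Map(key1 -> val1val1)
```

Because both tuples share the key "key1", their values are concatenated into a single entry, matching the collect() result above.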
Using map() method
A Pair RDD can also be created by using the map() method to produce key/value pairs:
scala> val inputrdd = sc.parallelize(List("key1 1", "key1 3"))
inputrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[15] at parallelize at <console>:21
scala> val mapped = inputrdd.map { x =>
| val splitted = x.split(" ")
| (splitted(0), splitted(1).toInt)
| }
mapped: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[16] at map at <console>:23
scala> val reduced = mapped.reduceByKey((x, y) => x + y)
reduced: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[17] at reduceByKey at <console>:25
scala> reduced.collect()
res24: Array[(String, Int)] = Array((key1,4))
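The same split-then-sum pipeline can be checked with plain Scala collections. Again this is a sketch of the semantics only, without Spark's partitioning or shuffle:

```scala
// Plain-Scala sketch of the map + reduceByKey pipeline above (no Spark).
val input = List("key1 1", "key1 3")

// Split each line into a (key, numeric value) pair.
val mapped = input.map { x =>
  val parts = x.split(" ")
  (parts(0), parts(1).toInt)
}

// Sum the values for each key.
val reduced = mapped
  .groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).sum) }

println(reduced) // Map(key1 -> 4)
```

The two values 1 and 3 under "key1" are summed to 4, matching res24 above.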