28 May 2016

A Word Count example using 'spark-shell'

[raj@Rajkumars-MacBook-Pro ~]$ spark-shell --master local[*]
2016-05-28 15:37:24.325 java[3907:6309927] Unable to load realm info from SCDynamicStore
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val lines = sc.parallelize(List("This is a word", "This is another word"), 7)
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> // Number of partitions
scala> lines.partitions.size
res0: Int = 7

scala> // flatMap() : one of many transformations
scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at flatMap at <console>:29

scala> val units = words.map(word => (word, 1))
units: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[2] at map at <console>:31

scala> val counts = units.reduceByKey((x, y) => x + y)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[3] at reduceByKey at <console>:33

scala> counts.toDebugString
res1: String =
(7) ShuffledRDD[3] at reduceByKey at <console>:33 []
 +-(7) MapPartitionsRDD[2] at map at <console>:31 []
    |  MapPartitionsRDD[1] at flatMap at <console>:29 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:27 []

scala> // collect() : one of many actions
scala> counts.collect()
res2: Array[(String, Int)] = Array((This,2), (is,2), (another,1), (a,1), (word,2))
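The same pipeline can be sketched with plain Scala collections, with no Spark needed, since flatMap and map have direct analogues and a groupBy plus per-key sum stands in for reduceByKey. This is only an illustrative single-machine analogue (WordCountSketch is a made-up name, not part of Spark), but it shows the shape of each step:

```scala
// Word count over a local Scala collection, mirroring the RDD pipeline above.
// groupBy + a per-key sum plays the role of Spark's reduceByKey on one machine.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = List("This is a word", "This is another word")

    val counts = lines
      .flatMap(line => line.split(" "))      // like RDD flatMap: lines -> words
      .map(word => (word, 1))                // like RDD map: word -> (word, 1)
      .groupBy { case (word, _) => word }    // stands in for the shuffle
      .map { case (word, pairs) =>           // like reduceByKey((x, y) => x + y)
        (word, pairs.map(_._2).sum)
      }

    counts.toList.sortBy(_._1).foreach(println)
  }
}
```

The key difference is that on an RDD, reduceByKey combines values within each partition before shuffling, whereas a local groupBy materializes every (word, 1) pair first; the results are the same, but Spark's version moves far less data across the network.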