18 June 2016

Broadcast Variable Example

  1. scala> // Sending a value from Driver to Worker Nodes without
  2.  
  3. scala> // using Broadcast variable
  4.  
  5. scala> val input = sc.parallelize(List(1, 2, 3))
  6. input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:27
  7.  
  8. scala> val localVal = 2
  9. localVal: Int = 2
  10.  
  11. scala> val added = input.map( x => x + localVal)
  12. added: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[18] at map at <console>:31
  13.  
  14. scala> added.foreach(println)
  15. 4
  16. 3
  17. 5
  18.  
  19. scala> //** Local variable is once again transferred to worked nodes
  20.  
  21. scala> // for the next operation
  22.  
  23. scala> val multiplied = input.map( x => x * 2)
  24. multiplied: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at map at <console>:29
  25.  
  26. scala> multiplied.foreach(println)
  27. 4
  28. 6
  29. 2
  30.  
  1. scala> // Sending a read-only value using Broadcast variable
  2.  
  3. scala> // Can be used to send large read-only values to all worker
  4.  
  5. scala> // nodes efficiently
  6.  
  7. scala> val broadcastVar = sc.broadcast(2)
  8. broadcastVar: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(14)
  9.  
  10. scala> val added = input.map(x => broadcastVar.value + x)
  11. added: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[20] at map at <console>:31
  12.  
  13. scala> added.foreach(println)
  14. 5
  15. 3
  16. 4
  17.  
  18. scala> val multiplied = input.map(x => broadcastVar.value * x)
  19. multiplied: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[21] at map at <console>:31
  20.  
  21. scala> multiplied.foreach(println)
  22. 6
  23. 4
  24. 2
  25.  
  26. scala>
  27.