24 November 2015

takeSample() Example

takeSample() is an action that returns a fixed-size random sample of an RDD's elements as an array.

Syntax
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]
Return a fixed-size sampled subset of this RDD in an array
withReplacement whether sampling is done with replacement
num             size of the returned sample
seed            seed for the random number generator
returns         sample of specified size in an array
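
The effect of withReplacement and num can be seen in a small sketch. It assumes Spark is on the classpath and runs in local mode; the app name and the variable names (numbers, withoutRepl, withRepl) are illustrative and not part of the API. Inside spark-shell, SparkContext.getOrCreate simply returns the existing context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with two threads. In spark-shell this returns the existing context.
val conf = new SparkConf().setAppName("takeSample-params").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(Seq(10, 4, 5, 3, 11, 2, 6))

// Without replacement: each element can be picked at most once, so asking for
// more elements than the RDD holds yields all 7 elements in random order.
val withoutRepl = numbers.takeSample(withReplacement = false, num = 100, seed = 1L)

// With replacement: the same element may be picked several times, so the
// result can be longer than the RDD itself.
val withRepl = numbers.takeSample(withReplacement = true, num = 20, seed = 1L)

println(s"without replacement: ${withoutRepl.length} elements")
println(s"with replacement:    ${withRepl.length} elements")
```

Since the whole sample is returned to the driver as an array, takeSample() is best kept to samples that fit comfortably in driver memory.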

Example
scala> val inputrdd = sc.parallelize(Seq(10, 4, 5, 3, 11, 2, 6))
inputrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at :47

scala> inputrdd.takeSample(false, 3, System.nanoTime.toInt)
res29: Array[Int] = Array(6, 11, 10)

scala> inputrdd.takeSample(false, 3, System.nanoTime.toInt)
res30: Array[Int] = Array(5, 11, 4)

scala> inputrdd.takeSample(true, 3, System.nanoTime.toInt)
res31: Array[Int] = Array(10, 11, 5)
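
The calls above seed the generator from the clock, so each call returns a different sample. When a repeatable sample is wanted, for instance in a test, a fixed seed can be passed instead. A minimal sketch, assuming a local-mode context; the seed value 42L and the names first and second are arbitrary choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode. In spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("takeSample-seed").setMaster("local[2]"))

val inputrdd = sc.parallelize(Seq(10, 4, 5, 3, 11, 2, 6))

// Two calls with identical arguments, including the seed.
val first  = inputrdd.takeSample(false, 3, 42L)
val second = inputrdd.takeSample(false, 3, 42L)

// Same RDD, same partitioning, same seed: the two arrays are expected to match.
println(first.sameElements(second))
```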

Reference
Learning Spark, p. 41
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD