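takeSample() is an action that returns a fixed-size random sample of an RDD's elements as an array. Because the whole sample is loaded into the driver's memory, it should only be used when the requested sample is small.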
Syntax
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T]

Return a fixed-size sampled subset of this RDD in an array.

withReplacement: whether sampling is done with replacement
num: size of the returned sample
seed: seed for the random number generator
returns: sample of specified size in an array
Example
scala> val inputrdd = sc.parallelize{ Seq(10, 4, 5, 3, 11, 2, 6) }
inputrdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <console>:47

scala> inputrdd.takeSample(false, 3, System.nanoTime.toInt)
res29: Array[Int] = Array(6, 11, 10)

scala> inputrdd.takeSample(false, 3, System.nanoTime.toInt)
res30: Array[Int] = Array(5, 11, 4)

scala> inputrdd.takeSample(true, 3, System.nanoTime.toInt)
res31: Array[Int] = Array(10, 11, 5)
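The session above uses a time-based seed, so every call returns a different sample. The following is a minimal sketch, not part of the original example, of how the seed and withReplacement arguments affect the result. It assumes a standalone application running Spark in local mode; the object name TakeSampleSketch, the method demo, the app name, and the seed value 42 are all hypothetical choices made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone sketch; in spark-shell, `sc` already exists
// and the SparkContext setup below is unnecessary.
object TakeSampleSketch {

  // Returns (samples with the same seed match, capped size, size with replacement).
  def demo(): (Boolean, Int, Int) = {
    val conf = new SparkConf().setAppName("takeSample-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(10, 4, 5, 3, 11, 2, 6))

      // A fixed seed on the same RDD (same partitioning) should give
      // the same sample on repeated calls.
      val first = rdd.takeSample(withReplacement = false, num = 3, seed = 42L)
      val second = rdd.takeSample(withReplacement = false, num = 3, seed = 42L)

      // Without replacement, the sample cannot be larger than the RDD,
      // so asking for 20 of 7 elements returns all 7.
      val capped = rdd.takeSample(withReplacement = false, num = 20, seed = 42L)

      // With replacement, elements may repeat, so num may exceed the RDD size.
      val repeated = rdd.takeSample(withReplacement = true, num = 20, seed = 42L)

      (first.sameElements(second), capped.length, repeated.length)
    } finally {
      sc.stop() // always release the local Spark context
    }
  }

  def main(args: Array[String]): Unit = {
    val (sameSample, cappedSize, repeatedSize) = demo()
    println(s"same seed gives same sample: $sameSample")
    println(s"without replacement, num = 20: $cappedSize elements")
    println(s"with replacement, num = 20: $repeatedSize elements")
  }
}
```

Omitting the seed argument falls back to the random default shown in the signature, which is usually what you want for exploratory work. Passing an explicit seed is mainly useful when a sample needs to be reproducible, for example in tests.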
Reference
Learning Spark : 41
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD