14 November 2015

filter(), collect() & more...

In this post we demonstrate the following methods:
  1. Transformation
    • filter()
  2. Actions
    • take()
    • collect()
    • saveAsTextFile()
Transformations such as filter() are lazy: they only define a new RDD, and no computation happens until an action such as collect() runs, as the shell session below shows.
scala> val inputRDD = sc.parallelize(1 to 10)
inputRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> //Filter out the number 5 (filter keeps the elements for which the predicate is true)
scala> val filteredRDD = inputRDD.filter(x => x != 5)
filteredRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at filter at <console>:23
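
The same predicate can be written with Scala's placeholder syntax; an equivalent one-liner (output omitted, it mirrors the line above):

scala> val filteredRDD2 = inputRDD.filter(_ != 5)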

scala> //Collect all the records
scala> //Warning : collect() pulls the entire RDD into the
scala> //          driver's (local machine) memory
scala> filteredRDD.collect()
res15: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9, 10)

scala> //Take only the first 2 records from the RDD into local machine memory
scala> filteredRDD.take(2)
res16: Array[Int] = Array(1, 2)
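
When the data is too large to bring to the driver at all, an action such as count() does the work on the cluster and returns only a single number; a quick sketch (the res numbering will vary with your session):

scala> //Count the records without collecting them
scala> filteredRDD.count()
res17: Long = 9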

scala> //Save the result to a text file
scala> //(Spark writes './result' as a directory of part files)
scala> filteredRDD.saveAsTextFile("./result")

$ cat result/*
1
2
3
4
6
7
8
9
10
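
The same pipeline can also run as a standalone application rather than in the shell. Below is a minimal sketch; the object name FilterDemo, the local[*] master, and the ./result-app output path are illustrative choices, not part of the original session:

import org.apache.spark.{SparkConf, SparkContext}

object FilterDemo {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; use a real master URL on a cluster
    val conf = new SparkConf().setAppName("FilterDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val inputRDD = sc.parallelize(1 to 10)
    val filteredRDD = inputRDD.filter(x => x != 5)  // keep everything except 5

    println(filteredRDD.collect().mkString(", "))   // all remaining records
    println(filteredRDD.take(2).mkString(", "))     // just the first two

    filteredRDD.saveAsTextFile("./result-app")      // writes a directory of part files
    sc.stop()
  }
}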