In this post we demonstrate the following methods:
- Transformations
  - filter()
- Actions
  - take()
  - collect()
  - saveAsTextFile()
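Transformations such as filter() are lazy in Spark: they only describe a computation, and nothing runs until an action such as collect() or take() is invoked. A minimal sketch of that distinction, using a plain Scala Iterator as a stand-in for an RDD (no Spark required; the counter is purely illustrative):

```scala
object LazyVsEager {
  def main(args: Array[String]): Unit = {
    var evaluated = 0
    // "Transformation": nothing is computed yet, just like RDD.filter
    val filtered = Iterator.range(1, 11)
      .map { x => evaluated += 1; x }
      .filter(_ != 5)
    println(s"after filter, elements evaluated: $evaluated")  // 0

    // "Action": forces evaluation, just like RDD.collect
    val result = filtered.toList
    println(s"after collect, elements evaluated: $evaluated") // 10
    println(result) // List(1, 2, 3, 4, 6, 7, 8, 9, 10)
  }
}
```

The same pipeline shape appears in the REPL session below, only distributed across partitions instead of a single local iterator.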
scala> val inputRDD = sc.parallelize(1 to 10)
inputRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21

scala> // Filter out the number '5'
scala> val filteredRDD = inputRDD.filter(x => x != 5)
filteredRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at filter at <console>:23

scala> // Collect all the records
scala> // Warning: this collects the entire RDD
scala> // into local machine memory
scala> filteredRDD.collect()
res15: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9, 10)

scala> // Collect only 2 records from the RDD into local machine memory
scala> filteredRDD.take(2)
res16: Array[Int] = Array(1, 2)

scala> // Save the result to a text file
scala> filteredRDD.saveAsTextFile("./result")
cat result/*
1
2
3
4
6
7
8
9
10
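The REPL session can also be sketched as a standalone application. This is a sketch, not the original post's code: it assumes spark-core is on the classpath, and the app name and the `local[*]` master URL are illustrative choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FilterExample {
  def main(args: Array[String]): Unit = {
    // App name and master are illustrative; in a real deployment the
    // master is usually supplied via spark-submit instead.
    val conf = new SparkConf().setAppName("FilterExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)
    try {
      val inputRDD    = sc.parallelize(1 to 10)
      val filteredRDD = inputRDD.filter(x => x != 5)

      println(filteredRDD.collect().mkString(", ")) // 1, 2, 3, 4, 6, 7, 8, 9, 10
      println(filteredRDD.take(2).mkString(", "))   // 1, 2

      filteredRDD.saveAsTextFile("./result")
    } finally {
      sc.stop()
    }
  }
}
```

Note that saveAsTextFile() writes one part-file per partition under ./result, which is why the output above was read back with `cat result/*` rather than a single file.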