RDDs are Recomputed
By default, an RDD is recomputed each time an action is run on it. For example,
scala> val lines = sc.textFile("words.txt")
...
scala> lines.first()
res4: String = line1 word1
scala> lines.count()
res5: Long = 4
Here the call to the action first() computes the RDD 'lines'. When we run another action, count(), on the same RDD, the RDD is recomputed from scratch.
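To make the cost concrete, here is a sketch (assuming a running spark-shell with sc available; "big.txt" is a hypothetical large input file) where an expensive transformation chain is re-executed once per action:

```scala
// Assumes a spark-shell session; "big.txt" is a hypothetical input file.
val words = sc.textFile("big.txt")
              .flatMap(line => line.split(" "))   // expensive for large files
val longWords = words.filter(w => w.length > 5)

longWords.count()   // reads and splits the whole file
longWords.first()   // reads and splits the whole file again
```

Every action replays the full lineage (textFile, flatMap, filter), which is exactly what persisting avoids.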
Persisting RDDs
We use the persist() method to persist an RDD. In Scala and Java, persist() by default stores the data in the JVM heap as unserialized objects. In Python, persist() serializes the data before storing it. Storing in a memory/disk combination is also possible.
scala> val lines = sc.textFile("words.txt")
...
...
scala> import org.apache.spark.storage.StorageLevel
scala> lines.persist(StorageLevel.MEMORY_ONLY) //cache() is shorthand for persist() with the MEMORY_ONLY storage level
...
scala> lines.count() (1)
...
The actual persistence takes place during the first (1) action call on the RDD. Spark provides multiple storage options (memory/disk) to persist the data, as well as replication levels. More information can be found in the Spark RDD persistence documentation.
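As a sketch of those options (assuming a spark-shell session, with StorageLevel imported from org.apache.spark.storage), some commonly available storage levels are:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("words.txt")

// Pick one level per RDD; call unpersist() before switching to another level.
lines.persist(StorageLevel.MEMORY_ONLY)       // unserialized objects in the JVM heap
// Other levels include:
//   StorageLevel.MEMORY_ONLY_SER   - serialized objects in memory (smaller, more CPU)
//   StorageLevel.MEMORY_AND_DISK   - spill partitions that don't fit in memory to disk
//   StorageLevel.DISK_ONLY         - store partitions only on disk
//   StorageLevel.MEMORY_ONLY_2     - replicate each partition on two nodes
```

Serialized levels trade CPU time for memory footprint; the _2 suffix adds replication for faster fault recovery.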
We use unpersist() to remove an RDD from the cache. When the cached data exceeds memory capacity, Spark automatically evicts old partitions (they will be recomputed when needed). This is called the Least Recently Used (LRU) eviction policy.
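Putting the pieces together, a sketch of the full cache/uncache cycle (assuming a spark-shell session with a words.txt file available):

```scala
val lines = sc.textFile("words.txt")
lines.cache()        // same as persist(StorageLevel.MEMORY_ONLY)
lines.count()        // first action: computes the RDD and caches its partitions
lines.count()        // served from the cache, no recomputation
lines.unpersist()    // manually remove the cached partitions
```

Calling unpersist() explicitly frees memory immediately instead of waiting for LRU eviction, which is useful when the RDD is no longer needed.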