31 May 2015

RDD: Lazy Evaluation & Lineage Graph

Lazy Evaluation


Lazy evaluation helps Spark optimize disk and memory usage. Consider this example:

mountain@mountain:~/sbook$ cat words.txt 
line1 word1
line2 word2 word1  
line3 word3 word4
line4 word1

scala> val lines = sc.textFile("words.txt") //Transformation(1)
...
scala> val filtered = lines.filter(line => line.contains("word1"))
...
scala> filtered.first() //Action(2)
res0: String = line1 word1

Reading the code above, we might expect the file 'words.txt' to be read when transformation (1) is executed. That is not what happens in Spark. The file is read only when action (2) is executed. The benefit of lazy evaluation here is that Spark can stop reading as soon as it finds the first line containing "word1" (in this file, the very first line) instead of reading the whole file, and it never needs to hold the complete file content in memory.

Thus, transformations in Spark are lazily evaluated: Spark does not evaluate a transformation until it encounters an action.
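
One way to observe this laziness is to point the same pipeline at a file that does not exist. The snippet below is a minimal sketch, assuming it is pasted into spark-shell (where `sc` is already defined) and that "no-such-file.txt" is a hypothetical path with no file behind it. The two transformations should be accepted without complaint, and the failure should surface only when the action runs.

```scala
import scala.util.Try

// Assumption: running inside spark-shell, so `sc` (the SparkContext) already exists.
// Assumption: "no-such-file.txt" is a hypothetical path; no such file is present.
val missing = sc.textFile("no-such-file.txt")                  // transformation: nothing is read yet
val missingFiltered = missing.filter(line => line.contains("word1")) // transformation: still nothing is read

// Only the action forces Spark to look for the input, so only the action can fail.
val attempt = Try(missingFiltered.first())

println(s"Action failed: ${attempt.isFailure}")
```

If transformations were evaluated eagerly, the error would appear on the `textFile` line. Here it is expected to be deferred until `first()` is called, typically as an "input path does not exist" error.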

Lineage Graph


When we create new RDDs from existing RDDs, Spark tracks these dependencies in a lineage graph.
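
The lineage of an RDD can be inspected from the shell. The following is a small sketch, again assuming spark-shell (with `sc` predefined) and the 'words.txt' file from the earlier example in the working directory. It uses the RDD methods `dependencies` and `toDebugString`; the exact text printed by `toDebugString` varies between Spark versions, so no output is shown.

```scala
// Assumption: running inside spark-shell, so `sc` (the SparkContext) already exists.
// Assumption: 'words.txt' from the earlier example is in the working directory.
val lines = sc.textFile("words.txt")
val filtered = lines.filter(line => line.contains("word1"))

// Each RDD records the parent RDD(s) it was derived from.
val parent = filtered.dependencies.head.rdd
println(s"filtered is derived from lines: ${parent eq lines}")

// toDebugString describes the chain of RDDs that `filtered` depends on.
println(filtered.toDebugString)
```

Here `filtered` depends on `lines`, which in turn depends on the RDD that reads the file, so the description printed for `filtered` should list the whole chain back to 'words.txt'.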