29 May 2015

Line Count using Scala Shell

In this example, we will count the number of lines in a text file. First, create a text file with some sample content:
mountain@mountain:~$ cat data.txt
line 1
line 2

Open the Scala shell and execute the following statements:
mountain@mountain:~$ spark-shell
...
Spark context available as sc (1)
scala> val lines = sc.textFile("data.txt") (2)
...
lines: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[1] at textFile at <console>:21
scala> lines.count() (3)
...
res0: Long = 2
scala> lines.first()
...
res1: String = line 1

You can exit the shell using:
scala> exit

Controlling the Logs


You must have realized by now that the Scala shell prints an overwhelming amount of logs, which you might not need all the time. The amount of logging printed on the shell can be controlled by setting the property log4j.rootCategory in the file log4j.properties.

In $SPARK_HOME/conf,
mountain@mountain:~/sk/conf$ cp log4j.properties.template log4j.properties
mountain@mountain:~/sk/conf$ vi log4j.properties
log4j.rootCategory=WARN, console
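
If you prefer not to edit the file, newer Spark releases (1.4 onwards) also let you change the level for the current shell session programmatically:
scala> sc.setLogLevel("WARN")   // applies only to the current SparkContext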

Setting the log level to WARN will reduce the amount of logs printed on the shell. Let us now dissect the output from our simple program and grasp a few important concepts of Spark.

Driver


A Driver Program is akin to the main() method in programming languages like C, C++ and Java. The Driver Program, which is part of a Spark Application, launches the application on the Spark cluster. In our case, the Scala shell acts as the Driver Program.
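
Outside the shell, the same role is played by your own main() method. Below is a minimal sketch of such a standalone Driver Program; the object name LineCount and the local master URL are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone application: main() acts as the Driver Program
object LineCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark locally; on a real cluster you would pass its master URL instead
    val conf = new SparkConf().setAppName("LineCount").setMaster("local[*]")
    val sc = new SparkContext(conf)       // our entry point to the cluster
    val lines = sc.textFile("data.txt")   // create an RDD from the file
    println("Number of lines: " + lines.count())
    sc.stop()                             // shut the application down cleanly
  }
}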

Spark Context (sc)


In Spark, we access the cluster through an object of type SparkContext. The Scala shell by default provides the variable sc (1), which is of type SparkContext.
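
Reading files is not the only thing sc can do. As a quick illustrative sketch, it can also distribute a local collection into an RDD:

scala> val nums = sc.parallelize(1 to 5)   // distribute a local collection across the cluster
scala> nums.count()                        // returns 5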

Resilient Distributed Datasets (RDD)


An RDD is the basic unit of data in Spark, upon which all operations are performed. In the example above, lines (2) is an RDD, and count() (3) is one of the operations we have performed on it.

An RDD operation can be one of two types: an action or a transformation.

An action returns a result to the Driver Program or writes it to storage. An action normally triggers a computation to produce its result and always returns some data type other than an RDD.

A transformation returns a pointer to a new RDD. Transformations are lazy: nothing is actually computed until an action is called, as the sketch below illustrates.
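
To make the distinction concrete, here is a small sketch using the lines RDD from the example above; the predicate passed to filter is arbitrary.

scala> val matches = lines.filter(line => line.contains("line"))  // transformation: lazy, returns a new RDD
scala> matches.count()                                            // action: triggers the computation, returns a Long
scala> matches.collect()                                          // action: returns the matching lines to the driver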

Check the links below for some of the common actions and transformations:

https://spark.apache.org/docs/latest/programming-guide.html#actions
https://spark.apache.org/docs/latest/programming-guide.html#transformations

In a typical cluster environment, an operation is normally parallelized across multiple nodes.
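
You can get a feel for this from the shell: an RDD is split into partitions, and each partition can be processed on a different node. The second argument to textFile below, a suggested minimum number of partitions, is an assumption for illustration.

scala> val lines4 = sc.textFile("data.txt", 4)  // hint: split the file into at least 4 partitions
scala> lines4.partitions.length                 // how many partitions Spark actually created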