In this example, we will count the number of lines in a Text File. Create a text file with random Content
mountain@mountain:~$ cat data.txt
line 1
line 2
Open Scala Shell and execute the following Statements
mountain@mountain:~$ spark-shell
...
Spark context available as sc (1)
scala> val lines = sc.textFile("data.txt") (2)
...
lines: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[1] at textFile at <console>:21
scala> lines.count() (3)
scala> lines.first()
...
res1: String = line 1
You can exit from shell using,
scala> exit
Controlling the Logs
You must have realized by now, the Scala Shell prints overwhelming amount of logs which you might not need all the time. The amount of logs printed on the Shell can be controlled by setting the property log4j.rootCategory in file log4j.properties
In $SPARK_HOME/conf,
mountain@mountain:~/sk/conf$ cp log4j.properties.template log4j.properties
mountain@mountain:~/sk/conf$ vi log4j.properties
log4j.rootCategory=WARN, console
Setting the log level to WARN will reduce the amount of logs printed on the Shell. Let us dissect the output from our simple program and grasp few important concepts of Spark
Driver
A Driver Program is akin to main() method in programming languages like C, C++ & java. The Driver Program which is part of a Spark Application launches the Application into Spark Cluster. In our case, The Scala Shell acts as a Driver Program
Spark Context (sc)
In Spark, we access the cluster through object of type
SparkContext. The Scala shell by default provides the variable
sc(1) which is of type
SparkContext
Resilient Distributed Datasets(RDD)
RDD is the basic unit of Data in Spark upon which all Operations are done. In the example above, lines(2) is an RDD. count()(3) is one of the Operation we have performed on the RDD