In this example, we will count the number of lines in a text file. Create a text file with some random content:
mountain@mountain:~$ cat data.txt
line 1
line 2
Open the Scala shell and execute the following statements:
mountain@mountain:~$ spark-shell
...
Spark context available as sc (1)
scala> val lines = sc.textFile("data.txt") (2)
...
lines: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[1] at textFile at <console>:21
scala> lines.count() (3)
...
res0: Long = 2
scala> lines.first()
...
res1: String = line 1
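For comparison, here is a minimal sketch of the same line count in plain Scala, without Spark, using the standard library's scala.io.Source. On a single machine both give the same answer; Spark earns its keep when the data no longer fits on one node.

```scala
import scala.io.Source

object LineCount {
  def main(args: Array[String]): Unit = {
    // Read data.txt (the file created above) with the standard library.
    val source = Source.fromFile("data.txt")
    try {
      val lines = source.getLines().toList
      println(lines.size)                      // number of lines, like lines.count()
      println(lines.headOption.getOrElse(""))  // first line, like lines.first()
    } finally source.close()
  }
}
```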
You can exit from the shell using:
scala> exit
Controlling the Logs
You must have noticed by now that the Scala shell prints an overwhelming amount of logs, which you might not need all the time. The amount of logging printed to the shell can be controlled by setting the property log4j.rootCategory in the file log4j.properties.
In $SPARK_HOME/conf,
mountain@mountain:~/sk/conf$ cp log4j.properties.template log4j.properties
mountain@mountain:~/sk/conf$ vi log4j.properties
log4j.rootCategory=WARN, console
Setting the log level to WARN reduces the amount of logging printed to the shell. Let us now dissect the output from our simple program and grasp a few important concepts of Spark.
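If you prefer not to edit log4j.properties, Spark (since version 1.4) also lets you change the verbosity from within the running shell via SparkContext.setLogLevel:

```
scala> sc.setLogLevel("WARN")
```

This affects only the current session, so it is handy for quick experiments.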
Driver
A driver program is akin to the main() method in programming languages like C, C++, and Java. The driver program, which is part of a Spark application, launches the application on the Spark cluster. In our case, the Scala shell acts as the driver program.
Spark Context (sc)
In Spark, we access the cluster through an object of type SparkContext. The Scala shell by default provides the variable sc (1), which is of type SparkContext.
Resilient Distributed Datasets (RDD)
An RDD is the basic unit of data in Spark, upon which all operations are performed. In the example above, lines (2) is an RDD, and count() (3) is one of the operations we performed on it.
An RDD operation is one of two types: an action or a transformation.
An action returns a result to the driver program or writes it to storage. An action normally triggers a computation to produce its result, and it always returns some data type other than an RDD.
A transformation returns a pointer to a new RDD.
Check the links below for some of the common actions and transformations:
https://spark.apache.org/docs/latest/programming-guide.html#actions
https://spark.apache.org/docs/latest/programming-guide.html#transformations
In a typical cluster environment, an operation is normally parallelized across multiple nodes.
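The split between lazy transformations and eager actions can be sketched with an analogy in plain Scala (no Spark involved): a lazy view plays the role of a transformation, returning a new collection without computing anything, while an aggregating call plays the role of an action, forcing the computation and returning an ordinary value.

```scala
object ActionVsTransformation {
  def main(args: Array[String]): Unit = {
    val lines = List("line 1", "line 2")

    // "Transformation": returns a new lazy collection; nothing is
    // computed at this point (just as RDD transformations only record
    // a lineage of work to be done).
    val lengths = lines.view.map(_.length)

    // "Action": forces the computation and returns a plain value
    // (an Int here, much like count() returning a Long).
    val total = lengths.sum
    println(total) // 12
  }
}
```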