Data Wrangling with Spark
Functional Programming
Under the hood, Spark is written in Scala, a functional programming language, although Spark can also be used with other languages such as Java, Python, and R.
Even though Python is not a functional programming language, when using PySpark one can experience the functional programming influence of Scala.
The PySpark API allows you to write programs in Spark and ensures that your code uses functional programming practices. Underneath the hood, the Python code uses py4j to make calls to the Java Virtual Machine (JVM).
Spark uses functional programming because it is well suited to distributed systems.
In distributed systems, one needs to be careful about how functions are designed. Whenever a function runs on some input data, it can alter that data in the process. Thus, one needs to write functions that preserve their inputs and avoid side effects. These are called pure functions. Writing pure functions allows Spark code to work well at the scale of big data.
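As a rough illustration in plain Python (the function names here are made up for the example), a pure function returns a new value and leaves its input alone, while an impure one alters its input as a side effect:

```python
# Pure function: returns a new list and leaves its input untouched
def double_all(values):
    return [v * 2 for v in values]

# Impure function: mutates its input in place (a side effect)
def double_all_in_place(values):
    for i in range(len(values)):
        values[i] *= 2
    return values

data = [1, 2, 3]
print(double_all(data))  # [2, 4, 6]
print(data)              # [1, 2, 3] -- the original data is preserved

print(double_all_in_place(data))  # [2, 4, 6]
print(data)                       # [2, 4, 6] -- the original data was altered
```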
Spark DAGs: Recipe for Data
Every Spark function makes a copy of its input data and never changes the original parent data. Because Spark doesn't change or mutate the input data, it's known as immutable.
In Spark, when there are multiple functions (usually the case) that each accomplish a small chunk of work, they are chained together. Also, a function is often composed of multiple sub-functions, and in order for the big function to be pure, each sub-function also has to be pure.
It would seem that Spark needs to make a copy of the input data for each sub-function. If this were the case, a Spark program would run out of memory pretty quickly.
Fortunately, Spark avoids this using a functional programming concept called lazy evaluation. Before Spark does anything with the program's data, it first builds step-by-step directions of what functions and data it will need. These directions are called a Directed Acyclic Graph (DAG).
Once Spark builds the DAG from the code, it procrastinates, waiting until the last possible moment to actually touch the data. In Spark, these multi-step combinations are called stages.
Maps & Lambda Functions
Maps simply make a copy of the original input data and transform that copy according to whatever function one puts inside the map.
Following code block gives an example of mapping a "log of songs" to lowercase:
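A minimal sketch of what such a code block could look like, assuming a short made-up list of song titles and Spark running in local mode:

```python
import findspark
findspark.init()

from pyspark import SparkContext

sc = SparkContext(appName="maps_and_lazy_evaluation_example")

# A small, made-up "log of songs"
log_of_songs = [
    "Despacito",
    "Nice for what",
    "No tears left to cry",
    "Despacito",
    "Havana",
    "In my feelings",
    "Nice for what",
]

# Parallelize the list to create an RDD
distributed_song_log = sc.parallelize(log_of_songs)

def convert_song_to_lowercase(song):
    return song.lower()

# Lazy: this only records the transformation in the DAG
lowercase_songs = distributed_song_log.map(convert_song_to_lowercase)

# The collect() action forces the map step to actually run
print(lowercase_songs.collect())

# The original RDD is unchanged -- Spark data is immutable
print(distributed_song_log.collect())
```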
Following are the details about the above code block:
- The `findspark` Python module makes it easier to use Spark in local mode on one's computer, which is convenient for practicing Spark syntax locally.
- After that, a SparkContext object is created. With the SparkContext, one can input a dataset and parallelize the data across a cluster. When using Spark in local mode on a single machine, though, the dataset isn't technically distributed.
- After that, a function is defined that converts a song title to lowercase.
- Then, the map step goes through each song in the list and applies the `convert_song_to_lowercase()` function. This line of code runs quite quickly, because Spark does not actually execute the map step until it needs to. The output of this step is an RDD, which stands for resilient distributed dataset. RDDs are exactly what they say they are: fault-tolerant datasets distributed across a cluster. This is how Spark stores data.
- Finally, to get Spark to actually run the map step, one needs to use an "action". One available action is the `collect()` method, which takes the results from all of the cluster's nodes and "collects" them into a single list on the master node.
- Note as well that Spark is not changing the original dataset: Spark is merely making a copy. One can see this by running `collect()` on the original dataset.
One does not always have to write a custom function for the map step. Anonymous (lambda) functions can be used, as well as built-in Python functions like string.lower().
Anonymous functions are actually a Python feature for writing functional-style programs. Following is an illustration of this:
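A short sketch, reusing the distributed_song_log RDD from the earlier example:

```python
# The same map step written with an anonymous (lambda) function
print(distributed_song_log.map(lambda song: song.lower()).collect())

# Or by passing Python's built-in str.lower directly
print(distributed_song_log.map(str.lower).collect())
```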
Data Formats
Before any data wrangling can start, one needs to load the data into Spark. The most common data formats are CSV, JSON, HTML, and XML.
Big data itself often needs to be stored in a distributed way as well. Distributed file systems and distributed databases store data in a fault tolerant way. So, if a machine breaks or becomes unavailable, the collected information is not lost.
Hadoop has a distributed file system, HDFS, to store data. HDFS splits files into 64 or 128 MB blocks and replicates these blocks across the cluster. This way, the data is stored in a fault-tolerant way and can be accessed in digestible chunks.
SparkSession
The first component of each Spark program is the SparkContext. The SparkContext is the main entry point for Spark functionality and connects the cluster with the application.
If one would like to use lower-level abstractions, then objects need to be created with the SparkContext. To create a SparkContext, one first needs a SparkConf object to specify some information about the application, such as its name and the master node's IP address. Following code block is an illustration:
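A rough sketch; the application name and master URL below are placeholders, not values from a real cluster:

```python
from pyspark import SparkConf, SparkContext

# Placeholder app name and master URL -- substitute real values
configure = SparkConf().setAppName("app_name").setMaster("spark://<master-node-ip>:7077")
sc = SparkContext(conf=configure)
```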
If one plans to run Spark in local mode, the following is how to set up the SparkContext:
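For example (again with a placeholder application name):

```python
from pyspark import SparkConf, SparkContext

# "local" runs Spark on a single machine; "local[*]" would use all available cores
configure = SparkConf().setAppName("app_name").setMaster("local")
sc = SparkContext(conf=configure)
```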
To read data frames, one needs to use the Spark SQL equivalent of the SparkContext: the SparkSession. Similarly to the SparkConf, one can specify some parameters when creating a SparkSession.
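A sketch of creating a SparkSession; the application name and the config key/value are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("app_name") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
```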
In the above code block, `getOrCreate()` is used: if a SparkSession is already running, instead of creating a new one, the existing session is returned and its parameters are updated with the new configuration.
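With a SparkSession in hand, data in the formats mentioned earlier can be loaded into data frames. A small sketch with hypothetical file paths:

```python
# Hypothetical file paths -- replace them with real ones
df_json = spark.read.json("path/to/data.json")
df_csv = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

df_json.printSchema()
df_json.show(5)
```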