Distributed Computing
History
Distributed systems, distributed programming, and distributed algorithms originally referred to computer networks in which individual computers were physically distributed within some geographical area and communicated with each other via message passing.
An early example is email on the ARPANET in the early 1970s.
A network topology is essentially a graph: each node has its own private memory and processor, and nodes communicate with one another by passing messages.
Another type of computing is parallel computing, which can be thought of as a particularly tightly coupled form of distributed computing. The main difference is that in parallel computing all processors may have access to shared memory to exchange information.
Thus, at a high level, distributed computing implies multiple CPUs, each with its own memory, while parallel computing uses multiple CPUs sharing the same memory.
Hadoop Ecosystem
While distributed computing and big data applications have existed for a long time, the big data boom is associated with the emergence of the Hadoop ecosystem.
The Hadoop project was founded in 2006, building on Google research published in the early 2000s. It is an ecosystem of tools for big data storage and data analysis.
The Hadoop Framework consists of 4 main components:
Hadoop Distributed File System (HDFS): Stores data on commodity machines, providing very high aggregate bandwidth across the cluster. More specifically, it is a big data storage system that splits data into chunks and stores those chunks across a cluster of computers.
Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing. In other words, it is a system for processing and analyzing large data sets in parallel.
Hadoop YARN: A resource manager that schedules the computational workloads of users' applications. The manager keeps track of which computing resources are available and assigns those resources to specific tasks.
Hadoop Common: Contains libraries and utilities needed by other Hadoop modules.
As Hadoop matured, other tools were developed to provide higher-level abstractions on top of this Java framework. These tools included:
Apache Pig: A high-level data-flow scripting language (Pig Latin) that runs on top of Hadoop MapReduce
Apache Hive: A SQL-like interface that runs on top of Hadoop MapReduce
Spark: Relation to Hadoop
The problem with the traditional MapReduce approach is that the results of each step are saved to disk. Since disk I/O is relatively slow, these jobs can take a long time.
This slowness was the motivation for starting the Spark project at UC Berkeley's AMPLab in 2009.
Spark is another big data framework like Hadoop. It contains libraries for data analysis, machine learning, graph analysis, and streaming live data.
Spark does not write intermediate results to disk; it can keep them in memory across multiple map and reduce steps, which makes it generally faster than Hadoop MapReduce.
Further, Spark does not include a file storage system. One can use Spark on top of HDFS, but it is not a requirement; Spark can also read data from other sources, such as Amazon S3.
Streaming Data
Data streaming is a specialized topic in big data. The use case is storing and analyzing data in real time, such as Facebook posts or tweets.
Spark has a streaming library called Spark Streaming, although it is not as popular or as fast as some other streaming libraries; popular alternatives include Apache Storm and Apache Flink.
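As a rough idea of how this looks in code, below is a minimal Spark Streaming (DStream API) sketch that counts words arriving on a local network socket. The host, port, batch interval, and application name are illustrative assumptions, not values from this text.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Minimal sketch: count words arriving on a local socket (e.g. started with `nc -lk 9999`).
sc = SparkContext(appName="streaming_sketch")
ssc = StreamingContext(sc, batchDuration=10)  # process the stream in 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
counts = (
    lines.flatMap(lambda line: line.split(" "))
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()  # print each micro-batch's counts to the console

ssc.start()
ssc.awaitTermination()
```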
MapReduce
MapReduce is a programming technique for manipulating large data sets. "Hadoop MapReduce" is a specific implementation of this programming technique.
The technique works in the following steps (a toy single-machine sketch follows the list):
Divide a large dataset and distribute it across a cluster.
In the map step, each data point is analyzed and converted into a (key, value) pair.
The key-value pairs are then shuffled across the cluster so that all pairs with the same key end up on the same machine.
In the reduce step, the values with the same key are combined together.
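To make these steps concrete, here is a toy, single-process Python sketch of the same idea. The shuffle is simulated by grouping pairs in a dictionary rather than moving data between machines, and the input records are made up for illustration.

```python
from collections import defaultdict

# Made-up input: each record is one logged event (a song title).
records = ["Despacito", "Havana", "Despacito"]

# Map step: turn each record into a (key, value) pair.
mapped = [(song, 1) for song in records]

# Shuffle step (simulated): group all values that share the same key.
# On a real cluster, this is where pairs move between machines.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce step: combine the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'Despacito': 2, 'Havana': 1}
```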
While Spark doesn't implement Hadoop MapReduce, one can write Spark programs that behave similarly to the map-reduce paradigm.
Illustration
Consider a 4-node cluster used to calculate how many times each song was played in the last year on a music streaming service. The log data containing all song-play events is hundreds of gigabytes and cannot be processed on a single machine.
As mentioned above, a map reduce job has 3 well-defined steps: map, shuffle, and reduce.
Let's assume the big log file is stored in HDFS. First, the data is broken into smaller chunks so that commodity machines can handle the computational load; HDFS takes care of this. These chunks of data are called partitions.
In the first step, each of the 4 map processes reads a partition from disk, transforms each record in that partition, and then writes the modified records to an intermediate file.
The transformation in this case consists of 5 steps:
Read each line of the log file
Check if the event is "listening to a song".
Check if the timestamp is in the correct range.
Extract the name of the song.
Finally, create a tuple with the name of the song and the number 1, e.g. (Despacito, 1).
At the end of this step, there would be multiple intermediate files containing pairs of a song title and the number 1.
Even with 4 machines, the data can be so large that many map processes must run one after another to process all the records in the log file.
After the map processes complete, the next step is the shuffle. All the records from the intermediate files are shuffled across the cluster so that pairs with the same song title end up on the same machine. This way, when each of the 4 nodes aggregates the values for a key, it can be sure it has all the corresponding data for that song title and can therefore compute the correct final result.
In the final step, the reduce step, the values for a given key are combined; in this case, all the 1s for a given key are summed up. Lastly, the count results are written to the output file.
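A hedged PySpark sketch of this job is shown below. The file paths, the comma-separated record layout, and the field positions are assumptions made for illustration; a real log would need its own parsing plus a timestamp filter for the last year.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("song_counts").getOrCreate()

# Hypothetical input path and record format: "timestamp,event_type,song_title".
log_lines = spark.sparkContext.textFile("hdfs:///logs/listen_events/*.txt")

song_counts = (
    log_lines
    .map(lambda line: line.split(","))
    .filter(lambda fields: fields[1] == "listen")  # keep only "listening to a song" events
    .map(lambda fields: (fields[2], 1))            # map: (song_title, 1)
    .reduceByKey(lambda a, b: a + b)               # shuffle + reduce: sum the 1s per title
)

song_counts.saveAsTextFile("hdfs:///output/song_counts")
```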
Spark Cluster
Most computational frameworks, including Spark, are organized into a master-worker hierarchy. The master node is responsible for orchestrating the tasks across the cluster, while the workers perform the actual computations.
There are 4 different modes to set up Spark:
Local Mode: Everything happens on a single machine, so while the Spark APIs are used in this mode, no distributed computing is done. Local Mode is useful for learning the syntax and prototyping a project.
Distributed Mode: The other 3 modes are distributed and require a cluster manager. The cluster manager is a separate process that monitors the available resources and makes sure that all machines are responsive during the job. There are 3 different options for the cluster manager (a minimal configuration sketch follows this list):
Standalone Cluster Manager (Spark): Has a driver process that acts as the master and is responsible for scheduling the tasks that the executors will perform.
YARN (Hadoop)
Mesos (Open Source Manager from UC Berkeley)
YARN and Mesos are useful when one is sharing a cluster with a team.
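As a minimal sketch of how the mode is selected in practice, choosing between local and distributed mode comes down to the master URL; the application name and host names below are placeholders, not values from this text.

```python
from pyspark.sql import SparkSession

# Local mode: "local[*]" runs Spark on all cores of a single machine.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("prototype")
    .getOrCreate()
)

# In a distributed setup, the master URL points at the cluster manager instead,
# e.g. "spark://some-host:7077" for the standalone manager. With YARN or Mesos,
# the master is usually passed at submission time, e.g.
#   spark-submit --master yarn app.py
# rather than hard-coded in the application.
```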
Spark Use Cases
Two important use cases of Spark are:
ETL: Extract, Transform, and Load data (a small pipeline sketch follows this list)
Machine Learning: Train machine learning models on big data. Spark is particularly useful for iterative algorithms, such as logistic regression or calculating PageRank.
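Below is a small, hedged sketch of an ETL job in PySpark. The bucket, file paths, and column names are hypothetical and only illustrate the extract-transform-load flow.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw CSV event logs (hypothetical path and schema).
raw = spark.read.csv("s3a://example-bucket/raw/events.csv", header=True)

# Transform: keep song-play events and parse the timestamp column.
plays = (
    raw.filter(F.col("event") == "listen")
       .withColumn("ts", F.to_timestamp("timestamp"))
)

# Load: write the cleaned data as Parquet for downstream analysis.
plays.write.mode("overwrite").parquet("s3a://example-bucket/clean/plays/")
```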