Hardware
CPU
The CPU has a few different functions, including:
Directing other components of a computer
Running mathematical calculations
A typical CPU operation takes about 0.4 nanoseconds on average (a nanosecond is one billionth of a second).
The CPU can also store small amounts of data inside itself in what are called registers. These registers hold data that the CPU is working with at the moment.
For example, say one writes a program that reads in a 40 MB data file and then analyzes it. When the code is executed, the following happens:
The CPU instructs the computer to take the 40 MB from disk and store the data in memory (RAM)
If the task is to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together.
The accumulation of the sum is stored in a register while the CPU grabs the next number.
The registers make computations more efficient: they avoid unnecessarily sending data back and forth between memory (RAM) and the CPU.
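The accumulation loop described above can be sketched in Python. This is only an illustration of the pattern; on real hardware the work happens in machine instructions, with the running total held in a CPU register:

```python
# Sketch of the accumulator pattern: sum a column of data two
# numbers at a time. In hardware, `total` would live in a CPU
# register while each value is fetched from memory (RAM).
def sum_column(numbers):
    total = 0                  # the accumulator ("in the register")
    for value in numbers:      # CPU grabs the next number from memory
        total = total + value  # two numbers summed at a time
    return total

print(sum_column([3, 1, 4, 1, 5]))  # → 14
```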
A 2.5 Gigahertz CPU means that the CPU processes 2.5 billion operations per second.
Below is an illustration of the processing power of such a CPU.
Twitter generates about 6,000 tweets per second, and each tweet contains 200 bytes. So in one day, Twitter generates data on the order of:
(6000 tweets / second) * (86400 seconds / day) * (200 bytes / tweet) = 104 billion bytes / day
Presuming that a 2.5 gigahertz CPU processes 8 bytes of data per operation, it would take the following amount of time to process one day's worth of tweets:
(104 billion bytes / day) / ( (2.5 billion operations / second) * (8 bytes / operations) ) = 5.2 seconds
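The two estimates above can be checked with a few lines of Python; all figures are the rough ones quoted in the text:

```python
# Back-of-the-envelope figures from the text.
tweets_per_second = 6_000
bytes_per_tweet = 200
seconds_per_day = 86_400

bytes_per_day = tweets_per_second * seconds_per_day * bytes_per_tweet
print(f"{bytes_per_day / 1e9:.1f} billion bytes/day")  # → 103.7 billion bytes/day

ops_per_second = 2.5e9   # 2.5 GHz CPU
bytes_per_op = 8         # presumed 8 bytes per operation

seconds_to_process = bytes_per_day / (ops_per_second * bytes_per_op)
print(f"{seconds_to_process:.1f} seconds")             # → 5.2 seconds
```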
Memory (RAM)
Memory, or RAM, is a place to store data before it goes to the CPU. It is known to be "efficient, expensive, and ephemeral".
Operations in RAM are relatively fast compared to reading and writing from disk or moving data across a network. However, RAM is expensive, and data stored in RAM will get erased when a computer shuts down.
It takes 250 times longer to find and load a random byte from memory than to process that same byte with the CPU. Concretely, a random memory reference takes about 250 ns on average.
In other words, in the time it takes to load an hour's worth of random tweets from memory, a CPU could process a week's worth of tweets.
Fortunately, the data is usually organized in memory such that it's lined up and ready for immediate processing by the CPU. By loading data sequentially, say, all the tweets for a single hour, this bottleneck can be avoided.
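A rough way to see the sequential-versus-random effect is to sum the same list in order and in shuffled order. Timings vary by machine, and Python adds interpreter overhead on top of the hardware effect, so no exact numbers are claimed; this is only a sketch of the idea:

```python
import random
import timeit

data = list(range(1_000_000))
ordered = list(range(len(data)))
shuffled = ordered[:]
random.shuffle(shuffled)

# Sequential pass: touches memory in order, cache- and prefetch-friendly.
seq = timeit.timeit(lambda: sum(data[i] for i in ordered), number=10)

# Random pass: same total work, but each access jumps to an
# arbitrary address, defeating the CPU's caches and prefetcher.
rand = timeit.timeit(lambda: sum(data[i] for i in shuffled), number=10)

print(f"sequential: {seq:.3f}s, random: {rand:.3f}s")
```

On most machines the shuffled pass is noticeably slower, even though both loops do exactly the same number of additions.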
While memory is efficient for feeding data to the CPU, there are two problems with it:
Memory is ephemeral. All the data in memory is lost every time the machine shuts down. Thus, it simply isn't an option for data storage when one is holding onto valuable data about customers and the business.
Memory is expensive. Most mid-tier laptops come with only 8 or 16 GB of memory.
Back in the late 1990s, when CPU speeds started to level off, many companies invested in very expensive hardware with lots of memory.
But an alternate strategy used by Google at that time was to bypass this expense by building distributed systems out of commodity hardware. Rather than relying on lots of memory, Google leveraged long-term storage on cheap, pre-used hardware.
A distributed system or cluster is a bunch of connected machines. These machines are usually called nodes.
Also, the commodity hardware mentioned above refers to inexpensive, widely available, off-the-shelf computers.
Using distributed clusters of commodity hardware has since become the industry standard. Those early systems at Google were the foundation of technologies like Apache Spark.
Storage (SSD or Magnetic Disk)
Storage is used for keeping data over long periods of time. When a program runs, the CPU will direct the memory to temporarily load data from long-term storage.
A random read from an SSD takes about 16 microseconds (a microsecond is one millionth of a second).
While long-term storage like a hard-drive disk is cheap and durable, it's much slower than memory. Loading data from a magnetic disk can be 200 times slower. Even the newer solid-state drives (SSD) are still about 15 times slower.
For example, processing an hour of tweets from memory would take about 30 milliseconds on a mid-tier laptop. It would take close to 0.5 seconds to load the same data from SSD storage, or 4 seconds from an older magnetic hard drive.
This difference between milliseconds and seconds may seem small, but when working with gigabytes or terabytes of data, it really adds up.
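Scaling the per-hour figures above shows how the gap grows at larger sizes. Taking an hour of tweets as roughly 6,000 tweets/s × 3,600 s × 200 bytes ≈ 4.3 GB, and using the rough load times quoted in the text, a sketch of the per-terabyte cost looks like:

```python
# Rough per-hour-of-tweets load times quoted in the text (mid-tier laptop).
hour_of_tweets_bytes = 6_000 * 3_600 * 200        # ≈ 4.32 GB
times_per_hour = {"memory": 0.030, "SSD": 0.5, "magnetic disk": 4.0}

terabyte = 1e12
scale = terabyte / hour_of_tweets_bytes           # ≈ 231 "hours of tweets" per TB

for medium, seconds in times_per_hour.items():
    print(f"{medium:>13}: {seconds * scale / 60:5.1f} minutes per terabyte")
```

Milliseconds per gigabyte become minutes per terabyte, which is why the choice of storage medium matters so much at scale.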
Apache Spark was designed specifically to optimize its use of memory and avoid this problem.
Network
The network is the gateway for anything one needs that isn't stored on the computer itself. The network could connect to other computers in the same room (a Local Area Network) or to a computer on the other side of the world, over the internet.
This is the last piece of hardware that is crucial to understanding big data. Moving data across a network from one machine to another is the most common bottleneck when working with big data.
For example, the same hour of tweets that takes 0.5 seconds to process from SSD storage would take around 30 seconds to download from the Twitter API on a typical network. It usually takes at least 20 times longer to process data when it has to be downloaded from another machine first.
For this reason, distributed systems try to minimize shuffling data back and forth across different computers.
Since Apache Spark, or any distributed technology for that matter, uses a cluster of servers connected by a network, moving data around can't be avoided entirely. One of the advantages of Apache Spark is that it only shuffles data between computers when it absolutely has to. Still, minimizing network input and output is crucial to mastering Spark programming.
Key Ratios
Based on the above description of the different hardware components, the following table summarizes some key ratios:
| Hardware Component | Processing Ratio |
| --- | --- |
| CPU | 200x faster than memory |
| Memory | 15x faster than SSD |
| SSD | 20x faster than network |