Hadoop's Overhead - Our Experiments with Big Data

While advising a client with a strong interest in ARM servers, we decided to evaluate the computational overhead of various big data technologies, which led to some interesting discoveries. Since we in the field are all trying to figure out how big data technology will evolve, Flux7 Labs thought we’d share some of what we’ve learned.

The Question

The question we tried to answer was, “Is Hadoop a good candidate for microserver workloads?” Big Data workloads, with their high reliance on memory and network, are often touted as perfect candidates for moving to microservers. Among Big Data workloads, Hadoop has become the poster child, and microserver vendors are keenly interested in seeing Hadoop ported to their platforms. Our theory was that Hadoop is a very high-overhead application because it was designed to spend excess CPU capacity in order to conserve disk and network bandwidth. That would make vanilla Hadoop, without tuning and rearchitecting, unsuitable for microservers, which have limited CPU horsepower and excess disk and network capacity. To test our theory, we conducted a brief experiment comparing the performance of the same application running on Hadoop, on Storm, and natively.

The Experiment

Our experiment was based on a project previously coded by our CTO, Ali Hussain, for an assignment in the Coursera course, “Introduction to Data Science” (https://www.coursera.org/course/datasci). Ali’s project was a compute-heavy application for performing sentiment analysis on a Twitter stream. You can find his code at https://github.com/Flux7Labs/twitter-fun. The code classifies tweets as positive, negative or neutral based on their content. Most of the work is performed by parsing text and looking up the words in a dictionary in order to classify them. We rewrote the original Python program in Java. We ran it natively, as a Storm bolt and as a Hadoop MapReduce job.
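To make the workload concrete, here is a minimal sketch of dictionary-based sentiment classification in Python. It is not Ali's code (that lives in the repository linked above); the tiny `LEXICON`, the tokenizing regex, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical sentiment lexicon: word -> score. A real run would load a far
# larger dictionary from a file; these entries are illustrative only.
LEXICON = {"good": 2, "great": 3, "love": 3, "bad": -2, "awful": -3, "hate": -3}

# Assumed tokenization rule: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def score_tweet(text, lexicon=LEXICON):
    """Sum the lexicon scores of every word in the tweet (unknown words score 0)."""
    words = WORD_RE.findall(text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def classify_tweet(text, lexicon=LEXICON):
    """Label a tweet positive, negative or neutral from its total score."""
    score = score_tweet(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_tweet("I love this great phone"))  # positive
```

Almost all of the time goes into splitting text and doing hash lookups, which is why we describe the application as compute-heavy.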

We ran the experiment on a single Amazon small instance in three configurations: natively, on Hadoop in pseudo-distributed mode, and on a Storm deployment. The input was 1 GB of data comprising 760,000 tweets.

The Results

As we expected, performance was fastest when running the native app, followed by Storm, and with Hadoop bringing up the rear.

Table 1: Throughput, Latency and Run time in Native ST, Hadoop Single Node and Storm Single Node


Columns: Native ST, Hadoop single-node, Storm single-node
Rows: Throughput (tweets/s), Latency per tweet (ms), Run time (s)

Figure 1: Throughput, Latency and Run time in Native ST, Hadoop Single Node and Storm Single Node

The direction of the overhead was exactly what we expected: the Storm and Hadoop instances did significantly more work. What surprised us was the magnitude: Hadoop’s overhead was nearly 6x relative to native. At a time when “Hadoop” and “Big Data” have become marketing buzzwords, we think it’s important to question whether Hadoop is actually the best choice for the specific needs of a particular company. We believe big data solutions that use complicated distributed systems should be used only for solving real business needs, and not just because a solution’s name is trending. Hadoop and similar platforms are invaluable when addressing needs that play to their strengths, but otherwise they’re often no more than expensive overkill in terms of overhead.
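For readers who want to reproduce this kind of comparison, the three metrics are related by simple arithmetic. The sketch below assumes latency per tweet is run time divided by tweet count on a single-threaded run. The tweet count comes from our data set, but the two run times are hypothetical placeholders chosen only to show the calculation; they are not our measured values.

```python
def throughput_and_latency(num_tweets, run_time_s):
    """Return (tweets per second, milliseconds per tweet) for one run."""
    throughput = num_tweets / run_time_s
    latency_ms = 1000.0 * run_time_s / num_tweets
    return throughput, latency_ms


def overhead_factor(framework_run_time_s, native_run_time_s):
    """How many times longer the framework run took than the native run."""
    return framework_run_time_s / native_run_time_s


NUM_TWEETS = 760_000  # size of the data set used in the experiment

# Hypothetical run times, NOT measurements from the experiment.
native_s = 100.0
framework_s = 600.0

tput, lat = throughput_and_latency(NUM_TWEETS, native_s)
print(f"native: {tput:.0f} tweets/s, {lat:.3f} ms/tweet")
print(f"overhead factor: {overhead_factor(framework_s, native_s):.1f}x")
```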

The most interesting result of our experiment was learning how differently Hadoop and Storm split their time between user and system. Table 2 presents the results in tabular form and Figure 2 presents them graphically.

Table 2: User vs. System Time in the Hadoop vs. Storm Matchup




Columns: Hadoop, Storm
Rows: User time (ms/tweet), System time (ms/tweet)



Figure 2: User vs. System Time in the Hadoop vs. Storm Matchup
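User time is CPU time spent in the application's own code, while system time is CPU time the kernel spends on its behalf (I/O, networking, and so on). As a rough sketch of how such a per-tweet split can be collected, the Python helper below uses the standard `os.times()` call. This is not the method used for our Java runs; `cpu_ms_per_item` is a hypothetical helper, and it counts only the current process, not child processes.

```python
import os


def cpu_ms_per_item(process_item, items):
    """Run process_item over items; return (user ms/item, system ms/item).

    Only CPU time charged to the current process is counted; time spent in
    child processes is ignored.
    """
    items = list(items)
    if not items:
        raise ValueError("items must not be empty")
    before = os.times()
    for item in items:
        process_item(item)
    after = os.times()
    user_ms = 1000.0 * (after.user - before.user) / len(items)
    system_ms = 1000.0 * (after.system - before.system) / len(items)
    return user_ms, system_ms


# Example: time a trivial tokenizer over a synthetic batch of tweets.
tweets = ["just landed in a great city"] * 10_000
user_ms, system_ms = cpu_ms_per_item(lambda t: t.lower().split(), tweets)
print(f"user: {user_ms:.5f} ms/tweet, system: {system_ms:.5f} ms/tweet")
```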

The results above support our original theory that Hadoop is a very heavyweight solution designed to address problems by CPU-intensive means. It partitions data into chunks, and each chunk is reread and combined multiple times. This keeps a job from injecting too much traffic into the network and onto the disk, but it’s also expensive because of the heavy reliance on the CPU. At the other end of the spectrum, Storm has more than twice the system time of Hadoop. Storm works by receiving data and sending it directly to the next bolt. Since it doesn’t try to do any other work on the data, Storm’s system penalty is higher than Hadoop’s, but its overall time is lower.
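The chunk-and-combine pattern can be illustrated with a toy, in-memory map, shuffle, and reduce pipeline. This is a sketch of the general MapReduce idea only: it does not use Hadoop's APIs, and it omits the sorting, serialization, and disk spills that a real Hadoop job performs. All names here, including `toy_classify`, are hypothetical.

```python
from collections import defaultdict


def toy_classify(text):
    """Hypothetical stand-in for the real sentiment classifier."""
    if "good" in text:
        return "positive"
    if "bad" in text:
        return "negative"
    return "neutral"


def split_into_chunks(records, chunk_size):
    """Partition the input into fixed-size chunks, one per map task."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk, classify):
    """Map step: emit a (label, 1) pair for every record in the chunk."""
    return [(classify(record), 1) for record in chunk]


def shuffle(mapped_chunks):
    """Shuffle step: re-read every mapper's output and group values by key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the grouped values into one count per label."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, classify, chunk_size=2):
    chunks = split_into_chunks(records, chunk_size)
    mapped = [map_chunk(chunk, classify) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


print(run_job(["good day", "bad day", "good night", "plain"], toy_classify))
```

Even in this toy version every record is touched three times (map, shuffle, reduce), which hints at where the extra user time goes.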

Survival Of The Fittest

These test results remind us how big data hardware and software have developed. It’s analogous to the theory of evolution. One confusing concept in the theory of evolution is survival of the fittest. The key to understanding survival of the fittest is to recognize that “fittest” is not based on an absolute, across-the-board scale, but rather on what fits best within the environment or ecosystem of a particular organism. In the world of big data, software and hardware constitute an ecosystem in which each creates demands on the other that are defined by their limitations. What has traditionally been considered “fit” in the world of big data applications is that network and disk are at a premium while compute is freely available. It will require a slow, extensive transformation of thinking to change that point of view as more nuanced solutions emerge over time.

Apples To Oranges

Admittedly, ours was a limited experiment. The results are by no means conclusive, because the three setups were doing quite different things, and each system has different strengths and weaknesses in meeting specific workload requirements. Many factors play into such a comparison, including relative network and disk performance and the sizes of the nodes. However, we believe the results of our research provide food for thought on the design of future applications and frameworks.

February 03, 2014

About the Author

Ali Hussain