Benchmarking m3.xlarge with Wikibench

In a recent post, we discussed the basics of the Wikibench tool and how it is used for application benchmarking. Today, we’ll discuss how we used Wikibench to benchmark an m3.xlarge instance. The operating system used for this benchmarking was 64-bit Ubuntu 12.04 (Precise).

We ran Wikibench using different sampling rates (reduction in per mille). We used a single trace file for our experiments. Typically, a single trace file downloaded from the Wikibench website contains one hour of request traces. We found empirically that an m3.xlarge instance can handle approximately 24,000 requests per hour. Based on this figure, we sampled the trace file at several rates, with the number of requests per hour ranging from ~2K to ~45K.
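The relationship between the sampling factor and the sampled request count can be sketched in a few lines of Python. This assumes that a reduction of r per mille drops r out of every 1000 requests, so 1000 - r are kept. The full-trace size used here is back-computed from the ~45K figure at a factor of 980 and is only illustrative, and the function name is our own.

```python
def expected_requests(total_requests: int, reduction_permille: int) -> int:
    """Expected size of the sampled trace.

    Assumes the reduction is the number of requests dropped per 1000,
    so (1000 - reduction) out of every 1000 requests are kept.
    """
    if not 0 <= reduction_permille <= 1000:
        raise ValueError("reduction must be between 0 and 1000")
    return round(total_requests * (1000 - reduction_permille) / 1000)


# Illustrative only: ~45K requests at a factor of 980 (20 per mille kept)
# implies roughly 2.25 million requests in the unsampled one-hour trace.
ASSUMED_TOTAL = 2_250_000

if __name__ == "__main__":
    for factor in (980, 985, 990, 995, 999):
        print(factor, expected_requests(ASSUMED_TOTAL, factor))
```

Under that assumption, a factor of 999 leaves about 2,250 requests, which is in line with the ~2K figure above.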

The following table gives the sampling factor (reduction in per mille) and the number of requests in the sampled trace file. It also shows the number of missed requests. (Requests are missed because the web server cannot keep up with the request rate when the traces are replayed.)

Table 1

Here is a graph for the number of missed requests corresponding to the sampling factor.

diagram 1

While we ran the experiments, we also captured CPU load, memory and disk I/O activity using the collectd tool. We captured network I/O using the CloudWatch monitoring provided in the AWS console.

The corresponding graphs follow, but first an explanation. We initially ran Wikibench with a sampling factor of 980 (~45K requests) and gradually increased the sampling factor to 999 (~2K requests). The experiments started at 12pm and ended ~20 hours later, at 8am the next day. Over that period, the graphs clearly show a reduction in CPU load, memory consumption, disk activity and network I/O.

CPU Load

diagram 2

Disk I/O

diagram 3

Network In

diagram 4

Network Out

diagram 5

Memory

diagram 6

The memory consumption graph is relatively flat. This is because we did not use any kind of memory caching in our MediaWiki application.

We calculated the mean response time for each sampling level. Below is a graph of the mean response time against the number of requests served per hour.

diagram 7

To understand this graph, please refer to our post on Little’s Law. The graph isn’t completely consistent with that post because the traffic is bursty, which lets the queue drain between bursts, so Little’s Law does not apply directly. But we can see a clear inflection point in this graph at 11326 requests per hour.
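As a quick reminder, Little’s Law says that the average number of requests in the system equals the arrival rate times the average time each request spends there (L = λW), and it holds for long-run averages in a stable system. A minimal Python sketch of that arithmetic follows. The 0.5 s mean response time is a made-up value for illustration, not a measurement from these runs, and the function name is our own.

```python
def requests_in_flight(requests_per_hour: float, mean_response_s: float) -> float:
    """Little's Law, L = lambda * W, with lambda converted to requests per second."""
    arrival_rate_per_s = requests_per_hour / 3600.0
    return arrival_rate_per_s * mean_response_s


if __name__ == "__main__":
    # 11326 requests/hour is the inflection point read off the graph;
    # the 0.5 s mean response time is hypothetical.
    print(round(requests_in_flight(11326, 0.5), 2))
```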

Post Processing the Result Log Files

We used a Python script to extract the results and build a per-minute histogram, then plotted the results using gnuplot. Here are the graphs:
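The script itself is not reproduced in this post, so the following is only a sketch of what such post-processing might look like. It assumes each result has already been parsed into a (timestamp in seconds, response time in seconds) pair. The actual Wikibench log layout is not covered here, and the function name per_minute_stats is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def per_minute_stats(samples):
    """Group (timestamp_s, response_time_s) pairs into one-minute buckets.

    Returns a list of (minute_index, request_count, median_rt, mean_rt),
    where minute_index counts minutes from the earliest timestamp.
    Minutes with no requests are simply absent from the result.
    """
    samples = list(samples)
    if not samples:
        return []
    start = min(ts for ts, _ in samples)
    buckets = defaultdict(list)
    for ts, rt in samples:
        buckets[int((ts - start) // 60)].append(rt)
    return [
        (minute, len(rts), median(rts), mean(rts))
        for minute, rts in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # Hypothetical samples, not taken from the benchmark logs.
    demo = [(0.0, 0.20), (12.5, 0.30), (59.9, 0.25), (61.0, 0.40), (130.0, 1.10)]
    # One whitespace-separated row per minute, a layout gnuplot can plot directly.
    for minute, count, med, avg in per_minute_stats(demo):
        print(minute, count, f"{med:.3f}", f"{avg:.3f}")
```

Writing one row per minute keeps the request count, median and mean in the same file, so all three graphs can be drawn from a single data set.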

Number of Requests per Minute

diagram 8

Median Response Time

diagram 9

It is interesting to note that we see no difference in the median response time between sampling factors of 999 and 995. We start seeing a slight increase at 990. However, we see a relatively large difference at 985 and 980, which also corresponds with the number of missed requests.

Mean Response Time

diagram 10

Although the mean response time does not give an accurate picture because of high variance, it is still useful as an approximate measure of response times. It would be interesting to run the same experiments on different AWS instance types and compare the results.
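To illustrate why high variance pulls the mean away from the median, here is a small Python sketch. The response times are invented for illustration and are not taken from our logs.

```python
from statistics import mean, median

# Hypothetical response times in seconds: mostly fast, with two slow outliers.
response_times = [0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.4, 0.5, 6.0, 9.0]

print("median:", median(response_times))
print("mean:", round(mean(response_times), 2))
```

The two slow requests barely move the median but raise the mean to several times its value, which is why we report both.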



May 27, 2014 / Benchmarking

About the Author

Flux7 Labs