Computing the median in a large dataset
As you saw in the first recipe, computing an exact median requires all of the values to be available at once, whereas something like a mean needs only an accumulator and a counter. The fundamental point of this recipe is to introduce the idea of approximate computing: with big data, computing the precise value is not always the best strategy (of course, this should be evaluated on a case-by-case basis).
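To make the contrast concrete, here is a minimal sketch of the two computations. It is an illustration only and not the code from the first recipe; the function names streaming_mean and exact_median are hypothetical.

```python
def streaming_mean(values):
    """Mean in a single pass: only a running total and a count are kept."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    return total / count if count else float('nan')


def exact_median(values):
    """Exact median: every value has to be held in memory and sorted."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError('median of empty data')
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


data = [5, 1, 9, 3, 7]
print(streaming_mean(iter(data)))  # a one-shot iterator is enough
print(exact_median(data))          # needs the complete list
```

The mean works on a stream that is seen only once, while the median has to materialize and sort the whole dataset, which is what becomes expensive as the data grows.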
Getting ready
This recipe requires the first recipe to have been fully run.
Here, we will use two different strategies to compute the median: approximating the data points in a way that allows the data to be compressed, and subsampling the data.
As usual, this is available in the 08_Advanced/Median.ipynb notebook.
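Before turning to the recipe's own code, the two strategies can be sketched in isolation. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the names binned_median and subsample_median, the bin_width parameter, and the use of reservoir sampling for the subsample are all choices made here for the example.

```python
import random
from collections import defaultdict


def binned_median(values, bin_width):
    """Approximate median from per-bin counts (assumed compression scheme).

    Each value is reduced to a bin index, so memory grows with the number
    of distinct bins rather than the number of values.
    """
    counts = defaultdict(int)
    total = 0
    for value in values:
        counts[int(value // bin_width)] += 1
        total += 1
    if total == 0:
        raise ValueError('median of empty data')
    target = (total + 1) // 2  # rank of the (lower) median
    seen = 0
    for bin_index in sorted(counts):
        seen += counts[bin_index]
        if seen >= target:
            # Report the midpoint of the bin that contains the median.
            return (bin_index + 0.5) * bin_width


def subsample_median(values, sample_size, seed=None):
    """Median of a uniform random subsample (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for index, value in enumerate(values):
        if index < sample_size:
            reservoir.append(value)
        else:
            # Keep the new value with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = value
    if not reservoir:
        raise ValueError('median of empty data')
    reservoir.sort()
    n = len(reservoir)
    mid = n // 2
    if n % 2:
        return reservoir[mid]
    return (reservoir[mid - 1] + reservoir[mid]) / 2.0


values = range(1, 1001)  # the true median is 500.5
print(binned_median(values, bin_width=10))
print(subsample_median(values, sample_size=101, seed=0))
```

The binned estimate lies in the same bin as the median, so a smaller bin_width gives a more precise answer at the cost of more dictionary entries. The subsample estimate varies from run to run unless a seed is fixed, and it generally tightens as sample_size grows.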
How to do it...
Take a look at the following steps:
Our first approach will be to use approximations of all values, starting with creating a dictionary. This code should be run where the first recipe was run:
from __future__ import division, print_function import...