DataFrame operations
A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.
When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5
to each value of the DataFrame raises a TypeError
as integers cannot be added to strings:
>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
...
TypeError: can only concatenate str (not "int") to str
To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'
. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter
method:
>>> colleges = pd.read_csv(
... "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round
method.
How to do it...
- pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the
UGDS_BLACK
row of this series when we round it to two decimal places:>>> name = "Northwest-Shoals Community College" >>> college_ugds.loc[name] UGDS_WHITE 0.7912 UGDS_BLACK 0.1250 UGDS_HISP 0.0339 UGDS_ASIAN 0.0036 UGDS_AIAN 0.0088 UGDS_NHPI 0.0006 UGDS_2MOR 0.0012 UGDS_NRA 0.0033 UGDS_UNKN 0.0324 Name: Northwest-Shoals Community College, dtype: float64 >>> college_ugds.loc[name].round(2) UGDS_WHITE 0.79 UGDS_BLACK 0.12 UGDS_HISP 0.03 UGDS_ASIAN 0.00 UGDS_AIAN 0.01 UGDS_NHPI 0.00 UGDS_2MOR 0.00 UGDS_NRA 0.00 UGDS_UNKN 0.03 Name: Northwest-Shoals Community College, dtype: float64
If we add
.0001
before rounding, it changes to rounding up:>>> (college_ugds.loc[name] + 0.0001).round(2) UGDS_WHITE 0.79 UGDS_BLACK 0.13 UGDS_HISP 0.03 UGDS_ASIAN 0.00 UGDS_AIAN 0.01 UGDS_NHPI 0.00 UGDS_2MOR 0.00 UGDS_NRA 0.00 UGDS_UNKN 0.03 Name: Northwest-Shoals Community College, dtype: float64
- Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add
.00501
to each value ofcollege_ugds
:>>> college_ugds + 0.00501 UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03831 0.94031 ... 0.01091 0.01881 Universit... 0.59721 0.26501 ... 0.02291 0.01501 Amridge U... 0.30401 0.42421 ... 0.00501 0.27651 Universit... 0.70381 0.13051 ... 0.03821 0.04001 Alabama S... 0.02081 0.92581 ... 0.02931 0.01871 ... ... ... ... ... ... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- Use the floor division operator,
//
, to round down to the nearest whole number percentage:>>> (college_ugds + 0.00501) // 0.01 UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 3.0 94.0 ... 1.0 1.0 Universit... 59.0 26.0 ... 2.0 1.0 Amridge U... 30.0 42.0 ... 0.0 27.0 Universit... 70.0 13.0 ... 3.0 4.0 Alabama S... 2.0 92.0 ... 2.0 1.0 ... ... ... ... ... ... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- To complete the rounding exercise, divide by
100
:>>> college_ugds_op_round =( ... (college_ugds + 0.00501) // 0.01 / 100 ... ) >>> college_ugds_op_round.head() UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03 0.94 ... 0.01 0.01 Universit... 0.59 0.26 ... 0.02 0.01 Amridge U... 0.30 0.42 ... 0.00 0.27 Universit... 0.70 0.13 ... 0.03 0.04 Alabama S... 0.02 0.92 ... 0.02 0.01
- Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:
>>> college_ugds_round = (college_ugds + 0.00001).round(2) >>> college_ugds_round UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.03 0.94 ... 0.01 0.01 Universit... 0.59 0.26 ... 0.02 0.01 Amridge U... 0.30 0.42 ... 0.00 0.27 Universit... 0.70 0.13 ... 0.03 0.04 Alabama S... 0.02 0.92 ... 0.02 0.01 ... ... ... ... ... .... SAE Insti... NaN NaN ... NaN NaN Rasmussen... NaN NaN ... NaN NaN National ... NaN NaN ... NaN NaN Bay Area ... NaN NaN ... NaN NaN Excel Lea... NaN NaN ... NaN NaN
- Use the equals DataFrame method to test the equality of two DataFrames:
>>> college_ugds_op_round.equals(college_ugds_round) True
How it works...
Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.
Mathematically, adding .005
should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:
>>> 0.045 + 0.005
0.049999999999999996
There is an extra .00001
added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.
Step 3 applies the floor division operator, //
, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100
and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.
In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001
to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).
It is necessary here to round up so that both DataFrame values are equal. The .equals
method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.
There's more...
Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:
>>> college2 = (
... college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True