Transposing the direction of a DataFrame operation
Many DataFrame methods have an axis
parameter. This parameter controls the direction in which the operation takes place. Axis parameters can be 'index'
(or 0
) or 'columns'
(or 1
). I prefer the string versions are they are more explicit and tend to make the code easier to read.
Nearly all DataFrame methods default the axis parameter to 0
, which applies to operations along the index. This recipe shows you how to invoke the same method along both axes.
How to do it...
- Read in the college dataset; the columns that begin with UGDS represent the percentage of the undergraduate students of a particular race. Use the filter method to select these columns:
>>> college = pd.read_csv( ... "data/college.csv", index_col="INSTNM" ... ) >>> college_ugds = college.filter(like="UGDS_") >>> college_ugds.head() UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN INSTNM ... Alabama A... 0.0333 0.9353 ... 0.0059 0.0138 Universit... 0.5922 0.2600 ... 0.0179 0.0100 Amridge U... 0.2990 0.4192 ... 0.0000 0.2715 Universit... 0.6988 0.1255 ... 0.0332 0.0350 Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
- Now that the DataFrame contains homogenous column data, operations can be sensibly done both vertically and horizontally. The
.count
method returns the number of non-missing values. By default, itsaxis
parameter is set to0
:>>> college_ugds.count() UGDS_WHITE 6874 UGDS_BLACK 6874 UGDS_HISP 6874 UGDS_ASIAN 6874 UGDS_AIAN 6874 UGDS_NHPI 6874 UGDS_2MOR 6874 UGDS_NRA 6874 UGDS_UNKN 6874 dtype: int64
The
axis
parameter is almost always set to0
. So, step 2 is equivalent to bothcollege_ugds.count(axis=0)
andcollege_ugds.count(axis='index')
. - Changing the axis parameter to
'columns'
changes the direction of the operation so that we get back a count of non-missing items in each row:>>> college_ugds.count(axis="columns").head() INSTNM Alabama A & M University 9 University of Alabama at Birmingham 9 Amridge University 9 University of Alabama in Huntsville 9 Alabama State University 9 dtype: int64
- Instead of counting non-missing values, we can sum all the values in each row. Each row of percentages should add up to 1. The
.sum
method may be used to verify this:>>> college_ugds.sum(axis="columns").head() INSTNM Alabama A & M University 1.0000 University of Alabama at Birmingham 0.9999 Amridge University 1.0000 University of Alabama in Huntsville 1.0000 Alabama State University 1.0000 dtype: float64
- To get an idea of the distribution of each column, the
.median
method can be used:>>> college_ugds.median(axis="index") UGDS_WHITE 0.55570 UGDS_BLACK 0.10005 UGDS_HISP 0.07140 UGDS_ASIAN 0.01290 UGDS_AIAN 0.00260 UGDS_NHPI 0.00000 UGDS_2MOR 0.01750 UGDS_NRA 0.00000 UGDS_UNKN 0.01430 dtype: float64
How it works...
The direction of operation on the axis is one of the more confusing aspects of pandas. Many pandas users have difficulty remembering the meaning of the axis parameter. I remember them by reminding myself that a Series only has one axis, the index (or 0). A DataFrame also has an index (axis 0) and columns (axis 1).
There's more...
The .cumsum
method with axis=1
accumulates the race percentages across each row. It gives a slightly different view of the data. For example, it is very easy to see the exact percentage of white and black students for each school:
>>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
>>> college_ugds_cumsum.head()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9686 ... 0.9862 1.0000
Universit... 0.5922 0.8522 ... 0.9899 0.9999
Amridge U... 0.2990 0.7182 ... 0.7285 1.0000
Universit... 0.6988 0.8243 ... 0.9650 1.0000
Alabama S... 0.0158 0.9366 ... 0.9863 1.0000