You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Product type Paperback

Published in Feb 2020

Publisher Packt

ISBN-13 9781839213106

Length 626 pages

Edition 2nd Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Theodore Petrou

Matthew Harrison

View More author details

Table of Contents (17) Chapters

Preface

1. Pandas Foundations

2. Essential DataFrame Operations FREE CHAPTER

3. Creating and Persisting DataFrames

4. Beginning Data Analysis

5. Exploratory Data Analysis

6. Selecting Subsets of Data

7. Filtering Rows

8. Index Alignment

9. Grouping for Aggregation, Filtration, and Transformation

10. Restructuring Data into a Tidy Form

11. Combining Pandas Objects

12. Time Series Analysis

13. Visualization with Matplotlib, Pandas, and Seaborn

14. Debugging and Testing Pandas

15. Other Books You May Enjoy

16. Index

Determining college campus diversity

Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics attempting to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Their top 10 diverse colleges with Diversity Index are given as follows:

>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                   Diversity Index
School
Rutgers University--Newark  Newark, NJ                        0.76
Andrews University  Berrien Springs, MI                       0.74
Stanford University  Stanford, CA                             0.74
University of Houston  Houston, TX                            0.74
University of Nevada--Las Vegas  Las Vegas, NV                0.74
University of San Francisco  San Francisco, CA                0.74
San Francisco State University  San Francisco, CA             0.73
University of Illinois--Chicago  Chicago, IL                  0.73
New Jersey Institute of Technology  Newark, NJ                0.72
Texas Woman's University  Denton, TX                          0.72

Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the count of the number of races having greater than 15% of the student population.

How to do it...

Read in the college dataset, and filter for just the undergraduate race columns:

>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")

Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:

>>> (
...     college_ugds.isnull()
...     .sum(axis="columns")
...     .sort_values(ascending=False)
...     .head()
... )
INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64

Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:

>>> college_ugds = college_ugds.dropna(how="all")
>>>; college_ugds.isnull().sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:

>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False

From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:

>>> diversity_metric = college_ugds.ge(0.15).sum(
...     axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64

To get an idea of the distribution, we will use the .value_counts method on this Series:

>>> diversity_metric.value_counts()
1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

Amazingly, two schools have more than 15% in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:

>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64

It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:

>>> college_ugds.loc[
...     [
...         "Regency Beauty Institute-Austin",
...         "Central Texas Beauty College-Temple",
...     ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515

It appears that several categories were aggregated into the unknown and two or more races column. Regardless of this, they both appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:

>>> us_news_top = [
...     "Rutgers University-Newark",
...     "Andrews University",
...     "Stanford University",
...     "University of Houston",
...     "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

How it works...

Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.

In this case, we conservatively drop rows that are missing all values. This is because it's possible that some missing values represent 0 percent. This did not happen to be the case here, as there were no missing values after the dropna method was performed. If there were still missing values, we could have run the .fillna(0) method to fill all the remaining values with 0.

Step 5 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which is summed horizontally by setting axis='columns'.

The .value_counts method is used in step 6 to produce a distribution of our diversity metric. It is quite rare for schools to have three races with 15% or more of the undergraduate student population. Step 7 and step 8 find two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.

Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.

There's more...

Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:

>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64

We can also determine if any school has all nine race categories exceeding 1%:

>>> (college_ugds > 0.01).all(axis=1).any()
True

You're reading from Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Table of Contents (17) Chapters

Determining college campus diversity

How to do it...

How it works...

There's more...

Authors (2)

Other recommended products

Personalised recommendations for you