Determining college campus diversity
Many articles are written every year on the different aspects and impacts of diversity on college campuses. Various organizations have developed metrics that attempt to measure diversity. US News is a leader in providing rankings for many different categories of colleges, with diversity being one of them. Its ten most diverse colleges, with their Diversity Index scores, are as follows:
>>> pd.read_csv(
...     "data/college_diversity.csv", index_col="School"
... )
                                                  Diversity Index
School
Rutgers University--Newark Newark, NJ                        0.76
Andrews University Berrien Springs, MI                       0.74
Stanford University Stanford, CA                             0.74
University of Houston Houston, TX                            0.74
University of Nevada--Las Vegas Las Vegas, NV                0.74
University of San Francisco San Francisco, CA                0.74
San Francisco State University San Francisco, CA             0.73
University of Illinois--Chicago Chicago, IL                  0.73
New Jersey Institute of Technology Newark, NJ                0.72
Texas Woman's University Denton, TX                          0.72
Our college dataset classifies race into nine different categories. When trying to quantify something without an obvious definition, such as diversity, it helps to start with something simple. In this recipe, our diversity metric will equal the number of races that each make up 15% or more of the student population.
How to do it...
- Read in the college dataset, and filter for just the undergraduate race columns:
>>> college = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = college.filter(like="UGDS_")
- Many of these colleges have missing values for all their race columns. We can count all the missing values for each row and sort the resulting Series from the highest to lowest. This will reveal the colleges that have missing values:
>>> (
...     college_ugds.isnull()
...     .sum(axis="columns")
...     .sort_values(ascending=False)
...     .head()
... )
INSTNM
Excel Learning Center-San Antonio South         9
Philadelphia College of Osteopathic Medicine    9
Assemblies of God Theological Seminary          9
Episcopal Divinity School                       9
Phillips Graduate Institute                     9
dtype: int64
- Now that we have seen the colleges that are missing all their race columns, we can use the .dropna method to drop all rows that have all nine race percentages missing. We can then count the remaining missing values:
>>> college_ugds = college_ugds.dropna(how="all")
>>> college_ugds.isnull().sum()
UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64
- There are no missing values left in the dataset. We can now calculate our diversity metric. To get started, we will use the greater than or equal DataFrame method, .ge, to return a DataFrame with a Boolean value for each cell:
>>> college_ugds.ge(0.15)
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Alabama A...       False        True  ...     False      False
Universit...        True        True  ...     False      False
Amridge U...        True        True  ...     False       True
Universit...        True       False  ...     False      False
Alabama S...       False        True  ...     False      False
...                  ...         ...  ...       ...        ...
Hollywood...        True        True  ...     False      False
Hollywood...       False        True  ...     False      False
Coachella...        True       False  ...     False      False
Dewey Uni...       False       False  ...     False      False
Coastal P...        True        True  ...     False      False
- From here, we can use the .sum method to count the True values for each college. Notice that a Series is returned:
>>> diversity_metric = college_ugds.ge(0.15).sum(
...     axis="columns"
... )
>>> diversity_metric.head()
INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64
- To get an idea of the distribution, we will use the .value_counts method on this Series:
>>> diversity_metric.value_counts()
1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64
- Amazingly, two schools have 15% or more in five different race categories. Let's sort the diversity_metric Series to find out which ones they are:
>>> diversity_metric.sort_values(ascending=False).head()
INSTNM
Regency Beauty Institute-Austin          5
Central Texas Beauty College-Temple      5
Sullivan and Cogliano Training Center    4
Ambria College of Nursing                4
Berkeley College-New York                4
dtype: int64
- It seems a little suspicious that schools can be that diverse. Let's look at the raw percentages from these top two schools. We will use .loc to select rows based on the index label:
>>> college_ugds.loc[
...     [
...         "Regency Beauty Institute-Austin",
...         "Central Texas Beauty College-Temple",
...     ]
... ]
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...
Regency B...      0.1867      0.2133  ...       0.0     0.2667
Central T...      0.1616      0.2323  ...       0.0     0.1515
- It appears that several categories were aggregated into the unknown and two or more races columns. Regardless of this, both schools appear to be quite diverse. We can see how the top five US News schools fared with this basic diversity metric:
>>> us_news_top = [
...     "Rutgers University-Newark",
...     "Andrews University",
...     "Stanford University",
...     "University of Houston",
...     "University of Nevada-Las Vegas",
... ]
>>> diversity_metric.loc[us_news_top]
INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64
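One practical note on the last step: the US News table spells names with a double hyphen ("Rutgers University--Newark"), while the labels passed to .loc use a single hyphen. Selecting with a list through .loc raises a KeyError when any label is absent from the index, so it can help to check the labels first. The following is a minimal sketch on a small made-up Series; the values and the names toy_metric, wanted, present, and missing are hypothetical and not taken from the college dataset:

```python
import pandas as pd

# Hypothetical stand-in for diversity_metric; the values are made up
toy_metric = pd.Series(
    [4, 3, 1],
    index=pd.Index(
        [
            "Rutgers University-Newark",
            "Stanford University",
            "Some Other College",
        ],
        name="INSTNM",
    ),
)

# The first label uses the double-hyphen spelling and will not match
wanted = ["Rutgers University--Newark", "Stanford University"]

# Split the requested labels into those present in the index and those not
present = [name for name in wanted if name in toy_metric.index]
missing = [name for name in wanted if name not in toy_metric.index]

print(toy_metric.loc[present])
print("Not found:", missing)
```

Checking membership up front makes a spelling mismatch visible instead of stopping the lookup with an exception.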
How it works...
Step 2 counts and then displays the schools with the highest number of missing values. As there are nine columns in the DataFrame, the maximum number of missing values per school is nine. Many schools are missing values for each column. Step 3 removes rows that have all their values missing. The .dropna method in step 3 has the how parameter, which defaults to the string 'any', but may also be changed to 'all'. When set to 'any', it drops rows that contain one or more missing values. When set to 'all', it only drops rows where all values are missing.
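The difference between the two settings is easiest to see on a tiny example. This is a sketch on made-up data; the DataFrame toy and its values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: School B is missing one value, School C is missing both
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.6, np.nan, np.nan],
        "UGDS_BLACK": [0.4, 0.3, np.nan],
    },
    index=["School A", "School B", "School C"],
)

# how="any" (the default) drops every row with at least one missing value
dropped_any = toy.dropna(how="any")

# how="all" drops only the rows in which every value is missing
dropped_all = toy.dropna(how="all")

print(dropped_any.index.tolist())  # ['School A']
print(dropped_all.index.tolist())  # ['School A', 'School B']
```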
In this case, we conservatively drop only the rows that are missing all values, because it's possible that some missing values represent 0 percent. That did not turn out to be the case here, as no missing values remained after the .dropna method was called. If any had remained, we could have used the .fillna(0) method to fill them with 0.
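Had some values still been missing, the two calls could be chained. A minimal sketch on made-up data follows; the DataFrame toy is hypothetical, and treating a remaining missing value as 0 percent is an assumption about the data, not something the dataset confirms:

```python
import numpy as np
import pandas as pd

# Hypothetical data: School B has one missing value, School C has no data at all
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.6, 0.7, np.nan],
        "UGDS_BLACK": [0.4, np.nan, np.nan],
    },
    index=["School A", "School B", "School C"],
)

# Drop schools with no data, then treat any value still missing as 0 (assumption)
cleaned = toy.dropna(how="all").fillna(0)

print(cleaned)
print(cleaned.isnull().sum().sum())  # 0, no missing values remain
```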
Step 4 begins our diversity metric calculation using the greater than or equal to method, .ge. This results in a DataFrame of all Booleans, which step 5 sums horizontally by setting axis='columns'.
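Because the metric uses .ge, a category sitting exactly at the 15% threshold is counted. The sketch below, on made-up data (the DataFrame toy and its values are hypothetical), contrasts .ge with the strict .gt method to show where the two differ:

```python
import pandas as pd

# Hypothetical data: School B has two categories exactly at the threshold
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.70, 0.15, 0.40],
        "UGDS_BLACK": [0.20, 0.15, 0.30],
        "UGDS_HISP": [0.10, 0.70, 0.30],
    },
    index=["School A", "School B", "School C"],
)

# True counts as 1 when summed, so a row-wise sum counts qualifying categories
at_least = toy.ge(0.15).sum(axis="columns")
strictly_more = toy.gt(0.15).sum(axis="columns")

print(at_least.tolist())       # [2, 3, 3]
print(strictly_more.tolist())  # [2, 1, 3]
```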
The .value_counts method is used in step 6 to produce a distribution of our diversity metric. Only about 14% of schools (941 of 6,874) have three or more races with 15% or more of the undergraduate student population. Step 7 and step 8 find the two schools that are the most diverse based on our metric. Although they are diverse, it appears that many of the races are not fully accounted for and are defaulted into the unknown and two or more categories.
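To put the distribution in relative terms, the counts printed in step 6 can be turned into shares. The sketch below re-enters those counts by hand rather than reading the dataset; the names counts, shares, and share_three_plus are hypothetical. On the real Series, diversity_metric.value_counts(normalize=True) should give the same shares directly:

```python
import pandas as pd

# Counts copied from the step 6 output (metric value -> number of schools)
counts = pd.Series({1: 3042, 2: 2884, 3: 876, 4: 63, 0: 7, 5: 2})

# Convert to shares of all schools and order by metric value
shares = (counts / counts.sum()).sort_index()

# Share of schools with three or more categories at 15% or above
share_three_plus = shares.loc[3:].sum()

print(shares.round(4))
print(round(share_three_plus, 4))
```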
Step 9 selects the top five schools from the US News article. It then selects their diversity metric from our newly created Series. It turns out that these schools also score highly with our simple ranking system.
There's more...
Alternatively, we can find the schools that are least diverse by ordering them by their maximum race percentage:
>>> (
...     college_ugds.max(axis=1)
...     .sort_values(ascending=False)
...     .head(10)
... )
INSTNM
Dewey University-Manati                               1.0
Yeshiva and Kollel Harbotzas Torah                    1.0
Mr Leon's School of Hair Design-Lewiston              1.0
Dewey University-Bayamon                              1.0
Shepherds Theological Seminary                        1.0
Yeshiva Gedolah Kesser Torah                          1.0
Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
Yeshiva Shaar Hatorah                                 1.0
Bais Medrash Elyon                                    1.0
Yeshiva of Nitra Rabbinical College                   1.0
dtype: float64
We can also determine if any school has all nine race categories exceeding 1%:
>>> (college_ugds > 0.01).all(axis=1).any()
True
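The chained .all and .any calls only report whether such a school exists. To see which schools qualify, the intermediate Boolean Series can be used as a mask. A minimal sketch on made-up data with three categories instead of nine; the DataFrame toy and the name mask are hypothetical:

```python
import pandas as pd

# Hypothetical data: only School A exceeds 1% in every category
toy = pd.DataFrame(
    {
        "UGDS_WHITE": [0.50, 0.98],
        "UGDS_BLACK": [0.30, 0.02],
        "UGDS_HISP": [0.20, 0.00],
    },
    index=["School A", "School B"],
)

# One Boolean per school: True when every category exceeds 1%
mask = (toy > 0.01).all(axis=1)

print(mask.any())                # True
print(toy.index[mask].tolist())  # ['School A']
```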