Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Pandas 1.x Cookbook

You're reading from   Pandas 1.x Cookbook Practical recipes for scientific computing, time series analysis, and exploratory data analysis using Python

Arrow left icon
Product type Paperback
Published in Feb 2020
Publisher Packt
ISBN-13 9781839213106
Length 626 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Theodore Petrou Theodore Petrou
Author Profile Icon Theodore Petrou
Theodore Petrou
Matthew Harrison Matthew Harrison
Author Profile Icon Matthew Harrison
Matthew Harrison
Arrow right icon
View More author details
Toc

Table of Contents (17) Chapters Close

Preface 1. Pandas Foundations 2. Essential DataFrame Operations FREE CHAPTER 3. Creating and Persisting DataFrames 4. Beginning Data Analysis 5. Exploratory Data Analysis 6. Selecting Subsets of Data 7. Filtering Rows 8. Index Alignment 9. Grouping for Aggregation, Filtration, and Transformation 10. Restructuring Data into a Tidy Form 11. Combining Pandas Objects 12. Time Series Analysis 13. Visualization with Matplotlib, Pandas, and Seaborn 14. Debugging and Testing Pandas 15. Other Books You May Enjoy
16. Index

DataFrame operations

A primer on operators was given in the Series operations recipe from Chapter 1, Pandas Foundations, which will be helpful here. The Python arithmetic and comparison operators work with DataFrames, as they do with Series.

When an arithmetic or comparison operator is used with a DataFrame, each value of each column gets the operation applied to it. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, then the operation is likely to fail. Let's see an example of this failure with the college dataset, which contains both numeric and object data types. Attempting to add 5 to each value of the DataFrame raises a TypeError as integers cannot be added to strings:

>>> colleges = pd.read_csv("data/college.csv")
>>> colleges + 5
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "int") to str

To successfully use an operator with a DataFrame, first select homogeneous data. For this recipe, we will select all the columns that begin with 'UGDS_'. These columns represent the fraction of undergraduate students by race. To get started, we import the data and use the institution name as the label for our index, and then select the columns we desire with the .filter method:

>>> colleges = pd.read_csv(
...     "data/college.csv", index_col="INSTNM"
... )
>>> college_ugds = colleges.filter(like="UGDS_")
>>> college_ugds.head()
              UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
INSTNM                                ...                     
Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
Universit...      0.5922      0.2600  ...    0.0179     0.0100
Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
Universit...      0.6988      0.1255  ...    0.0332     0.0350
Alabama S...      0.0158      0.9208  ...    0.0243     0.0137

This recipe uses multiple operators with a DataFrame to round the undergraduate columns to the nearest hundredth. We will then see how this result is equivalent to the .round method.

How to do it...

  1. pandas does bankers rounding, numbers that are exactly halfway between either side to the even side. Look at what happens to the UGDS_BLACK row of this series when we round it to two decimal places:
    >>> name = "Northwest-Shoals Community College"
    >>> college_ugds.loc[name]
    UGDS_WHITE    0.7912
    UGDS_BLACK    0.1250
    UGDS_HISP     0.0339
    UGDS_ASIAN    0.0036
    UGDS_AIAN     0.0088
    UGDS_NHPI     0.0006
    UGDS_2MOR     0.0012
    UGDS_NRA      0.0033
    UGDS_UNKN     0.0324
    Name: Northwest-Shoals Community College, dtype: float64
    >>> college_ugds.loc[name].round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.12
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    

    If we add .0001 before rounding, it changes to rounding up:

    >>> (college_ugds.loc[name] + 0.0001).round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.13
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    
  2. Let's do this to the DataFrame. To begin our rounding adventure with operators, we will first add .00501 to each value of college_ugds:
    >>> college_ugds + 0.00501
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
    Universit...     0.59721     0.26501  ...   0.02291    0.01501
    Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
    Universit...     0.70381     0.13051  ...   0.03821    0.04001
    Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  3. Use the floor division operator, //, to round down to the nearest whole number percentage:
    >>> (college_ugds + 0.00501) // 0.01
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...         3.0        94.0  ...       1.0        1.0
    Universit...        59.0        26.0  ...       2.0        1.0
    Amridge U...        30.0        42.0  ...       0.0       27.0
    Universit...        70.0        13.0  ...       3.0        4.0
    Alabama S...         2.0        92.0  ...       2.0        1.0
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  4. To complete the rounding exercise, divide by 100:
    >>> college_ugds_op_round =(
    ...     (college_ugds + 0.00501) // 0.01 / 100
    ... )
    >>> college_ugds_op_round.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    
  5. Now use the round DataFrame method to do the rounding automatically for us. Due to bankers rounding, we add a small fraction before rounding:
    >>> college_ugds_round = (college_ugds + 0.00001).round(2)
    >>> college_ugds_round
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    ...                  ...         ...  ...       ...        ....
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  6. Use the equals DataFrame method to test the equality of two DataFrames:
    >>> college_ugds_op_round.equals(college_ugds_round)
    True
    

How it works...

Steps 1 and 2 use the plus operator, which attempts to add a scalar value to each value of each column of the DataFrame. As the columns are all numeric, this operation works as expected. There are some missing values in each of the columns but they stay missing after the operation.

Mathematically, adding .005 should be enough so that the floor division in the next step correctly rounds to the nearest whole percentage. The trouble appears because of the inexactness of floating-point numbers:

>>> 0.045 + 0.005
0.049999999999999996

There is an extra .00001 added to each number to ensure that the floating-point representation has the first four digits the same as the actual value. This works because the maximum precision of all the points in the dataset is four decimal places.

Step 3 applies the floor division operator, //, to all the values in the DataFrame. As we are dividing by a fraction, in essence, it is multiplying each value by 100 and truncating any decimals. Parentheses are needed around the first part of the expression, as floor division has higher precedence than addition. Step 4 uses the division operator to return the decimal to the correct position.

In step 5, we reproduce the previous steps with the round method. Before we can do this, we must again add an extra .00001 to each DataFrame value for a different reason from step 2. NumPy and Python 3 round numbers that are exactly halfway between either side to the even number. The bankers rounding (or ties to even http://bit.ly/2x3V5TU) technique is not usually what is formally taught in schools. It does not consistently bias numbers to the higher side (http://bit.ly/2zhsPy8).

It is necessary here to round up so that both DataFrame values are equal. The .equals method determines if all the elements and indexes between two DataFrames are exactly the same and returns a Boolean.

There's more...

Just as with Series, DataFrames have method equivalents of the operators. You may replace the operators with their method equivalents:

>>> college2 = (
...     college_ugds.add(0.00501).floordiv(0.01).div(100)
... )
>>> college2.equals(college_ugds_op_round)
True
You have been reading a chapter from
Pandas 1.x Cookbook - Second Edition
Published in: Feb 2020
Publisher: Packt
ISBN-13: 9781839213106
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image