You're reading from Python Data Cleaning Cookbook Modern techniques and Python tools to detect and remove dirty data and extract key insights

Product type Paperback

Published in Dec 2020

Publisher Packt

ISBN-13 9781800565661

Length 436 pages

Edition 1st Edition

Languages

Python

Tools

Pandas

Concepts

Data Analysis

Authors (2):

Michael B Walker

Michael Walker

View More author details

Table of Contents (12) Chapters

Preface

1. Chapter 1: Anticipating Data Cleaning Issues when Importing Tabular Data into pandas

2. Chapter 2: Anticipating Data Cleaning Issues when Importing HTML and JSON into pandas FREE CHAPTER

3. Chapter 3: Taking the Measure of Your Data

4. Chapter 4: Identifying Missing Values and Outliers in Subsets of Data

5. Chapter 5: Using Visualizations for the Identification of Unexpected Values

6. Chapter 6: Cleaning and Exploring Data with Series Operations

7. Chapter 7: Fixing Messy Data when Aggregating

8. Chapter 8: Addressing Data Issues When Combining DataFrames

9. Chapter 9: Tidying and Reshaping Data

10. Chapter 10: User-Defined Functions and Classes to Automate Data Cleaning

11. Other Books You May Enjoy

Leave a review - let other readers know what you think

Removing duplicated rows

There are several reasons why we might have data duplicated at the unit of analysis:

The existing DataFrame may be the result of a one-to-many merge, and the one side is the unit of analysis.
The DataFrame is repeated measures or panel data collapsed into a flat file, which is just a special case of the first situation.
We may be working with an analysis file where multiple one-to-many relationships have been flattened, creating many-to-many relationships.

When the one side is the unit of analysis, data on the many side may need to be collapsed in some way. For example, if we are analyzing outcomes for a cohort of students at a college, students are the unit of analysis; but we may also have course enrollment data for each student. To prepare the data for analysis, we might need to first count the number of courses, sum the total credits, or calculate the GPA for each student, before ending up with one row per student. To generalize from...