You're reading from Python Data Cleaning and Preparation Best Practices A practical guide to organizing and handling data from various sources and formats using Python

Product type Paperback

Published in Sep 2024

Publisher Packt

ISBN-13 9781837634743

Length 456 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Author (1):

Maria Zervou


Table of Contents (19) Chapters

Preface

1. Part 1: Upstream Data Ingestion and Cleaning

2. Chapter 1: Data Ingestion Techniques

3. Chapter 2: Importance of Data Quality

4. Chapter 3: Data Profiling – Understanding Data Structure, Quality, and Distribution

5. Chapter 4: Cleaning Messy Data and Data Manipulation

6. Chapter 5: Data Transformation – Merging and Concatenating

7. Chapter 6: Data Grouping, Aggregation, Filtering, and Applying Functions

8. Chapter 7: Data Sinks

9. Part 2: Downstream Data Cleaning – Consuming Structured Data

10. Chapter 8: Detecting and Handling Missing Values and Outliers

11. Chapter 9: Normalization and Standardization

12. Chapter 10: Handling Categorical Features

13. Chapter 11: Consuming Time Series Data

14. Part 3: Downstream Data Cleaning – Consuming Unstructured Data

15. Chapter 12: Text Preprocessing in the Era of LLMs

16. Chapter 13: Image and Audio Preprocessing with LLMs

17. Index

Why subscribe?

18. Other Books You May Enjoy

Text cleaning

The primary goal of text cleaning is to transform unstructured text into a standardized, more manageable form. Common cleaning operations include removing HTML tags, special characters, and numerical values, standardizing letter case, and fixing whitespace and formatting issues. Together, these operations improve the quality of textual data and reduce its ambiguity. Let’s take a closer look at these techniques.
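As a rough illustration of the operations listed above (HTML removal is covered separately), the following is a minimal sketch using only the standard library `re` module. The function name `basic_clean` and the sample string are hypothetical and not taken from the book, and the character filter assumes plain ASCII English text.

```python
import re

def basic_clean(text: str) -> str:
    """Lowercase, drop digits and special characters, collapse whitespace.

    Hypothetical helper for illustration; assumes ASCII English text.
    """
    text = text.lower()                     # standardize letter case
    text = re.sub(r"\d+", " ", text)        # remove numerical values
    text = re.sub(r"[^a-z\s]", " ", text)   # remove special characters
    text = re.sub(r"\s+", " ", text)        # collapse runs of whitespace
    return text.strip()                     # trim leading/trailing spaces

cleaned = basic_clean("  Great PRODUCT!!!  Bought 2 of them...\n\tWould buy again :) ")
print(cleaned)
```

Whether to drop digits or punctuation depends on the downstream task; for example, numbers can matter in reviews that mention quantities or ratings, so each step here should be treated as optional.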

Removing HTML tags and special characters

HTML tags often appear in text that has been extracted from web pages. Tags such as <p>, <a>, or <div> describe page structure rather than content, so they carry little semantic meaning for most NLP tasks and are usually removed. The cleaning process involves identifying and stripping these tags, leaving behind only the actual words.
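One possible way to strip tags, sketched here with the standard library `html.parser` module rather than any specific approach from the book, is to collect only the text nodes the parser reports. The class `_TextExtractor`, the function `strip_html`, and the sample string are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not merge,
        # then collapse repeated whitespace. This can leave a stray space
        # before punctuation that directly follows an inline tag.
        return " ".join(" ".join(self._chunks).split())

def strip_html(raw: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()

# Illustrative input only, not data from the book
sample = ('<div><p>Loved it &amp; would <a href="#">recommend</a> '
          'to friends.</p><p>Fast delivery</p></div>')
stripped = strip_html(sample)
print(stripped)
```

A parser-based approach tends to be more robust than a regular expression such as `<[^>]+>` when attributes contain angle brackets or the markup is malformed. Note that this sketch also keeps the contents of `<script>` and `<style>` elements, which would need separate handling.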

For this example, let’s consider a scenario where we have a dataset of user reviews...
