Python Data Analysis - Third Edition

By: Avinash Navlani, Ivan Idris

Overview of this book

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.
Table of Contents (20 chapters)
Section 1: Foundation for Data Analysis
Section 2: Exploratory Data Analysis and Data Cleaning
Section 3: Deep Dive into Machine Learning
Section 4: NLP, Image Analytics, and Parallel Computing

The KDD process

The KDD acronym stands for Knowledge Discovery from Data or Knowledge Discovery in Databases. Many people treat KDD as a synonym for data mining, which is described as the process of discovering interesting patterns in data. The main objective of KDD is to extract hidden, interesting patterns from large databases, data warehouses, and other web and information repositories. The KDD process has seven major phases:

  1. Data Cleaning: In this first phase, data is preprocessed. Here, noise is removed, missing values are handled, and outliers are detected.
  2. Data Integration: In this phase, data from different sources is combined using data migration and ETL tools.
  3. Data Selection: In this phase, the data relevant to the analysis task is retrieved.
  4. Data Transformation: In this phase, data is transformed into the form required for analysis.
  5. Data Mining: In this phase, data mining techniques are used to discover useful and previously unknown patterns.
  6. Pattern Evaluation: In this phase, the extracted patterns are evaluated.
  7. Knowledge Presentation: After pattern evaluation, the extracted knowledge is visualized and presented to business people for decision-making purposes.
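
To make the phases more concrete, here is a minimal sketch of how they might map onto a small pandas workflow. Everything in it is an assumption made for illustration: the `orders` and `customers` tables, the column names, the `run_kdd_sketch` function, the IQR rule used for outliers, and the `min_customers` threshold used to evaluate patterns. Real projects will use different tools and criteria at each phase.

```python
import pandas as pd

# Hypothetical toy data standing in for two separate sources.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4],
    "amount": [20.0, 25.0, None, 30.0, 200.0, 220.0, 9999.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "segment": ["retail", "retail", "wholesale", "retail"],
    "phone": ["555-0101", "555-0102", "555-0103", "555-0104"],
})


def run_kdd_sketch(orders, customers, min_customers=2):
    """Walk two toy tables through the seven phases (illustrative only)."""
    # 1. Data cleaning: drop missing amounts, then remove outliers
    #    with the 1.5 * IQR rule (one common choice among many).
    cleaned = orders.dropna(subset=["amount"])
    q1, q3 = cleaned["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = cleaned["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    cleaned = cleaned[in_range]

    # 2. Data integration: combine the two sources on a shared key.
    integrated = cleaned.merge(customers, on="customer_id", how="left")

    # 3. Data selection: keep only the columns relevant to the task.
    selected = integrated[["customer_id", "segment", "amount"]]

    # 4. Data transformation: reshape to one row per customer.
    per_customer = (
        selected.groupby(["customer_id", "segment"], as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "total_spend"})
    )

    # 5. Data mining: a deliberately simple "pattern", the mean spend
    #    per customer segment.
    patterns = per_customer.groupby("segment").agg(
        customers=("customer_id", "nunique"),
        mean_spend=("total_spend", "mean"),
    )

    # 6. Pattern evaluation: keep patterns backed by enough customers
    #    (min_customers is a hypothetical quality threshold).
    evaluated = patterns[patterns["customers"] >= min_customers]

    # 7. Knowledge presentation: return a tidy, sorted summary table.
    return evaluated.sort_values("mean_spend", ascending=False).reset_index()


report = run_kdd_sketch(orders, customers)
print(report.to_string(index=False))
```

The function keeps each phase as a separate, named step so that any one of them can be revisited without touching the others, which suits the iterative nature of the process.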

The complete KDD process is shown in the following diagram:

KDD is an iterative process: data quality, integration, and transformation are refined over repeated passes to produce better results. Now, let's discuss the SEMMA process.