Python Data Analysis - Third Edition

By: Avinash Navlani, Ivan Idris
Overview of this book

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.
Table of Contents (20 chapters)

Section 1: Foundation for Data Analysis
Section 2: Exploratory Data Analysis and Data Cleaning
Section 3: Deep Dive into Machine Learning
Section 4: NLP, Image Analytics, and Parallel Computing

Splitting training and testing sets

Data scientists need to assess the performance of a model, overcome overfitting, and tune the hyperparameters. All of these tasks require held-out data records that were not used during the model development phase. Before model development, the data therefore needs to be divided into parts, such as the train, test, and validation sets. The training set is used to build the model. The test set is used to assess the performance of a model that was trained on the training set. The validation set is used to tune the hyperparameters. Let's look at the following strategies for the train-test split in the upcoming subsections (a short illustrative split is sketched after this list):

  • Holdout method
  • K-fold cross-validation
  • Bootstrap method
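
As a quick illustration of the three-way split described above, here is a minimal sketch, assuming scikit-learn's train_test_split and an illustrative sample dataset (the dataset choice and variable names are hypothetical, not taken from the book). It carves the data into roughly 60% training, 20% validation, and 20% test records:

# Illustrative three-way split: ~60% train, 20% validation, 20% test.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Example feature matrix X and target vector y.
X, y = load_wine(return_X_y=True)

# First hold out 40% of the records, then split that 40% in half
# to obtain disjoint validation and test sets.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))

Splitting twice keeps the three subsets disjoint, so the validation set can be used for hyperparameter tuning without touching the test set.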

Holdout

In this method, the dataset is randomly divided into two parts: a training set and a testing set. Generally, the ratio is 2:1, which means 2/3 for training and 1/3 for testing. We can also split it into different ratios, such as 6:4, 7:3, or 8:2:

# partition data into training and testing sets
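# A minimal sketch of the 2:1 holdout split (assumes scikit-learn's
# train_test_split; the sample dataset and variable names are illustrative
# and not necessarily the book's exact listing).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Example feature matrix X and target vector y.
X, y = load_iris(return_X_y=True)

# Hold out 1/3 of the records for testing; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=1
)

print(X_train.shape, X_test.shape)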