Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python , Third Edition

By Navlani and Idris
Paperback, Feb 2021, 478 pages, 3rd Edition

Python Data Analysis

Getting Started with Python Libraries

As you already know, Python has become one of the most popular, standard languages and is a complete package for data science-based operations. Python offers numerous libraries, such as NumPy, Pandas, SciPy, Scikit-Learn, Matplotlib, Seaborn, and Plotly. These libraries provide a complete ecosystem for data analysis that is used by data analysts, data scientists, and business analysts. Python also offers other features, such as flexibility, being easy to learn, faster development, a large active community, and the ability to work on complex numeric, scientific, and research applications. All these features make it the first choice for data analysis.

In this chapter, we will focus on various data analysis processes, such as KDD, SEMMA, and CRISP-DM. After this, we will provide a comparison between data analysis and data science, as well as the roles and different skillsets for data analysts and data scientists. Finally, we will shift our focus and start installing various Python libraries, IPython, Jupyter Lab, and Jupyter Notebook. We will also look at various advanced features of Jupyter Notebooks.

In this introductory chapter, we will cover the following topics:

  • Understanding data analysis
  • The standard process of data analysis
  • The KDD process
  • SEMMA
  • CRISP-DM
  • Comparing data analysis and data science
  • The skillsets of data analysts and data scientists
  • Installing Python 3
  • Software used in this book
  • Using IPython as a shell
  • Using Jupyter Lab
  • Using Jupyter Notebooks
  • Advanced features of Jupyter Notebooks

Let's get started!

Understanding data analysis

The 21st century is the century of information. We are living in the age of information, which means that almost every aspect of our daily life is generating data. Not only this, but business operations, government operations, and social posts are also generating huge data. This data is accumulating day by day due to data being continually generated from business, government, scientific, engineering, health, social, climate, and environmental activities. In all these domains of decision-making, we need a systematic, generalized, effective, and flexible system for the analytical and scientific process so that we can gain insights into the data that is being generated.

In today's smart world, data analysis offers an effective decision-making process for business and government operations. Data analysis is the activity of inspecting, pre-processing, exploring, describing, and visualizing the given dataset. The main objective of the data analysis process is to discover the required information for decision-making. Data analysis offers multiple approaches, tools, and techniques, all of which can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the core fundamental data analysis libraries of the Python ecosystem:

  • NumPy: This is short for Numerical Python. It is the core scientific computing library in Python for handling multidimensional arrays and matrices, and it provides methods for computing mathematics efficiently.
  • SciPy: This is also a powerful scientific computing library for performing scientific, mathematical, and engineering operations.
  • Pandas: This is a data exploration and manipulation library that offers tabular data structures such as DataFrames and various methods for data analysis and manipulation.
  • Scikit-learn: The name comes from "SciKit" (SciPy Toolkit), reflecting its origin as an add-on toolkit for SciPy. It is a machine learning library that offers a variety of supervised and unsupervised algorithms, such as regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.
  • Matplotlib: This is a core data visualization library and is the base for several other Python visualization libraries, such as Seaborn. It offers 2D and 3D plots, graphs, charts, and figures for data exploration. It runs on top of NumPy.
  • Seaborn: This is based on Matplotlib and offers easy-to-draw, high-level, and more organized statistical plots.
  • Plotly: Plotly is a data visualization library. It offers high quality and interactive graphs, such as scatter charts, line charts, bar charts, histograms, boxplots, heatmaps, and subplots.

Installation instructions for the required libraries and software will be provided throughout this book when they're needed. In the meantime, let's discuss various data analysis processes, such as the standard process, KDD, SEMMA, and CRISP-DM.

The standard process of data analysis

Data analysis refers to investigating the data, finding meaningful insights from it, and drawing conclusions. The main goal of this process is to collect, filter, clean, transform, explore, describe, visualize, and communicate the insights from this data to discover decision-making information. Generally, the data analysis process is comprised of the following phases:

  1. Collecting Data: Collect and gather data from several sources.
  2. Preprocessing Data: Filter, clean, and transform the data into the required format.
  3. Analyzing and Finding Insights: Explore, describe, and visualize the data and find insights and conclusions.
  4. Insights Interpretations: Understand the insights and find the impact each variable has on the system.
  5. Storytelling: Communicate your results in the form of a story so that a layman can understand them.
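The five phases above can be sketched in a few lines of plain Python. This is only an illustrative outline, not the book's own code: the CSV text, the column names, and every function name are hypothetical, and a real project would typically use Pandas rather than the standard library.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from files, APIs, or databases.
RAW_CSV = """region,units_sold
north,120
south,95
east,
west,143
north,130
"""

def collect(text):
    """1. Collecting data: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def preprocess(rows):
    """2. Preprocessing data: drop rows with missing values and convert types."""
    return [
        {"region": r["region"], "units_sold": int(r["units_sold"])}
        for r in rows
        if r["units_sold"].strip()
    ]

def analyze(rows):
    """3. Analyzing: describe the data with summary statistics."""
    units = [r["units_sold"] for r in rows]
    return {"count": len(units), "mean": statistics.mean(units), "max": max(units)}

def interpret(rows):
    """4. Interpreting: find which region produced the best single result."""
    return max(rows, key=lambda r: r["units_sold"])["region"]

def tell_story(summary, best_region):
    """5. Storytelling: phrase the findings in plain language."""
    return (
        f"Across {summary['count']} valid records, average sales were "
        f"{summary['mean']:.1f} units; the {best_region} region had the best result."
    )

clean_rows = preprocess(collect(RAW_CSV))
summary = analyze(clean_rows)
best_region = interpret(clean_rows)
print(tell_story(summary, best_region))
```

Keeping each phase in its own function makes it easy to swap in a different data source or a richer analysis step without touching the rest of the pipeline.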

We can summarize these steps of the data analysis process via the following process diagram:

In this section, we have covered the standard data analysis process, which emphasizes finding interpretable insights and converting them into a user story. In the next section, we will discuss the KDD process.

The KDD process

The KDD acronym stands for Knowledge Discovery from Data or Knowledge Discovery in Databases. Many people treat KDD as a synonym for data mining, which refers to the process of discovering interesting patterns in data. The main objective of KDD is to extract or discover hidden, interesting patterns from large databases, data warehouses, and other web and information repositories. The KDD process has seven major phases:

  1. Data Cleaning: In this first phase, data is preprocessed. Here, noise is removed, missing values are handled, and outliers are detected.
  2. Data Integration: In this phase, data from different sources is combined and integrated using data migration and ETL tools.
  3. Data Selection: In this phase, the data that is relevant to the analysis task is retrieved.
  4. Data Transformation: In this phase, data is engineered into the form required for analysis.
  5. Data Mining: In this phase, data mining techniques are used to discover useful and previously unknown patterns.
  6. Pattern Evaluation: In this phase, the extracted patterns are evaluated.
  7. Knowledge Presentation: After pattern evaluation, the extracted knowledge needs to be visualized and presented to business people for decision-making purposes.
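To make the seven phases more concrete, here is a minimal sketch that walks a toy dataset through each of them. The transaction records, the `min_support` threshold, and the choice of frequent-pair counting as the mining technique are all assumptions made for illustration; real KDD projects use far larger data and dedicated tools.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction records from two sources (illustrative data only).
store_a = [
    {"id": 1, "items": "bread, milk"},
    {"id": 2, "items": "bread, butter, milk"},
    {"id": 3, "items": None},  # a record with a missing value
]
store_b = [
    {"id": 4, "items": "milk, bread"},
    {"id": 5, "items": "butter"},
]

# 1. Data cleaning: drop records with missing item lists.
cleaned_a = [r for r in store_a if r["items"]]
cleaned_b = [r for r in store_b if r["items"]]

# 2. Data integration: combine the two sources.
integrated = cleaned_a + cleaned_b

# 3. Data selection: keep only the field relevant to the task.
selected = [r["items"] for r in integrated]

# 4. Data transformation: turn each string into a sorted list of unique items.
baskets = [sorted({item.strip() for item in s.split(",")}) for s in selected]

# 5. Data mining: count how often each pair of items is bought together.
pair_counts = Counter(pair for basket in baskets for pair in combinations(basket, 2))

# 6. Pattern evaluation: keep pairs that meet an assumed minimum support.
min_support = 2
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}

# 7. Knowledge presentation: report the surviving patterns in plain language.
for pair, n in sorted(frequent.items()):
    print(f"{pair[0]} and {pair[1]} were bought together in {n} baskets")
```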

The complete KDD process is shown in the following diagram:

KDD is an iterative process for enhancing data quality, integration, and transformation to get a more improved system. Now, let's discuss the SEMMA process.

SEMMA

The SEMMA acronym stands for Sample, Explore, Modify, Model, and Assess. This sequential data mining process was developed by SAS. The SEMMA process has five major phases:

  1. Sample: In this phase, we identify different databases and merge them. After this, we select the data sample that's sufficient for the modeling process.
  2. Explore: In this phase, we understand the data, discover the relationships among variables, visualize the data, and get initial interpretations.
  3. Modify: In this phase, data is prepared for modeling. This phase involves dealing with missing values, detecting outliers, transforming features, and creating new additional features.
  4. Model: In this phase, the main concern is selecting and applying different modeling techniques, such as linear and logistic regression, backpropagation networks, KNN, support vector machines, decision trees, and Random Forest.
  5. Assess: In this last phase, the predictive models that have been developed are evaluated using performance evaluation measures.
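The SEMMA phases can be illustrated with a small, self-contained sketch. Everything here is assumed for demonstration: the synthetic data (roughly y = 3x + 5 with noise), the sample size, the 35/15 split, and the hand-written least-squares fit, which stands in for the library models named above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: y is roughly 3x + 5 with noise; one record lacks y.
population = [(x, 3 * x + 5 + random.gauss(0, 1)) for x in range(200)]
population[10] = (10, None)

# Sample: select a subset that is sufficient for modeling.
sample = random.sample(population, 50)

# Explore: look at the range and center of the observed target values.
observed = [y for _, y in sample if y is not None]
print("y range:", round(min(observed), 1), "to", round(max(observed), 1))
print("y mean:", round(statistics.fmean(observed), 1))

# Modify: handle missing values (dropped here) and split for later assessment.
prepared = [(x, y) for x, y in sample if y is not None]
train, holdout = prepared[:35], prepared[35:]

def fit_line(points):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train)

# Assess: measure the mean squared error on data the model has not seen.
mse = statistics.fmean([(y - (slope * x + intercept)) ** 2 for x, y in holdout])
print(f"slope={slope:.2f}, intercept={intercept:.2f}, holdout MSE={mse:.2f}")
```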

The following diagram shows this process:

The preceding diagram shows the steps involved in the SEMMA process. SEMMA emphasizes model building and assessment. Now, let's discuss the CRISP-DM process.

CRISP-DM

CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. CRISP-DM is a well-defined, well-structured, and well-proven process for machine learning, data mining, and business intelligence projects. It is a robust, flexible, cyclic, useful, and practical approach to solving business problems. The process discovers hidden valuable information or patterns from several databases. The CRISP-DM process has six major phases:

  1. Business Understanding: In this first phase, the main objective is to understand the business scenario and requirements for designing an analytical goal and initial action plan.
  2. Data Understanding: In this phase, the main objective is to understand the data and its collection process, perform data quality checks, and gain initial insights.
  3. Data Preparation: In this phase, the main objective is to prepare analytics-ready data. This involves handling missing values, outlier detection and handling, normalizing data, and feature engineering. This phase is the most time-consuming for data scientists/analysts.
  4. Modeling: This is the most exciting phase of the whole process since this is where you design the model for prediction purposes. First, the analyst needs to decide on the modeling technique and develop models based on data.
  5. Evaluation: Once the model has been developed, it's time to assess and test the model's performance on validation and test data using model evaluation measures such as MSE, RMSE, and R-Square for regression, and accuracy, precision, recall, and the F1-measure for classification.
  6. Deployment: In this final phase, the model that was chosen in the previous step will be deployed to the production environment. This requires a team effort from data scientists, software developers, DevOps experts, and business professionals.
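The evaluation measures named in the Evaluation phase follow directly from their standard definitions. The sketch below computes them by hand on tiny made-up predictions; the function names and sample values are hypothetical, and in practice a library such as Scikit-learn would normally be used instead.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R-Square for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    # Assumes y_true is not constant, otherwise ss_tot would be zero.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up values purely for illustration.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(reg)
print(clf)
```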

The following diagram shows the full cycle of the CRISP-DM process:

The standard process focuses on discovering insights and making interpretations in the form of a story, while KDD focuses on data-driven pattern discovery and visualizing the results. SEMMA mainly focuses on model building tasks, while CRISP-DM focuses on business understanding and deployment. Now that we know about some of the processes surrounding data analysis, let's compare data analysis and data science to find out how they are related, as well as what makes them different from one another.

Comparing data analysis and data science

Data analysis is the process in which data is explored in order to discover patterns that help us make business decisions. It is one of the subdomains of data science. Data analysis methods and tools are widely utilized in several business domains by business analysts, data scientists, and researchers. Its main objective is to improve productivity and profits. Data analysis extracts and queries data from different sources, performs exploratory data analysis, visualizes data, prepares reports, and presents it to the business decision-making authorities.

On the other hand, data science is an interdisciplinary area that uses a scientific approach to extract insights from structured and unstructured data. Data science is an umbrella term that covers data analytics, data mining, machine learning, and other related domains. Data science is not limited to exploratory data analysis; it is also used for developing models and prediction algorithms, such as stock price, weather, disease, and fraud forecasts, and recommendation systems, such as movie, book, and music recommendations.

The roles of data analysts and data scientists

A data analyst collects, filters, and processes data, and applies the required statistical concepts to capture patterns, trends, and insights from it and to prepare reports for making decisions. The main objective of the data analyst is to help companies solve business problems using discovered patterns and trends. The data analyst also assesses the quality of the data and handles the issues concerning data acquisition. A data analyst should be proficient in writing SQL queries, finding patterns, using visualization tools, and using reporting tools such as Microsoft Power BI, IBM Cognos, Tableau, QlikView, Oracle BI, and more.

Data scientists are more technical and mathematical than data analysts. Data scientists are research- and academic-oriented, whereas data analysts are more application-oriented. Data scientists are expected to predict a future event, whereas data analysts extract significant insights out of data. Data scientists develop their own questions, while data analysts find answers to given questions. Finally, data scientists focus on what is going to happen, whereas data analysts focus on what has happened so far. We can summarize these two roles using the following table:

| Features | Data Scientist | Data Analyst |
|---|---|---|
| Background | Predict future events and scenarios based on data | Discover meaningful insights from the data |
| Role | Formulate questions that can profit the business | Solve the business questions to make decisions |
| Type of data | Work on both structured and unstructured data | Only work on structured data |
| Programming | Advanced programming | Basic programming |
| Skillset | Knowledge of statistics, machine learning algorithms, NLP, and deep learning | Knowledge of statistics, SQL, and data visualization |
| Tools | R, Python, SAS, Hadoop, Spark, TensorFlow, and Keras | Excel, SQL, R, Tableau, and QlikView |

Now that we know what defines a data analyst and data scientist, as well as how they are different from each other, let's have a look at the various skills that you would need to become one of them.

The skillsets of data analysts and data scientists

A data analyst is someone who discovers insights from data and creates value out of it. This helps decision-makers understand how the business is performing. Data analysts must acquire the following skills:

  • Exploratory Data Analysis (EDA): EDA is an essential skill for data analysts. It helps with inspecting data to discover patterns, test hypotheses, and check assumptions.
  • Relational Database: Knowledge of at least one relational database tool, such as MySQL or PostgreSQL, is mandatory. SQL is a must for working on relational databases.
  • Visualization and BI Tools: A picture speaks more than words. Visuals have more of an impact on humans and visuals are a clear and easy option for representing the insights. Visualization and BI tools such as Tableau, QlikView, MS Power BI, and IBM Cognos can help analysts visualize and prepare reports.
  • Spreadsheet: Knowledge of MS Excel, WPS Office, LibreOffice Calc, or Google Sheets is mandatory for storing and managing data in tabular form.
  • Storytelling and Presentation Skills: The art of storytelling is another necessary skill. A data analyst should be an expert in connecting data facts to an idea or an incident and turning it into a story.

On the other hand, the primary job of a data scientist is to solve problems using data. To do this, they need to understand the client's requirements, domain, and problem space, and ensure that the client gets exactly what they really want. The tasks that data scientists undertake vary from company to company. Some companies employ data analysts and offer them the title of data scientist just to glorify the job designation. Some combine data analyst tasks with data engineering and call the role data scientist; others assign data scientists to machine learning-intensive tasks with data visualizations.

A data scientist has to be a jack of all trades and wear multiple hats, including those of a data analyst, statistician, mathematician, programmer, and ML or NLP engineer. Most people are not experts in all of these trades, and becoming skilled in them requires a lot of effort and patience. This is why data science cannot be learned in 3 or 6 months. Learning data science is a journey. A data scientist should have a wide variety of skills, such as the following:

  • Mathematics and Statistics: Most machine learning algorithms are based on mathematics and statistics. Knowledge of mathematics helps data scientists develop custom solutions.
  • Databases: Knowledge of SQL allows data scientists to interact with the database and collect the data for prediction and recommendation.
  • Machine Learning: Knowledge of supervised machine learning techniques such as regression analysis, classification techniques, and unsupervised machine learning techniques such as cluster analysis, outlier detection, and dimensionality reduction.
  • Programming Skills: Knowledge of programming helps data scientists automate their suggested solutions. Knowledge of Python and R is recommended.
  • Storytelling and Presentation skills: Communicating the results in the form of storytelling via PowerPoint presentations.
  • Big Data Technology: Knowledge of big data platforms such as Hadoop and Spark helps data scientists develop big data solutions for large-scale enterprises.
  • Deep Learning Tools: Deep learning tools such as TensorFlow and Keras are utilized in NLP and image analytics.

Apart from these skillsets, knowledge of web scraping packages/tools for extracting data from diverse sources, and of web application frameworks such as Flask or Django for designing prototype solutions, is also beneficial. This rounds out the skillset for data science professionals.

Now that we have covered the basics of data analysis and data science, let's dive into the basic setup needed to get started with data analysis. In the next section, we'll learn how to install Python.

Installing Python 3

The installer file for Python 3 can easily be downloaded from the official website (https://www.python.org/downloads/) for Windows, Linux, and Mac, for 32-bit or 64-bit systems. The installer can be run by double-clicking on it. It also includes an IDE named "IDLE" that can be used for development. We will dive deeper into each of the operating systems in the next few sections.

Python installation and setup on Windows

This book is based on the latest Python 3 version. All the code that will be used in this book is written in Python 3, so we need to install Python 3 before we can start coding. Python is an open source, distributed, and freely available language. It is also licensed for commercial use. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard Python implementation, which is guaranteed to be compatible with NumPy.

You can download Python 3.9.x from the Python official website: https://www.python.org/downloads/. Here, you can find installation files for Windows, Linux, Mac OS X, and other OS platforms. You can find instructions for installing and using Python for various operating systems at https://docs.python.org/3.7/using/index.html.

You need to have Python 3.5.x or above installed on your system. The sunset date for Python 2.7 was moved from 2015 to 2020, and at the time of writing, Python 2.7 is no longer supported or maintained by the Python community.
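A quick way to confirm that an interpreter meets this requirement is to inspect `sys.version_info`. The helper name `meets_minimum` and the `MIN_VERSION` constant below are hypothetical; only the 3.5 threshold comes from the text.

```python
import sys

MIN_VERSION = (3, 5)  # minimum version stated in this chapter

def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum

print("Running Python", sys.version.split()[0])
if not meets_minimum():
    print("Please install Python {}.{} or above.".format(*MIN_VERSION))
```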

At the time of writing this book, we had Python 3.8.3 installed as a prerequisite on our Windows 10 virtual machine: https://www.python.org/ftp/python/3.8.3/python-3.8.3.exe.

Python installation and setup on Linux

Installing Python on Linux is significantly easier compared to the other OSes. To install the foundational libraries, run the following command-line instruction:

$ pip3 install numpy scipy pandas matplotlib jupyter notebook

You may need to prefix the preceding command with sudo if you don't have sufficient rights on the machine that you are using.
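After running the pip3 command, you may want to confirm that the libraries can actually be found by your interpreter. The following sketch uses the standard library's `importlib.util.find_spec` for that; the `installation_report` helper and the exact list of module names are assumptions for illustration.

```python
import importlib.util

# Import names of the packages installed above (the "notebook" module is
# provided by the Jupyter Notebook package).
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib", "notebook"]

def installation_report(names):
    """Map each module name to True if it can be located, False otherwise."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

for name, found in installation_report(REQUIRED).items():
    print(f"{name:12s} {'installed' if found else 'MISSING'}")
```

Because `find_spec` only locates a module without importing it, the check stays fast even for large libraries.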

Python installation and setup on Mac OS X with a GUI installer

Python can be installed via the installation file from the Python official website. The installer file can be downloaded from its official web page (https://www.python.org/downloads/mac-osx/) for macOS. This installer also has an IDE named "IDLE" that can be used for development.

Python installation and setup on Mac OS X with brew

For Mac systems, you can use the Homebrew package manager to install Python. It makes it easier to install the applications that developers, researchers, and scientists need. The brew install command is used to install applications such as python3; Python packages such as NLTK or spaCy are then installed with pip.

To install the most recent version of Python, you need to execute the following command in a Terminal:

$ brew install python3

After installation, you can confirm the version of Python you've installed by running the following command:

$ python3 --version
Python 3.7.4

You can also open the Python Shell from the command line by running the following command:

$ python3

Now that we know how to install Python on our system, let's dive into the actual tools that we will need to start data analysis.

Software used in this book

Let's discuss the software that will be used in this book. In this book, we are going to use the Anaconda distribution to analyze data. Before installing it, let's understand what Anaconda is.

A Python program can easily run on any system that has Python installed. We can write a program in Notepad and run it from the command prompt. We can also write and run Python programs in different IDEs, such as Jupyter Notebook, Spyder, and PyCharm. Anaconda is a freely available open source distribution containing various data manipulation IDEs and several packages such as NumPy, SciPy, Pandas, Scikit-learn, and so on for data analysis purposes. Anaconda can easily be downloaded and installed, as follows:

  1. Download the installer from https://www.anaconda.com/distribution/.
  2. Select the operating system that you are using.
  3. From the Python 3.7 section, select the 32-bit or 64-bit installer option and start downloading.
  4. Run the installer by double-clicking on it.
  5. Once the installation is complete, check your program in the Start menu or search for Anaconda in the Start menu.

Anaconda also has Anaconda Navigator, which is a desktop GUI application that can be used to launch applications such as Jupyter Notebook, Spyder, RStudio, Visual Studio Code, and JupyterLab:

Now, let's look at IPython, a shell-based computing environment for data analysis.

Using IPython as a shell

IPython is an interactive shell that is comparable to interactive computing environments such as Matlab or Mathematica. This interactive shell was created for the purpose of quick experimentation. It is a very useful tool for data professionals who are performing small experiments.

IPython shell offers the following features:

  • Easy access to system commands.
  • Easy editing of inline commands.
  • Tab completion, which helps you find commands and speed up your task.
  • Command History, which helps you view previously used commands.
  • Easily execute external Python scripts.
  • Easy debugging with the Python debugger.

Now, let's execute some commands on IPython. To start IPython, use the following command on the command line:

$ ipython3

When you run the preceding command, the following window will appear:

Now, let's understand and execute some commands that the IPython shell provides:

  • History Commands: The history command is used to check the list of previously used commands. The following screenshot shows how to use the history command in IPython:
  • System Commands: We can also run system commands from IPython using the exclamation sign (!). Here, the input command after the exclamation sign is considered a system command. For example, !date will display the current date of the system, while !pwd will show the current working directory:
  • Writing Function: We can write functions as we would write them in any IDE, such as Jupyter Notebook, Python IDLE, PyCharm, or Spyder. Let's look at an example of a function:
  • Quit IPython Shell: You can exit or quit the IPython shell using quit(), exit(), or Ctrl + D:
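As a concrete illustration of the Writing Function item above, here is the kind of small function you might type at the IPython prompt. The function itself is a hypothetical example chosen for brevity, not one taken from the book.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# Calling the function right after defining it, as you would in the shell.
print(fahrenheit_to_celsius(212))  # prints 100.0
```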

In this subsection, we have looked at a few basic commands we can use on the IPython shell. Now, let's discuss how we can use the help command in the IPython shell.

Reading manual pages

In the IPython shell, we can get help on any function using the help command. It is not necessary to type the full name of the function. You can just type in a few initial characters and then press the Tab key, and it will find the word you are looking for. For example, let's use NumPy's arange() function. There are two ways we can find help about functions:

  • Use the help function: Let's type help and write a few initial characters of the function. After that, press the tab key, select a function using the arrow keys, and press the Enter key:
  • Use a question mark: We can also use a question mark after the name of the function. The following screenshot shows an example of this:
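The same documentation can be reached from ordinary Python code. The sketch below uses the built-in `len` function instead of a NumPy function so that it runs without any extra packages; the `doc_summary` helper is hypothetical. Note that the trailing question mark syntax (for example, `len?`) is specific to IPython and Jupyter and is not valid in a plain Python script.

```python
import inspect

def doc_summary(obj):
    """Return the first line of an object's docstring, or an empty string."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else ""

# help() prints the full documentation page for an object.
help(len)

# inspect.getdoc() returns the docstring as text, which is what help() displays.
print(doc_summary(len))
```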

In this subsection, we looked at the help and question mark support that's provided for module functions. We can also get help from library documentation. Let's discuss how to get documentation for data analysis in Python libraries.

Where to find help and references to Python data analysis libraries

The following table lists the documentation websites for the Python data analysis libraries we have discussed in this chapter:

| Packages/Software | Documentation |
|---|---|
| NumPy | https://numpy.org/doc/ |
| SciPy | https://docs.scipy.org/doc/ |
| Pandas | https://pandas.pydata.org/docs/ |
| Matplotlib | https://matplotlib.org/3.2.1/contents.html |
| Seaborn | https://seaborn.pydata.org/ |
| Scikit-learn | https://scikit-learn.org/stable/ |
| Anaconda | https://www.anaconda.com/distribution/ |

You can also find answers to various Python programming questions related to NumPy, SciPy, Pandas, Matplotlib, Seaborn, and Scikit-learn on the StackOverflow platform. You can also raise issues related to the aforementioned libraries on GitHub.

Using JupyterLab

JupyterLab is a next-generation web-based user interface. It offers a combination of data analysis and machine learning product development tools such as a Text Editor, Notebooks, Code Consoles, and Terminals. It's a flexible and powerful tool that should be a part of any data analyst's toolkit:

You can install JupyterLab using conda, pip, or pipenv.

To install using conda, we can use the following command:

$ conda install -c conda-forge jupyterlab

To install using pip, we can use the following command:

$ pip install jupyterlab

To install using pipenv, we can use the following command:

$ pipenv install jupyterlab

In this section, we have learned how to install JupyterLab. In the next section, we will focus on Jupyter Notebooks.

Using Jupyter Notebooks

Jupyter Notebook is a web application that's used to create data analysis notebooks that contain code, text, figures, links, mathematical equations, and charts. Recently, the community introduced the next generation of web-based Jupyter Notebooks, called JupyterLab. You can take a look at these notebook collections at the following links:

Often, these notebooks are used as educational tools or to demonstrate Python software. We can import or export notebooks either from plain Python code or from the special notebook format. The notebooks can be run locally, or we can make them available online by running a dedicated notebook server. Certain cloud computing solutions, such as Wakari, PiCloud, and Google Colaboratory, allow you to run notebooks in the cloud.
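The "special notebook format" mentioned above is a JSON document stored in an .ipynb file. The following is a simplified sketch of that structure and of converting between plain Python code and a notebook. It is an approximation under stated assumptions: real notebooks usually carry extra metadata (such as kernel information), the helper names are hypothetical, and the nbformat and nbconvert packages are the proper tools for production use.

```python
import json

def make_notebook(code, note):
    """Build a minimal nbformat-4-style notebook with one note and one code cell."""
    return {
        "cells": [
            {"cell_type": "markdown", "metadata": {}, "source": [note]},
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": code.splitlines(keepends=True),
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }

def code_from_notebook(notebook):
    """Export the code cells of a notebook back to plain Python source."""
    return "\n".join(
        "".join(cell["source"])
        for cell in notebook["cells"]
        if cell["cell_type"] == "code"
    )

source = "total = 1 + 2\nprint(total)\n"
notebook = make_notebook(source, "# Demo notebook")

# The text that would be written to a file such as demo.ipynb (hypothetical name).
ipynb_text = json.dumps(notebook, indent=1)
print(ipynb_text[:60], "...")
print(code_from_notebook(json.loads(ipynb_text)))
```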

"Jupyter" is an acronym that stands for Julia, Python, and R. Initially, the developers implemented it for these three languages, but now, it is used for various other languages, including C, C++, Scala, Perl, Go, PySpark, and Haskell:

Jupyter Notebook offers the following features:

  • It has the ability to edit code in the browser with proper indentation.
  • It has the ability to execute code from the browser.
  • It has the ability to display output in the browser.
  • It can render graphs, images, and videos in cell output.
  • It has the ability to export notebooks to PDF, HTML, Python file, and LaTeX formats.

We can also use both Python 2 and 3 in Jupyter Notebooks by running the following commands in the Anaconda prompt:

# For Python 2.7
conda create -n py27 python=2.7 ipykernel

# For Python 3.5
conda create -n py35 python=3.5 ipykernel

Now that we know about various tools and libraries and have also installed Python, let's move on to some of the advanced features of the most commonly used tool, Jupyter Notebooks.

Advanced features of Jupyter Notebooks

Jupyter Notebook offers various advanced features, such as keyboard shortcuts, installing other kernels, executing shell commands, and using various extensions for faster data analysis operations. Let's get started and understand these features one by one.

Keyboard shortcuts

Users can find all the shortcut commands that can be used inside Jupyter Notebook by selecting the Keyboard Shortcuts option in the Help menu or by using the Cmd + Shift + P shortcut key (Ctrl + Shift + P on Windows and Linux). This opens the quick select bar, which contains all the shortcut commands, along with a brief description of each. The bar is easy to use, and users can turn to it whenever they forget a shortcut:

Installing other kernels

Jupyter has the ability to run multiple kernels for different languages. It is very easy to set up an environment for a particular language in Anaconda. For example, an R kernel can be set by using the following command in Anaconda:

$ conda install -c r r-essentials

The R kernel should then appear, as shown in the following screenshot:

Running shell commands

In Jupyter Notebook, users can run shell commands on both Unix and Windows. The shell is a command-line interface for communicating with the operating system. To run a command from a cell, put ! (an exclamation mark) in front of it, for example, !ls on Unix or !dir on Windows.
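
The ! prefix only works inside a notebook or IPython session. As a rough sketch of what it does, the following plain-Python snippet runs the same kind of command through the standard subprocess module. The choice of command (asking the current Python interpreter for its version) is an assumption made so that the example behaves the same on Unix and Windows:

```python
import subprocess
import sys

# In a notebook cell, you would simply write:  !python --version
# Outside a notebook, subprocess can run the same command and capture its output.
result = subprocess.run(
    [sys.executable, "--version"],  # the interpreter running this script
    capture_output=True,
    text=True,
)

# Older interpreters printed the version to stderr, so check both streams.
version_text = (result.stdout or result.stderr).strip()
print(version_text)
```

Inside a notebook, the ! form is usually preferable because the command's output is shown directly below the cell.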

Extensions for Notebook

Notebook extensions (or nbextensions) add more features compared to basic Jupyter Notebooks. These extensions improve the user's experience and interface. Users can easily select any of the extensions by selecting the NBextensions tab.

To install the nbextensions configurator, which provides the NBextensions tab, using conda, run the following command:

conda install -c conda-forge jupyter_nbextensions_configurator

To install the nbextensions themselves in Jupyter Notebook using pip, run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

If you get permission errors on macOS, just run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install --user

All the configurable nbextensions will be shown in a different tab, as shown in the following screenshot:

Now, let's explore a few useful features of Notebook extensions:

  • Hinterland: This provides an autocompletion menu on each keypress made in a code cell, similar to the behavior of PyCharm.
  • Table of Contents: This extension shows all the headings in the sidebar or navigation menu. It is resizable, draggable, collapsible, and dockable.

  • Execute Time: This extension shows when each cell was executed and how long it took to run.
  • Spellchecker: This extension checks the spelling of the text written in each cell and highlights any misspelled words.
  • Variable Inspector: This extension keeps track of the user's workspace. It shows the names of all the variables that the user created, along with their type, size, shape, and value.
  • Slideshow: Notebook results can be communicated via Slideshow. This is a great tool for telling stories. Users can easily convert Jupyter Notebooks into slides without the use of PowerPoint. As shown in the following screenshot, Slideshow can be started using the Slideshow option in the cell toolbar of the view menu:

Jupyter Notebook also allows you to show or hide any cell in Slideshow. After adding the Slideshow option to the cell toolbar of the view menu, you can use a Slide Type drop-down list in each cell and select various options, as shown in the following screenshot:

  • Embedding PDF documents: Jupyter Notebook users can easily embed PDF documents. Run the following code to display a PDF document in the cell output:
from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1811.02141.pdf', width=700, height=400)

This results in the following output:

  • Embedding YouTube videos: Jupyter Notebook users can easily embed YouTube videos. Run the following code, passing the video ID, to display a YouTube video in the cell output:
from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU', width=700, height=400)

This results in the following output:

With that, you now understand what data analysis is, the processes it follows, and the roles it involves. You have also learned how to install Python and use Jupyter Lab and Jupyter Notebook. You will learn more about various Python libraries and data analysis techniques in the upcoming chapters.

Summary

In this chapter, we have discussed various data analysis processes, including KDD, SEMMA, and CRISP-DM. We then discussed the roles and skillsets of data analysts and data scientists. After that, we installed NumPy, SciPy, Pandas, Matplotlib, IPython, Jupyter Notebook, Anaconda, and Jupyter Lab, all of which we will be using in this book. Instead of installing all those modules individually, you can install Anaconda, which comes with NumPy, Pandas, SciPy, and Scikit-learn built in.

Then, we got a vector addition program working and learned how NumPy offers superior performance compared to the other libraries. We explored the available documentation and online resources. In addition, we discussed Jupyter Lab, Jupyter Notebook, and their features.

In the next chapter, Chapter 2, NumPy and Pandas, we will take a look at NumPy and Pandas under the hood and explore some of the fundamental concepts surrounding arrays and DataFrames.


Key benefits

  • Prepare and clean your data to use it for exploratory analysis, data manipulation, and data wrangling
  • Discover supervised, unsupervised, probabilistic, and Bayesian machine learning methods
  • Get to grips with graph processing and sentiment analysis

Description

Data analysis enables you to generate value from small and big data by discovering new patterns and trends, and Python is one of the most popular tools for analyzing a wide variety of data. With this book, you’ll get up and running using Python for data analysis by exploring the different phases and methodologies used in data analysis and learning how to use modern libraries from the Python ecosystem to create efficient data pipelines. Starting with the essential statistical and data analysis fundamentals using Python, you’ll perform complex data analysis and modeling, data manipulation, data cleaning, and data visualization using easy-to-follow examples. You’ll then understand how to conduct time series analysis and signal processing using ARMA models. As you advance, you’ll get to grips with smart processing and data analytics using machine learning algorithms such as regression, classification, Principal Component Analysis (PCA), and clustering. In the concluding chapters, you’ll work on real-world examples to analyze textual and image data using natural language processing (NLP) and image analytics techniques, respectively. Finally, the book will demonstrate parallel computing using Dask. By the end of this data analysis book, you’ll be equipped with the skills you need to prepare data for analysis and create meaningful data visualizations for forecasting values from data.

Who is this book for?

This book is for data analysts, business analysts, statisticians, and data scientists looking to learn how to use Python for data analysis. Students and academic faculties will also find this book useful for learning and teaching Python data analysis using a hands-on approach. A basic understanding of math and working knowledge of the Python programming language will help you get started with this book.

What you will learn

  • Explore data science and its various process models
  • Perform data manipulation using NumPy and pandas for aggregating, cleaning, and handling missing values
  • Create interactive visualizations using Matplotlib, Seaborn, and Bokeh
  • Retrieve, process, and store data in a wide range of formats
  • Understand data preprocessing and feature engineering using pandas and scikit-learn
  • Perform time series analysis and signal processing using sunspot cycle data
  • Analyze textual data and image data to perform advanced analysis
  • Get up to speed with parallel computing using Dask

Product Details

Publication date: Feb 05, 2021
Length: 478 pages
Edition: 3rd
Language: English
ISBN-13: 9781789955248


Table of Contents

19 Chapters
Section 1: Foundation for Data Analysis
Getting Started with Python Libraries
NumPy and pandas
Statistics
Linear Algebra
Section 2: Exploratory Data Analysis and Data Cleaning
Data Visualization
Retrieving, Processing, and Storing Data
Cleaning Messy Data
Signal Processing and Time Series
Section 3: Deep Dive into Machine Learning
Supervised Learning - Regression Analysis
Supervised Learning - Classification Techniques
Unsupervised Learning - PCA and Clustering
Section 4: NLP, Image Analytics, and Parallel Computing
Analyzing Textual Data
Analyzing Image Data
Parallel Computing Using Dask
Other Books You May Enjoy

Customer reviews

Rating: 4.5 out of 5 (13 ratings)
5 star 76.9%
4 star 7.7%
3 star 7.7%
2 star 0%
1 star 7.7%

Top Reviews

Sameet, Feb 24, 2021 (5 out of 5, Amazon verified review)
I would definitely recommend this book to anyone who would like to learn machine learning with python it starts right from basics to NLP

Rachel Mae Lademora, Jan 03, 2024 (5 out of 5, Amazon verified review)
This book is very informative and well explained in detailed it helped me a lot.

Rob, Jul 19, 2021 (5 out of 5, Amazon verified review)
Great book. Curated to be easy to read most important concepts and tools.

Nithin, Apr 08, 2021 (5 out of 5, Amazon verified review)
Python Data Analysis is a great book for beginners as well as anyone looking to brush up Python skills. The best thing about this book is, the author does not assume the target audience to have any Python knowledge and hence, all the topics are detailed with theory and workable practical code. The book covers in detail on Data Visualization packages and methods and touches on Machine Learning and Linear Regression towards the end. Happy learning!

ER, Feb 24, 2021 (5 out of 5, Amazon verified review)
If you prefer breadth over depth then this book is for you. It covers a thorough array (no pun intended) of material needed to get started in data science, from setting up your computer all the way to implementing ML algorithms at scale. It provides high level background knowledge about the topics it covers; introduces libraries, techniques, and algorithms commonly used in DS; and includes code examples. That being said, this is not the place to go for deep level explanations or details about potential problems and pitfalls or things to look out for. I like this book as an easy to navigate reference where I can quickly find what I am looking for and get a general idea of what I need to do and what my options are.