Python Data Analysis

Getting Started with Python Libraries

As you already know, Python has become one of the most popular, standard languages and is a complete package for data science-based operations. Python offers numerous libraries, such as NumPy, Pandas, SciPy, Scikit-Learn, Matplotlib, Seaborn, and Plotly. These libraries provide a complete ecosystem for data analysis that is used by data analysts, data scientists, and business analysts. Python also offers other features, such as flexibility, being easy to learn, faster development, a large active community, and the ability to work on complex numeric, scientific, and research applications. All these features make it the first choice for data analysis.

In this chapter, we will focus on various data analysis processes, such as KDD, SEMMA, and CRISP-DM. After this, we will provide a comparison between data analysis and data science, as well as the roles and different skillsets for data analysts and data scientists. Finally, we will shift our focus and start installing various Python libraries, IPython, Jupyter Lab, and Jupyter Notebook. We will also look at various advanced features of Jupyter Notebooks.

In this introductory chapter, we will cover the following topics:

Understanding data analysis
The standard process of data analysis
The KDD process
SEMMA
CRISP-DM
Comparing data analysis and data science
The skillsets of data analysts and data scientists
Installing Python 3
Software used in this book
Using IPython as a shell

Using Jupyter Lab
Using Jupyter Notebooks
Advanced features of Jupyter Notebooks

Let's get started!

Understanding data analysis

The 21st century is the century of information. We are living in the age of information, which means that almost every aspect of our daily life is generating data. Not only this, but business operations, government operations, and social posts are also generating huge data. This data is accumulating day by day due to data being continually generated from business, government, scientific, engineering, health, social, climate, and environmental activities. In all these domains of decision-making, we need a systematic, generalized, effective, and flexible system for the analytical and scientific process so that we can gain insights into the data that is being generated.

In today's smart world, data analysis offers an effective decision-making process for business and government operations. Data analysis is the activity of inspecting, pre-processing, exploring, describing, and visualizing the given dataset. The main objective of the data analysis process is to discover the required information for decision-making. Data analysis offers multiple approaches, tools, and techniques, all of which can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the core fundamental data analysis libraries of the Python ecosystem:

NumPy: This is a short form of numerical Python. It is the most powerful scientific library available in Python for handling multidimensional arrays, matrices, and methods in order to compute mathematics efficiently.
SciPy: This is also a powerful scientific computing library for performing scientific, mathematical, and engineering operations.
Pandas: This is a data exploration and manipulation library that offers tabular data structures such as DataFrames and various methods for data analysis and manipulation.
Scikit-learn: This stands for "Scientific Toolkit for Machine learning". It is a machine learning library that offers a variety of supervised and unsupervised algorithms, such as regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.

Matplotlib: This is a core data visualization library and is the base library for all other visualization libraries in Python. It offers 2D and 3D plots, graphs, charts, and figures for data exploration. It runs on top of NumPy and SciPy.
Seaborn: This is based on Matplotlib and offers easy to draw, high-level, interactive, and more organized plots.
Plotly: Plotly is a data visualization library. It offers high quality and interactive graphs, such as scatter charts, line charts, bar charts, histograms, boxplots, heatmaps, and subplots.

Installation instructions for the required libraries and software will be provided throughout this book when they're needed. In the meantime, let's discuss various data analysis processes, such as the standard process, KDD, SEMMA, and CRISP-DM.

CRISP-DM

CRISP-DM's full form is CRoss-InduStry Process for Data Mining. CRISP-DM is a well-defined, well-structured, and well-proven process for machine learning, data mining, and business intelligence projects. It is a robust, flexible, cyclic, useful, and practical approach to solving business problems. The process discovers hidden valuable information or patterns from several databases. The CRISP-DM process has six major phases:

Business Understanding: In this first phase, the main objective is to understand the business scenario and requirements for designing an analytical goal and initial action plan.
Data Understanding: In this phase, the main objective is to understand the data and its collection process, perform data quality checks, and gain initial insights.
Data Preparation: In this phase, the main objective is to prepare analytics-ready data. This involves handling missing values, outlier detection and handling, normalizing data, and feature engineering. This phase is the most time-consuming for data scientists/analysts.
Modeling: This is the most exciting phase of the whole process since this is where you design the model for prediction purposes. First, the analyst needs to decide on the modeling technique and develop models based on data.
Evaluation: Once the model has been developed, it's time to assess and test the model's performance on validation and test data using model evaluation measures such as MSE, RMSE, R-Square for regression and accuracy, precision, recall, and the F1-measure.
Deployment: In this final phase, the model that was chosen in the previous step will be deployed to the production environment. This requires a team effort from data scientists, software developers, DevOps experts, and business professionals.

The following diagram shows the full cycle of the CRISP-DM process:

The standard process focuses on discovering insights and making interpretations in the form of a story, while KDD focuses on data-driven pattern discovery and visualizing this. SEMMA majorly focuses on model building tasks, while CRISP-DM focuses on business understanding and deployment. Now that we know about some of the processes surrounding data analysis, let's compare data analysis and data science to find out how they are related, as well as what makes them different from one other.

Comparing data analysis and data science

Data analysis is the process in which data is explored in order to discover patterns that help us make business decisions. It is one of the subdomains of data science. Data analysis methods and tools are widely utilized in several business domains by business analysts, data scientists, and researchers. Its main objective is to improve productivity and profits. Data analysis extracts and queries data from different sources, performs exploratory data analysis, visualizes data, prepares reports, and presents it to the business decision-making authorities.

On the other hand, data science is an interdisciplinary area that uses a scientific approach to extract insights from structured and unstructured data. Data science is a union of all terms, including data analytics, data mining, machine learning, and other related domains. Data science is not only limited to exploratory data analysis and is used for developing models and prediction algorithms such as stock price, weather, disease, fraud forecasts, and recommendations such as movie, book, and music recommendations.

The roles of data analysts and data scientists

A data analyst collects, filters, processes, and applies the required statistical concepts to capture patterns, trends, and insights from data and prepare reports for making decisions. The main objective of the data analyst is to help companies solve business problems using discovered patterns and trends. The data analyst also assesses the quality of the data and handles the issues concerning data acquisition. A data analyst should be proficient in writing SQL queries, finding patterns, using visualization tools, and using reporting tools Microsoft Power BI, IBM Cognos, Tableau, QlikView, Oracle BI, and more.

Data scientists are more technical and mathematical than data analysts. Data scientists are research- and academic-oriented, whereas data analysts are more application-oriented. Data scientists are expected to predict a future event, whereas data analysts extract significant insights out of data. Data scientists develop their own questions, while data analysts find answers to given questions. Finally, data scientists focus on what is going to happen, whereas data analysts focus on what has happened so far. We can summarize these two roles using the following table:

Features	Data Scientist	Data Analyst
Background	Predict future events and scenarios based on data	Discover meaningful insights from the data.
Role	Formulate questions that can profit the business	Solve the business questions to make decisions.
Type of data	Work on both structured and unstructured data	Only work on structured data
Programming	Advanced programming	Basic programming
Skillset	Knowledge of statistics, machine learning algorithms, NLP, and deep learning	Knowledge of statistics, SQL, and data visualization
Tools	R, Python, SAS, Hadoop, Spark, TensorFlow, and Keras	Excel, SQL, R, Tableau, and QlikView

Now that we know what defines a data analyst and data scientist, as well as how they are different from each other, let's have a look at the various skills that you would need to become one of them.

The skillsets of data analysts and data scientists

A data analyst is someone who discovers insights from data and creates value out of it. This helps decision-makers understand how the business is performing. Data analysts must acquire the following skills:

Exploratory Data Analysis (EDA): EDA is an essential skill for data analysts. It helps with inspecting data to discover patterns, test hypotheses, and assure assumptions.
Relational Database: Knowledge of at least one of the relational database tools, such as MySQL or Postgre, is mandatory. SQL is a must for working on relational databases.
Visualization and BI Tools: A picture speaks more than words. Visuals have more of an impact on humans and visuals are a clear and easy option for representing the insights. Visualization and BI tools such as Tableau, QlikView, MS Power BI, and IBM Cognos can help analysts visualize and prepare reports.
Spreadsheet: Knowledge of MS Excel, WPS, Libra, or Google Sheets is mandatory for storing and managing data in tabular form.
Storytelling and Presentation Skills: The art of storytelling is another necessary skill. A data analyst should be an expert in connecting data facts to an idea or an incident and turning it into a story.

On the other hand, the primary job of a data scientist is to solve problems using data. In order to do this, they need to understand the client's requirements, their domain, their problem space, and ensure that they get exactly what they really want. The tasks that data scientists undertake vary from company to company. Some companies use data analysts and offer the title of data scientist just to glorify the job designation. Some combine data analyst tasks with data engineers and offer data scientists designation; others assign them to machine learning-intensive tasks with data visualizations.

The task of the data scientist varies, depending on the company. Some employ data scientists as well-known data analysts and combine their responsibilities with data engineers. Others give them the task of performing intensive data visualization on machines.

A data scientist has to be a jack of all trades and wear multiple hats, including those of a data analyst, statistician, mathematician, programmer, ML, or NLP engineer. Most people are not skilled enough or experts in all these trades. Also, getting skilled enough requires lots of effort and patience. This is why data science cannot be learned in 3 or 6 months. Learning data science is a journey. A data scientist should have a wide variety of skills, such as the following:

Mathematics and Statistics: Most machine learning algorithms are based on mathematics and statistics. Knowledge of mathematics helps data scientists develop custom solutions.
Databases: Knowledge of SQL allows data scientists to interact with the database and collect the data for prediction and recommendation.
Machine Learning: Knowledge of supervised machine learning techniques such as regression analysis, classification techniques, and unsupervised machine learning techniques such as cluster analysis, outlier detection, and dimensionality reduction.
Programming Skills: Knowledge of programming helps data scientists automate their suggested solutions. Knowledge of Python and R is recommended.
Storytelling and Presentation skills: Communicating the results in the form of storytelling via PowerPoint presentations.
Big Data Technology: Knowledge of big data platforms such as Hadoop and Spark helps data scientists develop big data solutions for large-scale enterprises.
Deep Learning Tools: Deep learning tools such as Tensorflow and Keras are utilized in NLP and image analytics.

Apart from these skillsets, knowledge of web scraping packages/tools for extracting data from diverse sources and web application frameworks such as Flask or Django for designing prototype solutions is also obtained. It is all about the skillset for data science professionals.

Now that we have covered the basics of data analysis and data science, let's dive into the basic setup needed to get started with data analysis. In the next section, we'll learn how to install Python.

Installing Python 3

The installer file for installing Python 3 can easily be downloaded from the official website (https://www.python.org/downloads/) for Windows, Linux, and Mac 32-bit or 64-bit systems. The installer can be installed by double-clicking on it. This installer also has an IDE named "IDLE" that can be used for development. We will dive deeper into each of the operating systems in the next few sections.

Python installation and setup on Windows

This book is based on the latest Python 3 version. All the code that will be used in this book is written in Python 3, so we need to install Python 3 before we can start coding. Python is an open source, distributed, and freely available language. It is also licensed for commercial use. There are many implementations of Python, including commercial implementations and distributions. In this book, we will focus on the standard Python implementation, which is guaranteed to be compatible with NumPy.

You can download Python 3.9.x from the Python official website: https://www.python.org/downloads/. Here, you can find installation files for Windows, Linux, Mac OS X, and other OS platforms. You can find instructions for installing and using Python for various operating systems at https://docs.python.org/3.7/using/index.html.

You need to have Python 3.5.x or above installed on your system. The sunset date for Python 2.7 was moved from 2015 to 2020, but at the time of writing, Python 2.7 will not be supported and maintained by the Python community.

At the time of writing this book, we had Python 3.8.3 installed as a prerequisite on our Windows 10 virtual machine: https://www.python.org/ftp/python/3.8.3/python-3.8.3.exe.

Python installation and setup on Linux

Installing Python on Linux is significantly easier compared to the other OSes. To install the foundational libraries, run the following command-line instruction:

$ pip3 install numpy scipy pandas matplotlib jupyter notebook

It may be essential to run the sudo command before the preceding command if you don't have sufficient rights on the machine that you are using.

Python installation and setup on Mac OS X with a GUI installer

Python can be installed via the installation file from the Python official website. The installer file can be downloaded from its official web page (https://www.python.org/downloads/mac-osx/) for macOS. This installer also has an IDE named "IDLE" that can be used for development.

Python installation and setup on Mac OS X with brew

For Mac systems, you can use the Homebrew package manager to install Python. It will make it easier to install the required applications for developers, researchers, and scientists. The brew install command is used to install another application, such as installing python3 or any other Python package, such as NLTK or SpaCy.

To install the most recent version of Python, you need to execute the following command in a Terminal:

$ brew install python3

After installation, you can confirm the version of Python you've installed by running the following command:

$ python3 --version
Python 3.7.4

You can also open the Python Shell from the command line by running the following command:

$ python3

Now that we know how to install Python on our system, let's dive into the actual tools that we will need to start data analysis.

Using IPython as a shell

IPython is an interactive shell that is equivalent to an interactive computing environment such as Matlab or Mathematica. This interactive shell was created for the purpose of quick experimentation. It is a very useful tool for data professionals that are performing small experiments.

IPython shell offers the following features:

Easy access to system commands.
Easy editing of inline commands.
Tab completion, which helps you find commands and speed up your task.

Command History, which helps you view previously used commands.
Easily execute external Python scripts.
Easy debugging with the Python debugger.

Now, let's execute some commands on IPython. To start IPython, use the following command on the command line:

$ ipython3

When you run the preceding command, the following window will appear:

Now, let's understand and execute some commands that the IPython shell provides:

History Commands: The history command used to check the list of previously used commands. The following screenshot shows how to use the history command in IPython:

System Commands: We can also run system commands from IPython using the exclamation sign (!). Here, the input command after the exclamation sign is considered a system command. For example, !date will display the current date of the system, while !pwd will show the current working directory:

Writing Function: We can write functions as we would write them in any IDE, such as Jupyter Notebook, Python IDLE, PyCharm, or Spyder. Let's look at an example of a function:

Quit Ipython Shell: You can exit or quit the IPython shell using quit() or exit() or CTRL + D:

You can also quit the IPython shell using the quit() command:

In this subsection, we have looked at a few basic commands we can use on the IPython shell. Now, let's discuss how we can use the help command in the IPython shell.

Reading manual pages

In the IPython shell, we can open a list of available commands using the help command. It is not compulsory to write the full name of the function. You can just type in a few initial characters and then press the tab button, and it will find the word you are looking for. For example, let's use the arrange() function. There are two ways we can find help about functions:

Use the help function: Let's type help and write a few initial characters of the function. After that, press the tab key, select a function using the arrow keys, and press the Enter key:

Use a question mark: We can also use a question mark after the name of the function. The following screenshot shows an example of this:

In this subsection, we looked at the help and question mark support that's provided for module functions. We can also get help from library documentation. Let's discuss how to get documentation for data analysis in Python libraries.

Where to find help and references to Python data analysis libraries

The following table lists the documentation websites for the Python data analysis libraries we have discussed in this chapter:

Packages/Software	Description
NumPy	https://numpy.org/doc/
SciPy	https://docs.scipy.org/doc/
Pandas	https://pandas.pydata.org/docs/
Matplotlib	https://matplotlib.org/3.2.1/contents.html
Seaborn	https://seaborn.pydata.org/
Scikit-learn	https://scikit-learn.org/stable/
Anaconda	https://www.anaconda.com/distribution/

You can also find answers to various Python programming questions related to NumPy, SciPy, Pandas, Matplotlib, Seaborn, and Scikit-learn on the StackOverflow platform. You can also raise issues related to the aforementioned libraries on GitHub.

Advanced features of Jupyter Notebooks

Jupyter Notebook offers various advanced features, such as keyboard shortcuts, installing other kernels, executing shell commands, and using various extensions for faster data analysis operations. Let's get started and understand these features one by one.

Keyboard shortcuts

Users can find all the shortcut commands that can be used inside Jupyter Notebook by selecting the Keyboard Shortcuts option in the Help menu or by using the Cmd + Shift + P shortcut key. This will make the quick select bar appear, which contains all the shortcuts commands, along with a brief description of each. It is easy to use the bar and users can use it when they forget something:

Installing other kernels

Jupyter has the ability to run multiple kernels for different languages. It is very easy to set up an environment for a particular language in Anaconda. For example, an R kernel can be set by using the following command in Anaconda:

$ conda install -c r r-essentials

The R kernel should then appear, as shown in the following screenshot:

Running shell commands

In Jupyter Notebook, users can run shell commands for Unix and Windows. The shell offers a communication interface for talking with the computer. The user needs to put ! (an exclamation sign) before running any command:

Extensions for Notebook

Notebook extensions (or nbextensions) add more features compared to basic Jupyter Notebooks. These extensions improve the user's experience and interface. Users can easily select any of the extensions by selecting the NBextensions tab.

To install nbextension in Jupyter Notebook using conda, run the following command:

conda install -c conda-forge jupyter_nbextensions_configurator

To install nbextension in Jupyter Notebook using pip, run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

If you get permission errors on macOS, just run the following command:

pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install --user

All the configurable nbextensions will be shown in a different tab, as shown in the following screenshot:

Now, let's explore a few useful features of Notebook extensions:

Hinterland: This provides an autocompleting menu for each keypress that's made in cells and behaves like PyCharm:

Table of Contents: This extension shows all the headings in the sidebar or navigation menu. It is resizable, draggable, collapsible, and dockable:

Execute Time: This extension shows when the cells were executed and how much time it will take to complete the cell code:

Spellchecker: Spellchecker checks and verifies the spellings that are written in each cell and highlights any incorrectly written words.

Variable Selector: This extension keeps track of the user's workspace. It shows the names of all the variables that the user created, along with their type, size, shape, and value.

Slideshow: Notebook results can be communicated via Slideshow. This is a great tool for telling stories. Users can easily convert Jupyter Notebooks into slides without the use of PowerPoint. As shown in the following screenshot, Slideshow can be started using the Slideshow option in the cell toolbar of the view menu:

Jupyter Notebook also allows you to show or hide any cell in Slideshow. After adding the Slideshow option to the cell toolbar of the view menu, you can use a Slide Type drop-down list in each cell and select various options, as shown in the following screenshot:

Embedding PDF documents: Jupyter Notebook users can easily add PDF documents. The following syntax needs to be run for PDf documents:

from IPython.display import IFrame
IFrame('https://arxiv.org/pdf/1811.02141.pdf', width=700, height=400)

This results in the following output:

Embedding Youtube Videos: Jupyter Notebook users can easily add YouTube videos. The following syntax needs to be run for adding YouTube videos:

from IPython.display import YouTubeVideo
YouTubeVideo('ukzFI9rgwfU', width=700, height=400)

This results in the following output:

With that, you now understand data analysis, the process that's undertaken by it, and the roles that it entails. You have also learned how to install Python and use Jupyter Lab and Jupyter Notebook. You will learn more about various Python libraries and data analysis techniques in the upcoming chapters.

Filter reviews by

All

Packt verified reviews

Amazon verified reviews

Sameet Feb 24, 2021

I would definitely recommend this book to anyone who would like to learn machine learning with python it starts right from basics to NLP

Amazon Verified review

Rachel Mae Lademora Jan 03, 2024

This book is very informative and well explained in detailed it helped me a lot.

Rob Jul 19, 2021

Great book. Curated to be easy to read most important concepts and tools.

Nithin Apr 08, 2021

Python Data Analysis is a great book for beginners as well as anyone looking to brush up Python skills. The best thing about this book is, the author does not assume the target audience to have any Python knowledge and hence, all the topics are detailed with theory and workable practical code. The book covers in detail on Data Visualization packages and methods and touches on Machine Learning and Linear Regression towards the end.Happy learning!

ER Feb 24, 2021

If you prefer breadth over depth then this book is for you. It covers a thorough array (no pun intended) of material needed to get started in data science, from setting up your computer all the way to implementing ML algorithms at scale. It provides high level background knowledge about the topics it covers; introduces libraries, techniques, and algorithms commonly used in DS; and includes code examples.That being said, this is not the place to go for deep level explanations or details about potential problems and pitfalls or things to look out for. I like this book as an easy to navigate reference where I can quickly find what I am looking for and get a general idea of what I need to do and what my options are.