About the Book
From banking and manufacturing to education and entertainment, applying data science to business problems has revolutionized almost every sector of the modern world. It has an important role to play in everything from app development to network security.
Taking an interactive approach to learning the fundamentals, this book is ideal for beginners. You'll learn all the best practices and techniques for applying data science in the context of real-world scenarios and examples.
Starting with an introduction to data science and machine learning, you'll get to grips with Jupyter's functionality and features. You'll use Python libraries like scikit-learn, pandas, Matplotlib, and Seaborn to perform data analysis and data preprocessing on real-world datasets from within your own Jupyter environment. Progressing through the chapters, you'll train classification models using scikit-learn, and assess model performance using advanced validation techniques. Towards the end, you'll use Jupyter Notebooks to document your research, build stakeholder reports, and even analyze web performance data.
By the end of The Applied Data Science Workshop, Second Edition, you'll be prepared to progress from being a beginner to taking your skills to the next level by confidently applying data science techniques and tools to real-world projects.
Audience
If you are an aspiring data scientist who wants to build a career in data science or a developer who wants to explore the applications of data science from scratch and analyze data in Jupyter using Python libraries, then this book is for you. Although a basic understanding of Python programming and machine learning is recommended to help you grasp the topics covered in the book more quickly, it is not mandatory.
About the Chapters
Chapter 1, Introduction to Jupyter Notebooks, will get you started by explaining how to use the Jupyter Notebook and JupyterLab platforms. After going over the basics, we will discuss some fantastic features of Jupyter, which include tab completion, magic functions, and new additions to the JupyterLab interface. Finally, we will look at the Python libraries we'll be using in this book, such as pandas, seaborn, and scikit-learn.
Chapter 2, Data Exploration with Jupyter, is focused on exploratory analysis in a live Jupyter Notebook environment. Here, you will use visualizations such as scatter plots, histograms, and violin plots to deepen your understanding of the data. We will also walk through some simple modeling problems with scikit-learn.
Chapter 3, Preparing Data for Predictive Modeling, will enable you to plan a machine learning strategy and assess whether or not data is suitable for modeling. In addition to this, you'll learn about the process involved in preparing data for machine learning algorithms, and apply this process to sample datasets using pandas.
Chapter 4, Training Classification Models, will introduce classification algorithms such as SVMs, KNNs, and Random Forests. Using a real-world Human Resources analytics dataset, we'll train and compare models that predict whether an employee will leave their company. You'll learn about training models with scikit-learn and use decision boundary plots to see what overfitting looks like.
Chapter 5, Model Validation and Optimization, will give you hands-on experience with model testing and model selection concepts, including k-fold cross-validation and validation curves. Using these techniques, you'll learn how to optimize model parameters and compare model performance reliably. You will also learn how to implement dimensionality reduction techniques such as Principal Component Analysis (PCA).
Chapter 6, Web Scraping with Jupyter Notebooks, will focus on data acquisition from online sources such as web pages and APIs. You will see how data can be downloaded from the web using HTTP requests and HTML parsing. After collecting data in this way, you'll also revisit concepts learned in earlier chapters, such as data processing, analysis, visualization, and modeling.
Conventions
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"It's recommended to install some of these (such as mlxtend
, watermark
, and graphviz
) ahead of time if you have access to an internet connection now. This can be done by opening a new Terminal window and running the pip
or conda
commands."
Words that you see on the screen (for example, in menus or dialog boxes) appear in the same format.
A block of code is set as follows:
# https://github.com/rasbt/mlxtend
pip install mlxtend
New terms and important words are shown like this:
"The focus of this chapter is to introduce Jupyter Notebooks—the data science tool that we will be using throughout the book."
Code Presentation
Lines of code that span multiple lines are split using a backslash (\). When the code is executed, Python will ignore the backslash and treat the code on the next line as a direct continuation of the current line.
For example:
history = model.fit(X, y, epochs=100, batch_size=5, verbose=1, \
                    validation_split=0.2, shuffle=False)
Comments are added into code to help explain specific bits of logic. Single-line comments are denoted using the # symbol, as follows:
# Print the sizes of the dataset
print("Number of Examples in the Dataset = ", X.shape[0])
print("Number of Features for each example = ", X.shape[1])
Multi-line comments are enclosed by triple quotes, as shown below:
""" Define a seed for the random number generator to ensure the result will be reproducible """ seed = 1 np.random.seed(seed) random.set_seed(seed)
Setting up Your Environment
Before we explore the book in detail, we need to set up specific software and tools. In the following section, we shall see how to do that.
Installing Python
The easiest way to get up and running with this workshop is to install the Anaconda Python distribution. This can be done as follows:
- Navigate to the Anaconda downloads page at https://www.anaconda.com/.
- Download the most recent Python 3 distribution for your operating system (at the time of writing, the latest stable version is Python 3.7).
- Open and run the installation package. If prompted, select yes for the option to Register Anaconda as my default Python. Once the installation completes, you can verify your setup as shown below.
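Once Anaconda is installed, it can be helpful to confirm that Python and the core libraries used in this book are available. The following snippet is a minimal sketch of our own (it is not part of the official Anaconda instructions) and assumes the default Anaconda environment; run it in a Python session or a Jupyter cell started from that environment:
# Confirm that the core libraries used in this book can be imported,
# and print their versions along with the Python interpreter version
import sys
import pandas, sklearn, matplotlib, seaborn
print("Python:", sys.version)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("matplotlib:", matplotlib.__version__)
print("seaborn:", seaborn.__version__)
If any of these imports fail, the missing package can be installed with the pip or conda commands, as mentioned earlier in the Conventions section.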