About the Book
Understanding and finding patterns in data has become one of the most important ways to improve business decisions. If you know the basics of SQL, but don't know how to use it to gain business insights from data, this book is for you.
SQL for Data Analytics covers everything you need to progress from simply knowing basic SQL to telling stories and identifying trends in data. You'll be able to start exploring your data by identifying patterns and unlocking deeper insights. You'll also gain experience working with different types of data in SQL, including time series, geospatial, and text data. Finally, you'll learn how to become productive with SQL with the help of profiling and automation to gain insights faster.
By the end of the book, you'll able to use SQL in everyday business scenarios efficiently and look at data with the critical eye of an analytics professional.
About the Authors
Upom Malik is a data scientist who has worked in the technology industry for over 6 years. He has a master's degree in chemical engineering from Cornell University and a bachelor's degree in biochemistry from Duke University. He uses SQL and other tools to solve interesting challenges in finance, energy, and consumer technologies. While working on analytical problems, he has lived out of a suitcase and spent the last year as a digital nomad. Outside of work, he likes to read, hike the trails of the Northeastern United States, and savor ramen bowls from around the world.
Matt Goldwasser is a lead data scientist at T. Rowe Price. He enjoys demystifying data science for business stakeholders and deploying production machine learning solutions. Matt has been using SQL to perform data analytics in the financial industry for the last 8 years. He has a bachelor's degree in mechanical and aerospace engineering from Cornell University. In his spare time, he enjoys teaching his infant son data science.
Benjamin Johnston is a senior data scientist for one of the world's leading data-driven medtech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his PhD in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years' experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.
Learning Objectives
By the end of this book, you will be able to:
- Use SQL to summarize and identify patterns in data
- Apply special SQL clauses and functions to generate descriptive statistics
- Use SQL queries and subqueries to prepare data for analysis
- Perform advanced statistical calculations using the window function
- Analyze special data types in SQL, including geospatial data and time data
- Import and export data using a text file and PostgreSQL
- Debug queries that won't run
- Optimize queries to improve their performance for faster results
Audience
If you're a database engineer looking to transition into analytics, or a backend engineer who wants to develop a deeper understanding of production data, you will find this book useful. This book is also ideal for data scientists or business analysts who want to improve their data analytics skills using SQL. Knowledge of basic SQL and database concepts will aid in understanding the concepts covered in this book.
Approach
SQL for Data Analysis perfectly balances theory and practical exercises and provides a hands-on approach to analyzing data. It focuses on providing practical instruction for both SQL and statistical analysis so that you can better understand your data. The book takes away the crumbs and focuses on being practical. It contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.
Hardware Requirements
For the optimal experience, we recommend the following hardware configuration:
- Processor: Intel Core i5 or equivalent
- Memory: 4 GB of RAM
- Storage: 5 GB of available space
Software Requirements
We also recommend that you have the following software installed in advance:
- OS: Windows 7 SP1 64-bit, Windows 8.1 64-bit, Windows 10 64-bit, Linux (Ubuntu 16.04 or later, Debian, Red Hat, or Suse), or the latest version of macOS
- PostgreSQL 10.9 (https://www.postgresql.org/download/)
- Anaconda Python 3.7 (https://www.anaconda.com/distribution/#download-section)
- Git 2 or later
Conventions
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"It is worth noting here that the formatting can look a little messy for the \copy
command, because it does not allow for commands with new lines. A simple way around this is to create a view containing your data before the \copy
command and then drop the view after your \copy
command has finished."
A block of code is set as follows:
CREATE TEMP VIEW customers_sample AS ( SELECT * FROM customers LIMIT 5 ); \copy customers_sample TO 'my_file.csv' WITH CSV HEADER DROP VIEW customers_sample;
Installation and Setup
Each great journey begins with a humble step, and our upcoming adventure in the land of data wrangling is no exception. Before we can do awesome things with data, we need to be prepared with the most productive environment. In this short section, we shall see how to do that.
Installing PostgreSQL 10.9
Installing on Windows:
Download the PostgreSQL version 10 installer via https://www.postgresql.org/download/windows/ and follow the prompts.
Installing on Linux:
You can install PostgreSQL on Ubuntu or Debian Linux via the command line using:
sudo apt-get install postgresl-11
Installing on macOS:
Download the PostgreSQL version 10 installer via https://www.postgresql.org/download/macosx/ and follow the prompts.
Installing Python
Installing Python on Windows:
- Find your desired version of Python on the official installation page at https://www.anaconda.com/distribution/#windows.
- Ensure you select Python 3.7 from the download page.
- Ensure that you install the correct architecture for your computer system; that is, either 32-bit or 64-bit. You can find out this information in the System Properties window of your OS.
- After you download the installer, simply double-click on the file and follow the user-friendly prompts on-screen.
Installing Python on Linux:
To install Python on Linux, you have a couple of good options:
- Open Command Prompt and verify that
p\Python 3
is not already installed by runningpython3 --version
. - To install Python 3, run this:
sudo apt-get update sudo apt-get install python3.7
- If you encounter problems, there are numerous sources online that can help you troubleshoot the issue.
- Install Anaconda Linux by downloading the installer from https://www.anaconda.com/distribution/#linux and following the instructions.
Installing Python on macOS:
Similar to Linux, you have a couple of methods for installing Python on a Mac. To install Python on macOS X, do the following:
- Open the Terminal for Mac by pressing CMD + Spacebar, type
terminal
in the open search box, and hit Enter. - Install Xcode through the command line by running
xcode-select --install
. - The easiest way to install Python 3 is using Homebrew, which is installed through the command line by running
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
. - Add Homebrew to your
$PATH
environment variable. Open your profile in the command line by runningsudo nano ~/.profile
and insertingexport PATH="/usr/local/opt/python/libexec/bin:$PATH" at the bottom
. - The final step is to install Python. In the command line, run
brew install python
. - Again, you can also install Python via the Anaconda installer available from https://www.anaconda.com/distribution/#macos.
Installing Git
Installing Git on Windows or macOS X:
Git for Windows/Mac can be downloaded and installed via https://git-scm.com/. However, for an improved user experience, it is recommended that you install Git through an advanced client such as GitKraken (https://www.gitkraken.com/).
Installing Git on Linux:
Git can be easily installed via the command line:
sudo apt-get install git
If you prefer a graphical user interface, GitKraken (https://www.gitkraken.com/) is also available for Linux.
Loading the Sample Databases
The vast majority of exercises in this book use a sample database, sqlda
, which contains fabricated data for a fictional electric vehicle company called ZoomZoom. To load the sample datasets, please follow the instructions at: https://github.com/TrainingByPackt/SQL-for-Data-Analytics/blob/master/Datasets/Loading_the_sample_datasets_instructions.pdf.
Running SQL Files
Commands and statements can be executed via a *.sql
file from the command line using the command:
psql < commands.sql
Alternatively, they can be executed via the SQL interpreter:
database=#
Additional Resources
The code bundle for this book is also hosted on GitHub at https://github.com/TrainingByPackt/SQL-for-Data-Analytics. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
You can download the graphic bundle for the book from here: https://github.com/TrainingByPackt/SQL-for-Data-Analytics/blob/master/Graphic%20Bundle/Graphic%20Bundle_ColorImages.pdf.