Hands-On Data Science with R

Getting Started with Data Science and R

“It is a capital mistake to theorise before one has data.”
― Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes

Data, like science, has been ubiquitous the world over since early history. The term data science is not generally taken to literally mean science with data, since without data there would be no science. Rather, it is a specialized field in which data scientists and other practitioners apply advanced computing techniques, usually along with algorithms or predictive analytics, to uncover insights that may be challenging to obtain with traditional methods.

Data science as a distinct subject has been proposed since the early 1960s by pioneers and thought leaders such as Peter Naur, Prof. Jeff Wu, and William Cleveland. Today, we have largely realized the vision that Prof. Wu and others had in mind when the concept first arose: data science as an amalgamation of computing, data mining, and predictive analytics, all leading up to deriving key insights that drive business and growth across the world today.

The driving force behind this has been the rapid, parallel growth of computing capabilities and algorithms. Computing languages have also played a key role in supporting the emergence of data science, primary among them being the statistical language R.

In this introductory chapter, we will cover the following topics:

Introduction to data science and R
Active domains of data science
Solving problems with data science
Using R for data science
Setting up R and RStudio
Our first R program

Introduction to data science

The term, data science, as mentioned earlier, was first proposed in the 1960s and 1970s by Peter Naur. In the late 1990s, Jeff Wu, while at the University of Michigan, Ann Arbor, proposed the term in a formal paper titled Statistics = Data Science?. The paper, which Prof. Wu subsequently presented at the seventh series of P.C. Mahalanobis Lectures at the Indian Statistical Institute in 1998, raised some interesting questions about what an appropriate definition of statistics might be in light of the tasks that a statistician performed beyond numerical calculations.

In the paper, Prof. Wu highlighted the concept of the Statistical Trilogy, consisting of data collection, data modeling and analysis, and problem solving. The later sections of the paper reflected upon future directions, in which Dr. Wu raised the prospects of neural network models for complex, non-linear relationships, the use of cross validation to improve model performance, and data mining of large-scale data, among others. [Source: https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf].

The paper, although written more than 20 years ago, reflects the foresight that a few academicians such as Dr. Wu had at the time, which has been realized almost verbatim as it was proposed back then, both in thought and in practical concepts. A copy of Dr. Wu's paper is available at https://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf.

Key components of data science

The practice of data science requires the application of three distinct disciplines to uncover insights from data. These disciplines are as follows:

Computer science
Predictive analytics
Domain knowledge

The following diagram shows the core components of data science:

Computer science

During the course of performing data science, if large datasets are involved, the practitioner may spend a fair amount of time cleansing and curating the dataset. In fact, it is not uncommon for data scientists to spend the majority of their time preparing data for analysis. A commonly cited rule of thumb is that 80% of the time on a data science project is spent on data management and the remaining 20% on the actual analysis of the data.

While this may sound overly general, the growth of big data, that is, large-scale datasets, usually in the range of terabytes, has meant that it takes considerable time and effort to extract data before the actual analysis takes place. Real-world data is seldom perfect. Issues with real-world data range from missing variables to incorrect entries and other deficiencies. The size of datasets also poses a formidable challenge.

Technologies such as Hadoop, Spark, and NoSQL databases have addressed the needs of the data science community for managing and curating terabytes, if not petabytes, of information. These tools are usually the first step in the overall data science process that precedes the application of algorithms on the datasets using languages such as R, Python and others.

Hence, as a first step, the data scientist generally should be capable of working with datasets using contemporary tools for large-scale data mining. For instance, if the data resides in a Hadoop cluster, the practitioner must be able and willing to perform the work necessary to retrieve and curate the data from the source systems.

Second, once the data has been retrieved and curated, the data scientist should be aware of the requirements of the algorithm from a computational perspective and determine if the system has the necessary resources to efficiently execute these algorithms. For instance, if the algorithms can take advantage of multi-core computing facilities, the practitioner must use the appropriate packages and functions to leverage them. This may mean the difference between getting results in an hour versus requiring an entire day.
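As a minimal sketch of what leveraging multiple cores can look like in R, the following example uses the parallel package that ships with R. The task itself (a bootstrap of the mean), the function name boot_mean, and the cap of two worker processes are assumptions made purely for illustration:

```r
library(parallel)  # included with the standard R distribution

# Hypothetical CPU-bound task: the mean of one bootstrap resample
boot_mean <- function(i, data) mean(sample(data, replace = TRUE))

set.seed(42)
values <- rnorm(1e5)

# Use at most two worker processes (an arbitrary, conservative choice)
n_workers <- max(1L, min(2L, detectCores(), na.rm = TRUE))
cl <- makeCluster(n_workers)
clusterSetRNGStream(cl, 123)  # reproducible random streams on the workers

# Distribute 200 resampling tasks across the workers
res_parallel <- parLapply(cl, 1:200, boot_mean, data = values)
stopCluster(cl)               # always release the workers when done

length(res_parallel)
```

For a task this small, the overhead of starting the workers may outweigh the gain; the benefit typically shows up with long-running computations.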

Last, but not least, the creation of machine learning models will require programming in one or more languages. This in itself demands a level of knowledge and skill in applying algorithms and using appropriate data structures and other computer science concepts.

Predictive analytics (machine learning)

In popular media and literature, predictive analytics is known by various names. The terms are used interchangeably and often depend on personal preferences and interpretations. The terms predictive analytics, machine learning, and statistical learning are, for practical purposes, synonymous and refer to the practice of applying algorithms to data in order to make predictions.

The algorithm could be as simple as a line of best fit, also known as linear regression, which you may have already used in Excel. Or it could be a complex deep learning model that implements multiple hidden layers and inputs. In both cases, the mere fact that a statistical model or algorithm was applied to generate a prediction qualifies the usage as a practice of machine learning.
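To make the simpler end of this spectrum concrete, here is a small sketch of a line of best fit in R. It uses the cars dataset that is bundled with R; the two speeds used for prediction are arbitrary values chosen for illustration:

```r
# cars is a built-in dataset: speed (mph) and stopping distance (ft)
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line
coef(fit)

# Use the fitted model to predict stopping distance at new speeds
new_speeds <- data.frame(speed = c(10, 20))
pred <- predict(fit, newdata = new_speeds)
pred
```

Even this two-line model follows the pattern described above: a model is fitted to data and then used to generate predictions.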

In general, creating a machine learning model involves a sequence of steps such as the following:

Cleanse and curate the dataset to extract the cohort on which the model will be built.
Analyze the data using descriptive statistics, for example, distributions and visualizations.
Perform feature engineering, preprocessing, and other steps necessary to add or remove variables/predictors.
Split the data into a train and test set (for example, set aside 80% of the data for training and the remaining 20% for testing your model).
Select appropriate machine learning models and create the model using cross validation.
Select the final model after assessing the performance across models on a given (one or more) cost metric. Note that the model could be an ensemble, that is, a combination of more than one model.
Perform predictions on the test dataset.
Deliver the final model.
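The core of these steps (splitting, cross validation, model selection, and prediction on the test set) can be sketched in base R as follows. This is only an illustration under stated assumptions: it uses the built-in mtcars dataset, two hypothetical candidate linear models, 5 folds, and RMSE as the cost metric, none of which are prescribed by the steps above:

```r
set.seed(1)

# Split the data: 80% for training, 20% held out for testing
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Candidate models (hypothetical choices for illustration)
formulas <- list(
  simple = mpg ~ wt,
  larger = mpg ~ wt + hp + qsec
)

# 5-fold cross validation on the training set, using RMSE as the cost metric
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))
cv_rmse <- sapply(formulas, function(f) {
  errs <- sapply(1:k, function(j) {
    fit  <- lm(f, data = train[folds != j, ])
    pred <- predict(fit, newdata = train[folds == j, ])
    sqrt(mean((train$mpg[folds == j] - pred)^2))
  })
  mean(errs)
})

# Select the model with the lowest cross-validated error and refit it
best <- names(which.min(cv_rmse))
final_fit <- lm(formulas[[best]], data = train)

# Perform predictions on the held-out test set
test_pred <- predict(final_fit, newdata = test)
test_rmse <- sqrt(mean((test$mpg - test_pred)^2))
```

Note that the test set is touched only once, after the model has been selected, so that test_rmse remains an honest estimate of performance on unseen data.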

The most commonly used languages for machine learning today are R and Python. In Python, the most popular package for machine learning is scikit-learn (http://scikit-learn.org), while in R, there are multiple packages, such as randomForest, gbm (Gradient Boosting Machine), kernlab (which includes Support Vector Machines (SVMs)), and others.

Although Python's scikit-learn is extremely versatile and elaborate, and Python is in fact often the preferred language in production settings, the ease of use and diversity of packages in R give it an advantage in terms of early adoption and use for machine learning exercises.

The Comprehensive R Archive Network (CRAN) has a task view page titled CRAN Task View: Machine Learning & Statistical Learning (https://cran.r-project.org/web/views/MachineLearning.html) that summarizes some of the key packages in use today for machine learning using R.

Popular machine learning tools such as TensorFlow from Google (https://www.tensorflow.org), XGBoost (http://xgboost.readthedocs.io/en/latest/), and H2O (https://www.h2o.ai) have also released packages that act as a wrapper to the underlying machine learning algorithms implemented in the respective tools.

It is a common misconception that machine learning is just about creating models. While that is indeed the end goal, there is a subtle yet fundamental difference between a model and a good model. With the functions available today, it is relatively easy for anyone to create a model by simply running a couple of lines of code. A good model has business value, while a model built without the rigor of formal machine learning principles is practically unusable for all intents and purposes. A key requirement of a good machine learning model is the judicious use of domain expertise to evaluate results, identify errors, analyze them, and further refine using the insights that subject matter experts can provide. This is where domain knowledge plays a crucial and indispensable role.

Domain knowledge

More often than data scientists would like to admit, machine learning models produce results that are obvious and intuitive. For instance, we once conducted an elaborate analysis of physicians' prescribing behavior to find out the strongest predictor of how many prescriptions a physician would write in the next quarter. We used a broad set of input variables such as the physicians' locations, their specialties, hospital affiliations, prescribing history, and other data. In the end, the best performing model produced a result that we all knew very well. The strongest predictor of how many prescriptions a physician would write in the next quarter was the number of prescriptions the physician had written in the previous quarter! To filter out the truly meaningful variables and build a more robust model, we eventually had to engage someone who had extensive experience of working in the pharma industry. Machine learning models work best when produced in a hybrid approach, one that combines domain expertise with the sophistication of the models developed.

Active domains of data science

Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries. The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook that rose rapidly and earned valuations of billions of dollars in a very short span of time.

Data generated by social media networks such as Facebook and Twitter, search engines such as Google and Yahoo!, and various other networks, such as Pinterest and Instagram, led to a deluge of information about the personal tastes, preferences, and habits of individuals. Companies leveraged the information using various machine learning techniques to gain insights.

For example, Natural Language Processing (NLP) is a machine learning technique used to analyse textual data, such as comments posted on public forums, to extract users' interests. The users are then shown ads relevant to their interests, generating sales from which companies earn ad revenue. Image recognition algorithms are utilized to automatically identify objects in an image and serve the relevant images when users search for those objects on search engines.

The use of data science as a means to increase not only user engagement but also revenue has become a widespread phenomenon. Some of the domains in which data science is prevalent are given as follows. The list is not all-inclusive, but highlights some of the key industries in which data science plays an important role today:

A few of these domains are discussed in the following sections.

Finance

Data science has been used in finance, especially in trading, for many decades. Investment banks, especially trading desks, have employed complex models to analyse and make trading decisions. Some examples of data science as used in finance include:

Credit risk management: Analyzing the creditworthiness of a user by examining the historical financial records, assets, and transactions of the user
Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the loan and applicant's characteristics
Market Basket Analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities

Healthcare

Healthcare and related fields, such as pharmaceuticals and life sciences, have also seen a gradual rise in the adoption and use of machine learning. A leading example has been IBM Watson. Developed in the late 2000s, IBM Watson rose to popularity after it won Jeopardy!, a popular quiz show in the US, in 2011. Today, IBM Watson is being used for clinical research and several institutions have published preliminary results of success. (Source: http://www.ascopost.com/issues/june-25-2017/how-watson-for-oncology-is-advancing-personalized-patient-care/). The primary impediment to wider adoption has been the extremely high cost of using the system, usually with an uncertain return on investment. Generally, only companies that are well capitalized can invest in the technology.

More common uses of data science in healthcare include:

Epidemiology: Preventing the spread of diseases and other epidemiology-related use cases are being addressed with various machine learning techniques. A recent example, the use of clustering to detect the Ebola outbreak, received attention as one of the first times that machine learning was used very effectively in a medical use case. (Source: https://spectrum.ieee.org/tech-talk/biomedical/diagnostics/healthmap-algorithm-ebola-outbreak).
Health insurance fraud detection: The health insurance industry loses billions each year in the US due to fraudulent claims for insurance. Machine learning and, more generally, data science are being used to detect cases of fraud and reduce the loss incurred by leading health insurance firms. (Source: https://www.sciencedirect.com/science/article/pii/S1877042812036099).
Recommender engines: Algorithms that match patients with physicians are used to provide recommendations based on the patients' symptoms and doctor specialties.
Image recognition: Arguably, the most common use of data science in healthcare, image recognition algorithms are used for a variety of cases ranging from segmentation of malignant and non-malignant tumours to cell segmentation. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159221/).

Pharmaceuticals

Although closely linked to the data science use cases in healthcare, data science use cases in pharma are geared toward the development of drugs, physician marketing, and treatment-related analysis. Examples of data science in pharma include the following:

Patient journey and treatment pathways: Understanding the progression of diseases in patients and treatment or therapy outcomes is one of the prime examples of data science in pharma. Several companies have engaged in deep studies related to the development of such tools to understand not only the efficacy of drugs, but also how to best position and market their products. (Source: https://kx.com/blog/use-case-rxdatascience-patient-journey-app/).
Sales field messaging: Using NLP, pharma companies analyse discussions between sales representatives and physicians during sales visits to improve their messaging content and better inform physicians on the potential risks and benefits of medications as needed. (Source: https://www.aktana.com/blog/field-sales/power-personalization-using-advanced-machine-learning-drive-rep-engagement/).
Biomarker analysis: Machine learning for identifying biomarkers and their importance and/or relevance to diseases is used in clinical research such as cancer-related studies. (Source: https://www.futuremedicine.com/doi/abs/10.2217/pme.15.5?journalCode=pme).
Research and development: The use of machine learning for identifying small and large molecules that treat diseases is another common application of data science in pharma. It is a challenging task and several large pharma companies have engaged teams to solve such use cases. (Source: https://www.kaggle.com/c/MerckActivity).

Government

Data science is used by state and national governments for a wide range of uses. These include topics in cyber security, voter benefits, climate change, social causes, and other similar use cases that are geared toward public policy and public benefits.

Some examples include the following:

Climate change: Climate change is one of the most popular topics in this area, and extensive machine learning related work is being conducted around the globe to detect and understand its causes. (Source: https://toolkit.climate.gov).
Cyber security: The use of extremely advanced machine learning techniques for national cyber security has been well known all over the world ever since such practices were disclosed by consultants at security firms a few years back. Security-related organizations employ some of the most advanced hardware and software stacks for detecting cyber threats and preventing hacking attempts. (Source: https://www.csoonline.com/article/2942083/big-data-security/cybersecurity-is-the-killer-app-for-big-data-analytics.html).
Social causes: The use of data science for a wide range of use cases geared toward social good is well known due to several conferences that have been organized and papers that have been released on the topic. Examples include topics in urban analytics, power grids utilizing smart meters, and criminal justice. (Source: https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/agenda/).

Manufacturing and retail

The manufacturing and retail industry has used data science to design better products, optimize pricing, and design strategic marketing techniques. Some examples include the following:

Price optimization: Generally related to the realm of linear programming, the challenge of price optimization, that is, pricing products, is now also being addressed with the help of machine learning. Market conditions, user preferences, and other factors are used as inputs to assess the optimal, dynamic pricing of products. (Source: https://www.datasciencecentral.com/profiles/blogs/price-optimisation-using-decision-tree-regression-tree).
Retail sales: Retailers use algorithms to determine future sales forecasts, price discounts, and promotion sequences. (Source: http://www.oliverwyman.com/our-expertise/insights/2017/feb/machine-learning-for-retail.html).
Production capacity and maintenance: In manufacturing, data science is being used to determine device maintenance requirements, assess equipment effectiveness, optimize production lines, and much more. Overall supply chain management is an area that has benefited and continues to profit from the smart use of machine learning. (Source: https://www.forbes.com/sites/louiscolumbus/2016/06/26/10-ways-machine-learning-is-revolutionizing-manufacturing/#51d4927228c2).

Web industry

One of the earliest beneficiaries of data science was the web industry. Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behavior and generate targeted ads. Google, one of the earliest proponents of targeted ad marketing, today earns most of its revenue from ads, more than $95 billion in 2017. (Source: https://www.statista.com/statistics/266249/advertising-revenue-of-google/). The use of data science for web-related businesses is ubiquitous today and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, not only generating huge profits but also adding millions of new jobs directly or indirectly as a result.

Targeted ads: Click-through ads have been one of the prime areas of machine learning. By reading cookies saved on users' computers from various sites, other sites can assess the users' interests and accordingly decide which ads to serve when they visit new sites. As per online sources, the value of internet advertising is over $1 trillion, and it generated over 10 million jobs in 2017 alone. (Source: https://www.iab.com/insights/economic-value-advertising-supported-internet-ecosystem/).
Recommender engines: Netflix, Pandora, and other movie and audio streaming services utilize recommender engines to understand which movies or music the viewer or listener would be interested in and make recommendations. The recommendations are often based on what other users with similar tastes might have already seen and leverage recommender algorithms such as collaborative, content-based, and hybrid filtering.
Web design: Using A/B testing, mouse tracking, and other sophisticated techniques, web developers leverage data science to design better web pages, such as landing pages, and websites in general. A/B testing, for instance, allows developers to decide between different versions of the same web page and deploy accordingly.
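As a rough illustration of how the outcome of an A/B test might be evaluated in R, the following sketch compares the conversion rates of two versions of a page with a two-sample test of proportions. The visitor and conversion counts are made-up numbers used only for illustration, and a real experiment would also need attention to sample size and test duration:

```r
# Hypothetical results: visitors shown each version and how many converted
visitors    <- c(A = 2400, B = 2380)
conversions <- c(A = 120,  B = 150)

# Two-sample test for equality of proportions (part of base R's stats package)
ab_test <- prop.test(x = conversions, n = visitors)

ab_test$estimate  # observed conversion rate for each version
ab_test$p.value   # a small p-value suggests the rates genuinely differ
```

If the p-value falls below a threshold chosen in advance (0.05 is a common convention), the developer would typically deploy the better performing version.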

Other industries

Various other industries benefit from data science today. Its use has become so common that it would be impractical to list them all, but at a high level, some of the others include the following:

Oil and natural gas for oil production
Meteorology for understanding weather patterns
Space research for detecting and/or analyzing stars and galaxies
Utilities for energy production and energy savings
Biotechnology for research and finding new cures for diseases

In general, since data science and machine learning algorithms are not specific to any particular industry, it is entirely possible to apply algorithms to creative use cases and derive business benefits.

Solving problems with data science

Data science is being used today to solve problems ranging from poverty alleviation to scientific research. It has emerged as a leading discipline that aims to disrupt the status quo across industries and provide new approaches to pressing business issues.

However, while the promise of data science and machine learning is immense, it is important to bear in mind that it takes time and effort to realize the benefits. The return on investment of a machine learning project typically takes a fairly long time to materialize. It is thus essential not to overestimate the value it can bring in the short run.

A typical data science project in a corporate setting would require the collaborative efforts of various groups, both on the technical and the business side. Generally, this means that the project should have a business sponsor and a technical or analytics lead in addition to the data science team or data scientist. It is important to set expectations at the outset, both in terms of the time it would take to complete the project and the outcome, which may be uncertain until the task is completed. Unlike other projects that may have a definite goal, it is not possible to predetermine the outcome of machine learning projects.

Some common questions to ask include the following:

What business value does the data science project bring to the organization?
Does it have a critical base of users, that is, would multiple users benefit from the expected outcome of the project?
How long would it take to complete the project and are all the business stakeholders aware of the timeline?
Have the project stakeholders taken all variables that may affect the timeline into account? Projects can often get delayed due to dependencies on external vendors.
Have we considered all other potential business use cases and made an assessment of what approach would have an optimal chance of success?

A few salient points for successful data science projects are given as follows:

Find projects or use cases related to business operations that are:
- Challenging
- Not necessarily complex, that is, they can be simple tasks that add business value
- Intuitive, easily understood (you can explain it to friends and family)
- Takes effort to accomplish today or requires a lot of manual effort
- Used frequently by a range of users and the benefits of the outcome would have executive visibility
Identify low difficulty–high value (shorter) versus high difficulty–high value (longer)
Educate business sponsors, share ideas, show enthusiasm (it's like a long job interview)
Score early wins on low difficulty–high value; create minimum viable solutions, get management buy-in before enhancing them (takes time)
Early wins act as a catalyst to foster executive confidence and make it easier to justify budgets and move on to high difficulty–high value tasks

Using R for data science

As one of the oldest and most mature languages for statistical computing, R has been used by statisticians all over the world for over 20 years. The precursor to R was the S programming language, written by John Chambers in 1976 at Bell Labs. R, named after the first initials of its developers, Ross Ihaka and Robert Gentleman, was implemented as an open source equivalent to S while they were at the University of Auckland.

The language has gained immensely in popularity since the early 2000s, averaging between 20% and 30% growth on a year-on-year basis:

The growth of R packages

In 2018, there were more than 12,000 R packages, up from about 7,500 just 3 years before, in 2015.

A few key features of R make it not only very easy to learn, but also very versatile due to the number of available packages.

Key features of R

The key features of R are as follows:

Data mining: The R package, data.table, developed by Dowle and Srinivasan, is arguably one of the most sophisticated packages for data mining in any language and provides R users with the ability to query millions, if not billions, of rows of data. In addition, there is tibble, an alternative to data.frame developed by Hadley Wickham. Other packages from Wickham include plyr and dplyr for data manipulation and ggplot2 for visualization.
Visualizations: The ggplot2 package is the most commonly used visualization package in R. Packages such as rCharts and htmlwidgets have also become extremely popular in recent years. Most of these packages allow R users to leverage elegant graphics features commonly found in JavaScript packages such as D3. Many of them act as wrappers for popular JavaScript visualization libraries to facilitate the creation of graphics elements in R.
Data science: R has had various statistical libraries used for research for many years. With the growth of data science as a popular subject in the public domain, R users have released and further developed both new and existing packages that allow users to deploy complex machine learning algorithms. Examples include randomForest and gbm.
General availability of packages: The 12,000+ packages in R provide coverage for a wide range of projects. These include packages for machine learning, data science, and general purpose needs such as web scraping, cartography, and even fisheries sciences. Due to this rich ecosystem that can cater to the needs of a wide variety of use cases, R has grown rapidly in popularity. Whether you are working with JSON files or trying to solve an obscure machine learning problem, it is very likely that someone in the R community has already developed a package that contains (or can indirectly fulfill) the functionality you need.
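To give a flavor of the data.table package mentioned above, the following sketch filters and aggregates a tiny table. It assumes data.table is already installed, and the country names and values are invented for illustration only:

```r
library(data.table)

# A small, made-up table (hypothetical countries and values)
dt <- data.table(
  country  = c("A", "A", "B", "B"),
  year     = c(1990, 2000, 1990, 2000),
  life_exp = c(60.1, 65.3, 70.2, 74.8)
)

# The general form is dt[i, j, by]: filter rows, compute, group
avg_by_country <- dt[year >= 1990,
                     .(mean_life_exp = mean(life_exp)),
                     by = country]
avg_by_country
```

The same compact syntax applies whether the table has four rows or many millions, which is a large part of the package's appeal.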
Setting up R and RStudio

This book will focus on using R for data science related tasks. The language R, as mentioned, is available as an open source product from http://r-project.org. In addition, we will be installing RStudio, an IDE (a graphical user interface) for writing and running our R code, as well as R Shiny, a platform that allows users to develop elegant dashboards.

The steps to download and install R are as follows:

Go to http://r-project.org and click on the CRAN link (http://cran.r-project.org/mirrors.html):

Select any one of the links in the corresponding page. These are links to CRAN Mirrors, that is, sites that host R packages and R installation files:

Once you select and click on the link, you'll be taken to a page with the links to download R for different operating systems, such as Windows, macOS, and Linux. Select the distribution that you need to start the download process:

This is the R for macOS download page:

This is the R for Windows download page (click on install R for the first time if it is a new installation):

This is the R for Windows download page. Download and install the .exe file for R:

The R for macOS installation process will require you to download the .dmg file. Select the default options for installation if you do not intend to make any changes, such as installing in a different directory:

You will also need to download and install RStudio and R Shiny. RStudio is used as the frontend, which you'll use to develop your R code. Strictly speaking, it is not necessary to use RStudio to write code in R, as you can launch the R console from the desktop (Windows), but RStudio has a nicer, more user-friendly interface that makes it easier to code in R.

Download RStudio and R Shiny from https://www.rstudio.com:

Click on Products in the top menu and select RStudio to download and install the software.

Download the open source version of RStudio. Note that there are other versions which are paid commercial versions of the software. For our exercise, we'll be using the open source version only. Download it from https://www.rstudio.com/products/rstudio/download/:

Once you have installed RStudio, launch the application. This will bring up the following screen. There are four panels in RStudio. The first three are shown when you first launch RStudio:

Click on File | New File | R Script. This will open a new panel. This is the section where you'll be writing your R code:

RStudio is a very mature interface for developing R code and has been in use for several years. You should familiarize yourself with the different features in RStudio as you'll be using the tool throughout the book.
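A quick way to confirm that R and RStudio are working together is to run a couple of lines from the new script panel. The following is only a suggested sanity check, and the numbers in it are arbitrary:

```r
# Print the version of R that RStudio is using
print(R.version.string)

# A tiny computation to confirm that code runs from the script panel
heights_cm <- c(150, 160, 170)  # arbitrary values for illustration
avg_height <- mean(heights_cm)
print(avg_height)
```

If both lines print a result in the console panel, the installation is ready for the exercises that follow.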

Our first R program

In this section, we will create our first R program for data analysis. We'll use the human development data available from the United Nations Development Programme (UNDP). The initiative produces a Human Development Index (HDI) for each country, which signifies the level of economic development, including general public health, education, and various other societal factors.

Further information on HDI can be found at http://hdr.undp.org/en/content/human-development-index-hdi. The site also hosts an FAQ page that provides short summary explanations of the various characteristics of the program at http://hdr.undp.org/en/faq-page/human-development-index-hdi.

The following diagram from the UN development program's website summarizes the concept at a high level:

UN development index

In this exercise, we will be looking at the life expectancy and expected years of schooling on a per country per year basis starting from 1990 onward. Not all data is available for all countries, due to various geopolitical and other reasons that have made it difficult to obtain data for respective years.

The datasets for the HDI program have been obtained from http://hdr.undp.org/en/data.

In the exercises, the data has been cleaned and formatted to make it easier for the reader to analyse the information, especially given that this is the first chapter of the book. Download the data from the Packt code repository for this book. The steps to complete the exercise are as follows:

Launch RStudio and click on File | New File | R Script.
Save the file as Chapter1.R.

Copy the commands shown in the following script and save.
Install the required packages for this exercise by running the following command. First, copy the command into the code window in RStudio:

install.packages(c("data.table","plotly","ggplot2","psych"))

Then, place your cursor on the line and click on Run:

This will install the respective packages in your system. In case you encounter any errors, search on Google for the cause of the error. There are various online forums, such as Stack Overflow, where you can search for common errors and learn how to fix them. Since errors can depend on the specific configuration of your machine, we cannot identify all of them, but it is very likely that someone else might have experienced the same error conditions.
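If the installation stops partway, it helps to know which packages are still missing before searching for the error. The following is a minimal sketch and not part of the book's script: the helper name missing_packages is hypothetical, and it relies only on base R's requireNamespace().

```r
# Hypothetical helper (not from the book): return the names of the
# requested packages whose namespaces cannot be loaded.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

required <- c("data.table", "plotly", "ggplot2", "psych")
to_install <- missing_packages(required)

if (length(to_install) > 0) {
  message("Still missing: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
} else {
  message("All required packages are installed.")
}
```

Running this after install.packages gives a quick confirmation that each package can actually be loaded, which narrows down which package an error message refers to.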

We have already created the requisite CSV files, and the following code illustrates the entire process of reading in the CSV files and analyzing the data:


# We'll install the following packages:
## data.table: a package for managing & manipulating datasets in R
## plotly: a graphics library that has gained popularity in recent years
## ggplot2: another graphics library that is extremely popular in R
## psych: a tool for psychometry that also includes some very helpful statistical functions

install.packages(c("data.table","plotly","ggplot2","psych"))

# Load the libraries
# This is necessary if you will be using functionality that is available
# outside the functions already included as part of standard R

library(data.table)
library(plotly)
library(ggplot2)
library(psych)
library(RColorBrewer)

# In R, packages contain multiple functions, and once a package has been loaded,
# its functions become available in your workspace
# To find more information about a function, type ?function_name at the R console
# Note that you should replace function_name with the name of the actual function
# This will bring up the relevant help notes for the function
# Note that the "R Console" is the interactive screen in RStudio where commands are executed

# Read in Human Development Index File
hdi <- fread("ch1_hdi.csv", header=TRUE) # The command fread can be used to read in a CSV file

# View contents of hdi
head(hdi) # View the top few rows of the data table hdi

The output of the preceding code is as follows:

Read the life expectancy file by using the following code:

life <- fread("ch1_life_exp.csv", header=TRUE)

# View contents of life
head(life)

The output of the preceding code is as follows:

Read the years of schooling file by using the following code:

# Read Years of Schooling File
school <- fread("ch1_schoolyrs.csv", header=TRUE)

# View contents of school
head(school)

The output of the preceding code is as follows:

Now we will read the country information:

iso <- fread("ch1_iso.csv")

# View contents of iso
head(iso)

The following is the output of the previous code:

Next, we reshape the hdi, life, and school tables and merge them by using the following code:

# Use melt.data.table to change hdi into a long table format

hdi <- melt.data.table(hdi, id.vars=1, measure.vars=2:ncol(hdi))

# Set the names of the columns of hdi
setnames(hdi,c("Country","Year","HDI"))

# Process the life table
# Use melt.data.table to change life into a long table format
life <- melt.data.table(life, id.vars=1, measure.vars=2:ncol(life))
# Set the names of the columns of life
setnames(life,c("Country","Year","LifeExp"))

# Process the school table
# Use melt.data.table to change school into a long table format
school <- melt.data.table(school, id.vars=1, measure.vars=2:ncol(school))
# Set the names of the columns of school
setnames(school,c("Country","Year","SchoolYrs"))

# Merge hdi, life, and school along the Country and Year columns
merged <- merge(merge(hdi, life, by=c("Country","Year")),
  school, by=c("Country","Year"))

# Add the Region attribute to the merged table using the iso file
# This can be done using the merge function
# Type in ?merge in your R console 
merged <- merge(merged, iso, by="Country")
merged$Info <- with(merged, paste(Country,Year,"HDI:",HDI,"LifeExp:",LifeExp,"SchoolYrs:",
  SchoolYrs,sep=" "))

# Use View to open the dataset in a different tab
# Close the tab to return to the code screen
View(head(merged))

The output of the preceding code is as follows:
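The reshaping and merging steps can also be tried on a toy table. The following is a minimal sketch with made-up numbers (not UN data); the names wide, long, and regions are illustrative only. It shows two details worth knowing: melt() stores the former column names as a factor by default, and merge() keeps only rows with a match in both tables unless all.x = TRUE is set.

```r
library(data.table)

# Toy wide table (made-up numbers): one row per country, one column per
# year, mirroring the layout of the chapter's CSV files.
wide <- fread(text = "Country,1990,1991
A,0.50,0.52
B,0.70,NA", header = TRUE)

# Reshape to long format: one row per Country-Year pair
long <- melt(wide, id.vars = "Country", measure.vars = c("1990", "1991"))
setnames(long, c("Country", "Year", "HDI"))

# The Year column is a factor after melt(); convert it via character
# to obtain integer years
long[, Year := as.integer(as.character(Year))]

# Toy lookup table that deliberately omits country B
regions <- data.table(Country = "A", Region = "X")

# Default merge: only countries present in both tables are kept
inner <- merge(long, regions, by = "Country")

# all.x = TRUE keeps every row of long; Region is NA for country B
left <- merge(long, regions, by = "Country", all.x = TRUE)

print(inner)
print(left)
```

If a country in the real data is spelled differently in the iso file, the default merge would silently drop it, so comparing row counts before and after the merge is a cheap sanity check.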

Here is the code for finding summary statistics for each country:


mergedDataSummary <- 
  describeBy(merged[,c("HDI","LifeExp","SchoolYrs")],
  group=merged$Country, na.rm = TRUE, IQR = TRUE)


# Which countries are available in the mergedDataSummary object?
names(mergedDataSummary)
# Enter any country name here to view the summary information
mergedDataSummary["Cuba"]

The output is as follows:

Use ggplot2 to view density charts and boxplots:

ggplot(merged, aes(x=LifeExp, fill=Region)) + geom_density(alpha=0.25)

The output is as follows:

Now we will see what the result is for geom_boxplot:


ggplot(merged, aes(x=Region, y=LifeExp, fill=Region)) + geom_boxplot()

The output is as follows:

Create an animated chart using plot_ly:

# Reference: https://plot.ly/r/animations/
p <- merged %>%
  plot_ly(
    x = ~SchoolYrs, 
    y = ~LifeExp, 
    color = ~Region, 
    frame = ~Year, 
    text = ~Info,
    size = ~LifeExp,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) %>%
  layout(
    xaxis = list(
      type = "log"
    )
  ) %>% 
  animation_opts(
    150, easing = "elastic", redraw = FALSE
  )

# View plot
p

The output is as follows:

Create a summary table with the average of SchoolYrs and LifeExp by Region and Year by using the following code:


# na.rm = TRUE skips missing values when computing both averages
mergedSummary <- merged[, .(AvgSchoolYrs = round(mean(SchoolYrs, na.rm = TRUE), 2),
                            AvgLifeExp = round(mean(LifeExp, na.rm = TRUE), 2)),
                        by = c("Year", "Region")]
mergedSummary$Info <- with(mergedSummary,
  paste(Region, Year, "AvgLifeExp:", AvgLifeExp, "AvgSchoolYrs:",
  AvgSchoolYrs, sep = " "))


# Create an animated plot similar to the prior diagram
# Reference: https://plot.ly/r/animations/
ps <- mergedSummary %>%
  plot_ly(
    x = ~AvgSchoolYrs, 
    y = ~AvgLifeExp, 
    color = ~Region, 
    frame = ~Year, 
    text = ~Info,
    size=~AvgSchoolYrs,
    opacity=0.75,
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
    ) %>%
  layout(title = 'Average Life Expectancy vs Average School Years (1990-2015)',
         xaxis = list(title="Average School Years"),
         yaxis = list(title="Average Life Expectancy"),
         showlegend = FALSE)
# View plot
ps
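The summary table used for this chart relies on data.table's grouped aggregation syntax, where .() builds the output columns and by defines the groups. The following is a minimal sketch on made-up values (the table toy and its numbers are illustrative, not UN data) showing how a single missing value affects a group mean unless na.rm = TRUE is passed to mean().

```r
library(data.table)

# Toy data (made-up values): two regions in one year, one missing entry
toy <- data.table(
  Region    = c("X", "X", "Y", "Y"),
  Year      = c(1990, 1990, 1990, 1990),
  LifeExp   = c(60, 70, 80, NA),
  SchoolYrs = c(8, 10, 12, 14)
)

# Group mean without na.rm: any NA in a group makes that group's mean NA
with_na <- toy[, .(AvgLifeExp = mean(LifeExp)), by = c("Year", "Region")]

# Group mean with na.rm = TRUE: missing values are skipped
without_na <- toy[, .(AvgLifeExp = mean(LifeExp, na.rm = TRUE)),
                  by = c("Year", "Region")]

print(with_na)
print(without_na)
```

In the first result, region Y has no average because one of its values is missing; in the second, it is averaged over the available value only.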