Jupyter for Data Science

Jupyter for Data Science: Exploratory analysis, statistical modeling, machine learning, and data visualization with Jupyter



Jupyter and Data Science

The Jupyter product was derived from the IPython project. The IPython project was used to provide interactive online access to Python. Over time it became useful to interact with other programming languages, such as R, in the same manner. With this split from only Python, the tool grew into its current manifestation of Jupyter. IPython is still an active tool available for use.

Jupyter is available as a web application for a wide variety of platforms. It can also be used on your desktop/laptop over a wide variety of installations. In this book, we will be exploring using Jupyter from a Windows PC and over the internet for other providers.

Jupyter concepts

Jupyter is organized around a few basic concepts:

  • Notebook: A collection of statements (in a language). For example, this could be a complete R script that loads data, analyzes it, produces a graph, and records results elsewhere.
  • Cell: The smallest unit of a Jupyter Notebook that can be worked with:
    • Current Cell: The cell currently being edited or the one(s) selected
  • Kernel: Each notebook is associated with a specific language implementation. The part of Jupyter that processes the specific language involved is called a kernel.
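
To make these concepts concrete, here is a minimal sketch of how a notebook, its cells, and its kernel fit together in the JSON document stored in an .ipynb file. The layout follows the nbformat 4 structure; the cell contents and the helper name describe are illustrative assumptions, not taken from this book.

```python
import json

# A minimal notebook document in the nbformat 4 layout: a list of cells
# plus metadata naming the kernel that should run them.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {
        "kernelspec": {
            "name": "python3",           # kernel identifier
            "display_name": "Python 3",  # label shown to the user
            "language": "python",
        }
    },
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A heading rendered as formatted text"],
        },
        {
            "cell_type": "code",
            "execution_count": None,  # filled in once the kernel runs the cell
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a code cell')"],
        },
    ],
}

def describe(nb):
    """Return the kernel name and the type of each cell, in order."""
    kernel = nb["metadata"]["kernelspec"]["name"]
    cell_types = [cell["cell_type"] for cell in nb["cells"]]
    return kernel, cell_types

# Round-trip through JSON text, which is what an .ipynb file contains.
text = json.dumps(notebook, indent=1)
print(describe(json.loads(text)))
```

The point of the sketch is only that a notebook is a plain document: the kernel is named in the metadata, and each cell carries its own type and source.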

A first look at the Jupyter user interface

We can jump right in and see what Jupyter has to offer. A Jupyter screen looks like this:

So, Jupyter is deployed as a website that can be accessed on your machine (or can be accessed like any other website across the internet).

We see the URL of the page, http://localhost:8888/tree. localhost is the hostname that refers to your own machine, so the page is being served by a web server running locally (here on port 8888). The website we are accessing on the web server is in a tree display. This is the default display, and it matches the way projects are organized within Jupyter. Jupyter displays objects in a tree layout much like Windows File Explorer. The main page lists a number of projects; each project is its own subdirectory and contains its own further breakdown of content. Depending on where you start Jupyter, the existing contents of the current directory will be included in the display as well.

Detailing the Jupyter tabs

On the web page, we have the soon to be familiar Jupyter logo and three tabs:

  • Files
  • Running
  • Clusters

The Files tab lists the objects available to Jupyter. The files used by Jupyter are stored as regular files on your disk. Jupyter provides context managers that know how to process the different types of files and programs you are using. You can see the Jupyter files when you use Windows Explorer to view your file contents (they have an .ipynb file extension). You can see non-Jupyter files listed in the Jupyter window as well.
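
Because notebooks are ordinary files, they can be found on disk the same way as any other file. The following is a small sketch of that idea; the helper name list_notebooks is hypothetical and is not part of Jupyter.

```python
from pathlib import Path

def list_notebooks(directory="."):
    """Return the names of the .ipynb files directly inside `directory`, sorted."""
    return sorted(p.name for p in Path(directory).glob("*.ipynb"))

# List the notebooks in the directory where this script is run
print(list_notebooks())
```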

The Running tab lists the notebooks that have been started. Jupyter keeps track of which notebooks are running. This tab allows you to control which notebooks are running at any time.

The Clusters tab is for environments where several machines are in use for running Jupyter.

Cluster implementations of Jupyter are a topic worthy of their own, dedicated materials.

What actions can I perform with Jupyter?

Next, we see:

  • A prompt Select items to perform action
  • An Upload button
  • A New pull down menu and
  • A Refresh icon

The prompt tells you that you can select multiple items and then perform the same action on all of them. Most of the following actions (in the menus) can be performed over a single item or a selected set of items.

The Upload button will present a prompt to select a file to upload to Jupyter. This would typically be used to move a data file into the project for access in the case where Jupyter is running as a website in a remote location where you can't just copy the file to the disk where Jupyter is running.

The New pull down menu presents a list of choices of the different kinds of Jupyter projects (kernels) that are available:

We can see the list of objects that Jupyter knows how to create:

  • Text File: Create a text file for use in this folder. For example, if the notebook were to import a file you may create the file using this feature.
  • Folder: Yes, just like in Windows File Explorer.
  • Terminals Unavailable: Grayed out; this feature can be used in a *nix (Unix-like) environment.
  • Notebooks: Grayed out; this is not really a file type, but a heading for the different types of notebooks that this installation knows how to create.
  • Julia 0.4.5: Creates a Julia notebook where the coding is in the Julia language.
  • Python 3: Creates a notebook where the coding is in the Python language. This is the default.
  • R: Creates a notebook where the coding is in the R language.
  • Depending on which kernels you have installed in your installation, you may see other notebook types listed.

What objects can Jupyter manipulate?

If we started one of the notebooks (it would automatically be selected in the Jupyter object list) and now looked at the pulldown of actions against the objects selected we would see a display like the following:

We see that the menu action has changed to Rename, as that is the most likely action to be taken on one file and we have an icon to delete the project as well (the trashcan icon).

The item count is now 1 (we have one object selected in the list), the icon for the one item is a filled in blue square (denoting that it is a running project), and a familiar Home icon to bring us back to the Jupyter home page display in the previous screenshot.

The object's menu has choices for:

  • Folders: select the folders available
  • All Notebooks: select the Jupyter Notebooks
  • Running: select the running Jupyter Notebooks
  • Files: select the files in the directory

If we scroll down in the object display, we see a little different information in the list of objects available. Each of the objects listed has a type (denoted by the icon shape associated) and a name assigned by the user when it was created.

Each of the objects is a Jupyter project that can be accessed, shared, and moved on its own. Every project has a full name, as entered by the user creating the project, and an icon that portrays this entry as a project. We will see other Jupyter icons corresponding to other project components, as follows:

Viewing the Jupyter project display

If we pull down the New menu and select Python 3, Jupyter would create a new Python notebook and move to display its contents. We would see a display like the following:

We have created a new Jupyter Notebook and are in its display. The logo is there. The title defaults to Untitled, which we can change by clicking on it. There is an (autosaved) marker that tells you Jupyter has automatically stored your notebook to disk (and will continue to do so regularly as you work on it).

We now have a menu bar and a denotation that this notebook is using Python 3 as its source language. The menu choices are:

  • File: Standard file operations
  • Edit: For editing cell contents (more to come)
  • View: To change the display of the notebook
  • Insert: To insert a cell in the notebook
  • Cell: To change the format, usage of a cell
  • Kernel: To adjust the kernel used for the notebook
  • Help: To bring up the help system for Jupyter

File menu

The File menu has the following choices:

  • New Notebook: Similar to the pull down from the home page.
  • Open...: Open a notebook.
  • Make a Copy...: Copy a notebook.
  • Rename...: Rename a notebook.
  • Save and Checkpoint: Save the current notebook at a checkpoint. Checkpoints are specific points in a notebook's history that you want to maintain in order to return to a checkpoint if you change your mind about a recent set of changes.
  • Print Preview: Similar to any print preview that you have used otherwise.
  • Download as: Allows you to store the notebook in a variety of formats. The most notable formats would be PDF or HTML, which would allow you to share the notebook with users that do not have access to Jupyter.
  • Trusted Notebook: (The feature is grayed out). When a notebook is opened by a user, the server computes a signature with the user's key, and compares it with the signature stored in the notebook's metadata. If the signature matches, HTML and JavaScript output in the notebook will be trusted at load, otherwise it will be untrusted.
  • Close and Halt: Close the current notebook and stop it running in the Jupyter system.
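
The signature check described under Trusted Notebook can be illustrated with a keyed hash. This is a simplified sketch of the idea only, not Jupyter's actual implementation: the key, the notebook contents, and the function names sign_notebook and is_trusted are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

def sign_notebook(notebook, key):
    """Return a hex HMAC-SHA256 digest of the notebook's JSON content."""
    payload = json.dumps(notebook, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def is_trusted(notebook, stored_signature, key):
    """True when the freshly computed signature matches the stored one."""
    return hmac.compare_digest(sign_notebook(notebook, key), stored_signature)

key = b"per-user secret"  # hypothetical key, for illustration only
nb = {"cells": [{"cell_type": "code", "source": "print(1)"}]}
signature = sign_notebook(nb, key)

print(is_trusted(nb, signature, key))   # unchanged notebook
nb["cells"][0]["source"] = "print(2)"   # the notebook is edited afterwards
print(is_trusted(nb, signature, key))   # the stored signature no longer matches
```

Any change to the content changes the digest, which is why an edited or foreign notebook loads as untrusted until it is signed again.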

Edit menu

The Edit menu has the following choices:

  • Cut Cells: Typical cut operation.
  • Copy Cells: Typical copy operation; copies the selected cells to a memory buffer so they can later be pasted into another location in the notebook.
  • Paste Cells Above: If you have selected a cell and if you have copied a cell, this option will not be grayed out and will paste the buffered cell above the current cell.
  • Paste Cells Below: Similar to the previous option.
  • Delete Cells: Will delete the selected cells.
  • Undo Delete Cells.
  • Split Cell: There is a style issue here, regarding how many statements you put into a cell. Many times, you will start with one cell containing a number of statements and split that cell up many times to break off individual or groups of statements into their own cell.
  • Merge Cell Above: Combine the current cell with the one above it.
  • Merge Cell Below: Similar to the previous option.
  • Move Cell Up: Move the current cell before the one above it.
  • Move Cell Down.
  • Edit Notebook Metadata: For advanced users to modify the notebook's metadata (stored as JSON), such as the kernel and language information Jupyter keeps for your notebook.
  • Find and Replace: Locate specific text within cells and possibly replace.

View menu

The View menu has the following choices:

  • Toggle Header: Toggle the display of the Jupyter header
  • Toggle Toolbar: Toggle the display of the Jupyter toolbar
  • Cell Toolbar: Change the displayed items for the cell being edited:
    • None: Don't display a cell toolbar
    • Edit Metadata: Edit a cell's metadata directly
    • Raw Cell Format: Edit the cell raw format as used by Jupyter
    • Slideshow: Walk through the cells in a slideshow manner

Insert menu

The Insert menu has the following choices:

  • Insert Cell Above: Insert a new, empty cell above the current cell
  • Insert Cell Below: Insert a new, empty cell below the current cell

Cell menu

The Cell menu has the following choices:

  • Run Cells: Runs the selected cells
  • Run Cells and Select Below: Runs the selected cells and then selects the cell below
  • Run Cells and Insert Below: Runs the selected cells and inserts a blank cell below
  • Run All: Runs all of the cells
  • Run All Above: Runs all of the cells above the current
  • Run All Below: Runs all of the cells below the current
  • Cell Type: Changes the type of the selected cell(s) to:
    • Code: This is the default; the cell is expected to contain language statements
    • Markdown: The cell contains Markdown text (which may include HTML), typically used to document the notebook and improve its presentation
    • Raw NBConvert: The cell contents are plain text that is passed through unmodified when the notebook is converted to another format
  • Current Outputs: Whether to toggle, scroll, or clear the outputs of the selected cells
  • All Output: The same choices, applied to the outputs of all cells

Kernel menu

The Kernel menu is used to control the underlying language engine used by the notebook. The menu choices are as follows. I think many of the choices in this menu are used very little:

  • Interrupt: Stops the computation that is currently running; the underlying language engine itself stays alive
  • Restart: Restarts the underlying language engine
  • Restart & Clear Output
  • Restart & Run All
  • Reconnect: Re-establishes the connection between the notebook and its kernel if that connection has been lost
  • Change kernel: Changes the language used in this notebook to one available in your installation

Help menu

The help menu displays the help options for Jupyter and language context choices. For example, in our Python notebook we see choices for common Python libraries that may be used:

Icon toolbar

Just below the regular menu is an icon toolbar with many of the commonly used menu items for faster use, as seen in this view:

The icons correspond to the previous menu choices (listed in order of appearance):

  • File/Save the current notebook
  • Insert cell below
  • Cut current cells
  • Copy the current cells
  • Paste cells below
  • Move selected cells up
  • Move selected cells down
  • Run the selected cells
  • Interrupt the kernel
  • Restart kernel
  • List of formats we can apply to the current cells
  • An icon to open a command palette with descriptive names
  • An icon to open the cell toolbar

How does it look when we execute scripts?

If we were to provide a name for the notebook, enter a simple Python script, and execute the notebook cells, we would see a display like this:

The script is:

name = "Dan Toomey"
state = "MA"
print(name + " lives in " + state)

We assign a value to the name and state variables and then print them out.

If you notice, I have placed the statements into two different cells. This is just for readability. They could all be in the same cell or three different cells.

Each code cell is labeled with an execution number. The numbering starts at 1 for the first cell executed and grows each time a cell is run or re-run, so the labels reflect execution order rather than position (as you can see, the first cell is labeled cell 2 in the display).

Below the second cell, we have non-editable display results. Jupyter always displays any corresponding output of a cell just below. This could include error information as well.
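
The way cells share state and receive execution numbers can be imitated in a few lines. The following is a toy model written for illustration; the class ToyKernel is hypothetical and is not how a real Jupyter kernel is implemented.

```python
import contextlib
import io

class ToyKernel:
    """Runs code strings in one shared namespace, like cells sharing a kernel."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def run_cell(self, source):
        """Execute `source`, returning (execution number, captured output)."""
        self.execution_count += 1
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return self.execution_count, buffer.getvalue()

kernel = ToyKernel()
# First "cell": assign the variables
kernel.run_cell('name = "Dan Toomey"\nstate = "MA"')
# Second "cell": uses the variables defined by the first one
number, output = kernel.run_cell('print(name + " lives in " + state)')
print(number, output)
```

The second cell can see name and state only because both cells run in the same namespace, which is also why the order in which cells are executed matters.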

Industry data science usage

This book is about Jupyter and data science. We have the introduction to Jupyter. Now, we can look at data science practices and then see how the two concepts work together.

Data science is used in many industries. It is interesting to note the predominant technologies involved and algorithms used by industry. We can see the same technologies available within Jupyter.

Some of the industries that are larger users of data science include:

Industry            Larger data science use                        Technology/algorithms
Finance             Hedge funds                                    Python
Gambling            Establish odds                                 R
Insurance           Measure and price risk                         Domino (R)
Retail banking      Risk, customer analytics, product analytics    R
Mining              Smart exploration, yield optimization          Python
Consumer products   Pricing and distribution                       R
Healthcare          Drug discovery and trials                      Python

All of these data science investigations could be done in Jupyter, as the languages used are fully supported.

Real life examples

In this section, we take several examples from current industry practice and run them in Jupyter to demonstrate its utility.

Finance, Python - European call option valuation

There is an example of this at https://www.safaribooksonline.com/library/view/python-for-finance/9781491945360/ch03.html which is taken from the book Python for Finance by Yves Hilpisch. The model used is fairly standard for finance work.

We want to arrive at the theoretical value of a call option. A call option is the right to buy a security, such as IBM stock, at a specific (strike) price within a certain time frame. The option is priced based on the riskiness or volatility of the security in relation to the strike price and current price. The example uses a European option, which can only be exercised at maturity; this simplifies the problem set.

The example uses the Black-Scholes model for option valuation, where we have:

  • Initial stock index level S0 = 100
  • Strike price of the European call option K = 105
  • Time-to-maturity T = 1 year
  • Constant, riskless short rate r = 5%
  • Constant volatility σ  = 20%

These elements make up the following formula:
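
The formula itself is not reproduced here. Reconstructed from the line that computes ST in the script later in this section (so it is an assumption about what the original showed), the index level at maturity is, with z a standard normally distributed random variable:

```latex
S_T = S_0 \exp\left( \left( r - \tfrac{1}{2}\sigma^2 \right) T + \sigma \sqrt{T}\, z \right)
```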

The algorithm used is as follows:

  1. Draw I (pseudo) random numbers from the standard normal distribution.
  2. Calculate all resulting index levels at maturity ST(i) for given z(i) in the previous equation. Calculate all inner values of the option at maturity as hT(i) = max(ST(i) - K,0).
  3. Estimate the option present value via the Monte Carlo estimator given in the following equation:
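
The estimator referred to in step 3 is not reproduced here. Reconstructed from the line that computes C0 in the script (an assumption about the original equation), it is the discounted average of the I simulated payoffs:

```latex
C_0 \approx e^{-rT} \, \frac{1}{I} \sum_{i=1}^{I} h_T(i)
```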

The script is as follows. We use numpy for the intense mathematics used. The rest of the coding is typical:

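import numpy as np

# set parameters
S0 = 100.     # initial index level
K = 105.      # strike price
T = 1.0       # time to maturity in years
r = 0.05      # riskless short rate
sigma = 0.2   # volatility
# how many samples we are using
I = 100000
# seed the generator so the results are reproducible
np.random.seed(103)
z = np.random.standard_normal(I)
# index levels at maturity, one per random draw
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
# inner values (payoffs) of the option at maturity
hT = np.maximum(ST - K, 0)
# Monte Carlo estimator: discounted average payoff
C0 = np.exp(-r * T) * np.sum(hT) / I
# tell user results
print("Value of the European Call Option %5.3f" % C0)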

The results under Jupyter are as shown in the following screenshot:

The 8.071 value is close to the published expected value of 8.019; the difference is due to variance in the random numbers used. (I am seeding the random number generator to have reproducible results.)
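
Because this is a plain European call, a simulated value can also be cross-checked against the closed-form Black-Scholes value. The following is a minimal sketch of that check using only the standard library; the function names norm_cdf and bs_call_value are illustrative and do not come from the book being cited.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_value(S0, K, T, r, sigma):
    """Closed-form Black-Scholes value of a European call option."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Same parameters as the Monte Carlo script
value = bs_call_value(S0=100.0, K=105.0, T=1.0, r=0.05, sigma=0.2)
print("Closed-form value %5.3f" % value)
```

The closed-form value comes out at roughly 8.02, so both simulated figures sit within ordinary sampling error of it.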

Finance, Python - Monte Carlo pricing

Another algorithm in popular use is Monte Carlo simulation. In Monte Carlo, as the name of the gambling resort implies, we simulate a number of chances taken in a scenario where we know the percentage outcomes of the different results, but do not know exactly what will happen in the next N chances. We can see this model being used at http://www.codeandfinance.com/pricing-options-monte-carlo.html. In this example, we are using Black-Scholes again, but in a different direct method where we see individual steps.

The coding is as follows. Note the imports near the top of the code: rather than pulling individual functions from a library, the script imports each module as a whole and calls what it needs through the module name (math.exp, math.sqrt, random.gauss):

import datetime
import math
import random

random.seed(103)  # fixed seed for reproducible results

def generate_asset_price(S, v, r, T):
    # one simulated asset price at time T
    return S * math.exp((r - 0.5 * v**2) * T + v * math.sqrt(T) * random.gauss(0, 1.0))

def call_payoff(S_T, K):
    # a call pays the excess over the strike, or nothing
    return max(0.0, S_T - K)

S = 857.29 # underlying price
v = 0.2076 # vol of 20.76%
r = 0.0014 # rate of 0.14%
T = (datetime.date(2013,9,21) - datetime.date(2013,9,3)).days / 365.0
K = 860.
simulations = 90000
payoffs = []
discount_factor = math.exp(-r * T)
for i in range(simulations):
    S_T = generate_asset_price(S, v, r, T)
    payoffs.append(call_payoff(S_T, K))
price = discount_factor * (sum(payoffs) / float(simulations))
print('Price: %.4f' % price)

The results under Jupyter are shown as follows:

The result price of 14.4452 is close to the published value 14.5069.
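
How close is "close"? One way to judge is to report the standard error of the Monte Carlo estimate alongside the price. The following is a sketch of that idea under the same parameters; the function name mc_call_price and the use of a private random.Random generator are choices made for this illustration, so the exact figures need not match the run above.

```python
import math
import random
import statistics

def mc_call_price(S, K, v, r, T, simulations, seed=103):
    """Return (price estimate, standard error) for a European call."""
    rng = random.Random(seed)  # private generator, so the seed stays local
    discount = math.exp(-r * T)
    payoffs = []
    for _ in range(simulations):
        S_T = S * math.exp((r - 0.5 * v ** 2) * T + v * math.sqrt(T) * rng.gauss(0, 1.0))
        payoffs.append(discount * max(0.0, S_T - K))
    price = statistics.fmean(payoffs)
    # standard error of the mean: sample standard deviation / sqrt(n)
    std_error = statistics.stdev(payoffs) / math.sqrt(simulations)
    return price, std_error

price, std_error = mc_call_price(S=857.29, K=860.0, v=0.2076, r=0.0014,
                                 T=18 / 365.0, simulations=90000)
print("Price: %.4f +/- %.4f" % (price, std_error))
```

If two estimates differ by less than a couple of standard errors, the gap is consistent with random sampling rather than a difference in the model.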

Gambling, R - betting analysis

Some of the gambling games are really coin flips, with 50/50 chances of success. Along those lines we have coding from http://forumserver.twoplustwo.com/25/probability/flipping-coins-getting-3-row-1233506/ that determines the probability of a series of heads or tails in a coin flip, with a trigger that can be used if you know the coin/game is biased towards one result or the other.

We have the following script (shown here using numpy):

##############################################
# Biased/unbiased  recursion of heads OR tails
##############################################
import numpy as np
import math

N = 14     # number of flips
m = 3      # length of run (must be  > 1 and <= N/2)
p = 0.5   # P(heads)

# index n holds the result for n flips, so allocate N+1 slots
prob = np.repeat(0.0,N+1)
h = np.repeat(0.0,N+1)
t = np.repeat(0.0,N+1)

h[m] = math.pow(p,m)
t[m] = math.pow(1-p,m)
prob[m] = h[m] + t[m]

for n in range(m+1,2*m):
  h[n] = (1-p)*math.pow(p,m)
  t[n] = p*math.pow(1-p,m)
  prob[n] = prob[n-1] + h[n] + t[n]

# include n = N itself, so the final entry covers all N flips
for n in range(2*m,N+1):
  h[n] = ((1-p) - t[n-m] - prob[n-m-1]*(1-p))*math.pow(p,m)
  t[n] = (p - h[n-m] - prob[n-m-1]*p)*math.pow(1-p,m)
  prob[n] = prob[n-1] + h[n] + t[n]

prob[N]

The preceding code produces the following output in Jupyter:

We end up with the probability of getting a run of three heads or three tails in a row with an unbiased game. In this case, there is roughly a 92% chance of such a run within the 14 flips considered.
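
For a small number of flips, a recursion like this can be sanity-checked by brute force. The following sketch enumerates every possible sequence of fair coin flips and counts those containing a run; it covers only the unbiased case (p = 0.5), and the function names has_run and run_probability are illustrative.

```python
from itertools import groupby, product

def has_run(flips, m):
    """True if `flips` contains m or more identical outcomes in a row."""
    return any(len(list(group)) >= m for _, group in groupby(flips))

def run_probability(N, m):
    """Probability of a run of m heads or m tails in N fair coin flips."""
    hits = sum(has_run(flips, m) for flips in product("HT", repeat=N))
    return hits / 2 ** N

# Every one of the 2**14 sequences is checked
print(run_probability(14, 3))
```

Enumeration grows as 2 to the power N, so it is only practical for short sequences, which is exactly why the recursive form is useful.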

Insurance, R - non-life insurance pricing

We have an example of using R to come up with the pricing for non-life products, specifically mopeds, at http://www.cybaea.net/journal/2012/03/13/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM/. The code first creates a table of the statistics available for the product line, then compares the pricing to actual statistics in use.

The first part of the code that accumulates the data is as follows:

con <- url("http://www2.math.su.se/~esbj/GLMbook/moppe.sas")
data <- readLines(con, n = 200L, warn = FALSE, encoding = "unknown")
close(con)
## Find the data range
data.start <- grep("^cards;", data) + 1L
data.end <- grep("^;", data[data.start:999L]) + data.start - 2L
table.1.2 <- read.table(text = data[data.start:data.end],
                        header = FALSE,
                        sep = "",
                        quote = "",
                        col.names = c("premiekl", "moptva", "zon", "dur",
                                      "medskad", "antskad", "riskpre", "helpre", "cell"),
                        na.strings = NULL,
                        colClasses = c(rep("factor", 3), "numeric",
                                       rep("integer", 4), "NULL"),
                        comment.char = "")
rm(con, data, data.start, data.end)
# Remainder of Script adds comments/descriptions
comment(table.1.2) <-
c("Title: Partial casco moped insurance from Wasa insurance, 1994--1999",
"Source: http://www2.math.su.se/~esbj/GLMbook/moppe.sas",
"Copyright: http://www2.math.su.se/~esbj/GLMbook/")
## See the SAS code for this derived field
table.1.2$skadfre = with(table.1.2, antskad / dur)
## English language column names as comments:
comment(table.1.2$premiekl) <-
c("Name: Class",
"Code: 1=Weight over 60kg and more than 2 gears",
"Code: 2=Other")
comment(table.1.2$moptva) <-
c("Name: Age",
"Code: 1=At most 1 year",
"Code: 2=2 years or more")
comment(table.1.2$zon) <-
c("Name: Zone",
"Code: 1=Central and semi-central parts of Sweden's three largest cities",
"Code: 2=suburbs and middle-sized towns",
"Code: 3=Lesser towns, except those in 5 or 7",
"Code: 4=Small towns and countryside, except 5--7",
"Code: 5=Northern towns",
"Code: 6=Northern countryside",
"Code: 7=Gotland (Sweden's largest island)")
comment(table.1.2$dur) <-
c("Name: Duration",
"Unit: year")
comment(table.1.2$medskad) <-
c("Name: Claim severity",
"Unit: SEK")
comment(table.1.2$antskad) <- "Name: No. claims"
comment(table.1.2$riskpre) <-
c("Name: Pure premium",
"Unit: SEK")
comment(table.1.2$helpre) <-
c("Name: Actual premium",
"Note: The premium for one year according to the tariff in force 1999",
"Unit: SEK")
comment(table.1.2$skadfre) <-
c("Name: Claim frequency",
"Unit: /year")
## Save results for later
save(table.1.2, file = "table.1.2.RData")
## Print the table (not as pretty as the book)
print(table.1.2)

The resultant first 10 rows of the table are as follows:

       premiekl moptva zon    dur medskad antskad riskpre helpre    skadfre
    1         1      1   1   62.9   18256      17    4936   2049 0.27027027
    2         1      1   2  112.9   13632       7     845   1230 0.06200177
    3         1      1   3  133.1   20877       9    1411    762 0.06761833
    4         1      1   4  376.6   13045       7     242    396 0.01858736
    5         1      1   5    9.4       0       0       0    990 0.00000000
    6         1      1   6   70.8   15000       1     212    594 0.01412429
    7         1      1   7    4.4    8018       1    1829    396 0.22727273
    8         1      2   1  352.1    8232      52    1216   1229 0.14768532
    9         1      2   2  840.1    7418      69     609    738 0.08213308
    10        1      2   3 1378.3    7318      75     398    457 0.05441486

Then, we go through each product/statistic to determine whether the pricing for a product is in line with the others. Note the repos = argument on the install.packages statement: it names the CRAN mirror to download from, so R does not have to prompt for one:

# make sure the packages we want to use are installed
install.packages(c("data.table", "foreach", "ggplot2"), dependencies = TRUE, repos = "http://cran.us.r-project.org")
# load the data table we need
if (!exists("table.1.2"))
    load("table.1.2.RData")
library("foreach")
## We are looking to reproduce table 2.7 which we start building here,
## add columns for our results.
table27 <-
    data.frame(rating.factor =
                   c(rep("Vehicle class", nlevels(table.1.2$premiekl)),
                     rep("Vehicle age", nlevels(table.1.2$moptva)),
                     rep("Zone", nlevels(table.1.2$zon))),
               class =
                   c(levels(table.1.2$premiekl),
                     levels(table.1.2$moptva),
                     levels(table.1.2$zon)),
               stringsAsFactors = FALSE)
## Calculate duration per rating factor level and also set the
## contrasts (using the same idiom as in the code for the previous
## chapter). We use foreach here to execute the loop both for its
## side-effect (setting the contrasts) and to accumulate the sums.
# new.cols are set to claims, sums, levels
new.cols <-
    foreach (rating.factor = c("premiekl", "moptva", "zon"),
             .combine = rbind) %do%
    {
        nclaims <- tapply(table.1.2$antskad, table.1.2[[rating.factor]], sum)
        sums <- tapply(table.1.2$dur, table.1.2[[rating.factor]], sum)
        n.levels <- nlevels(table.1.2[[rating.factor]])
        contrasts(table.1.2[[rating.factor]]) <-
            contr.treatment(n.levels)[rank(-sums, ties.method = "first"), ]
        data.frame(duration = sums, n.claims = nclaims)
    }
table27 <- cbind(table27, new.cols)
rm(new.cols)
#build frequency distribution
model.frequency <-
    glm(antskad ~ premiekl + moptva + zon + offset(log(dur)),
        data = table.1.2, family = poisson)
rels <- coef( model.frequency )
rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
table27$rels.frequency <-
    c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
      c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
      c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])
# note the severities involved
model.severity <-
    glm(medskad ~ premiekl + moptva + zon,
        data = table.1.2[table.1.2$medskad > 0, ],
        family = Gamma("log"), weights = antskad)
rels <- coef( model.severity )
rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
## Aside: For the canonical link function use
## rels <- rels[1] / (rels[1] + rels[-1])
table27$rels.severity <-
    c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
      c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
      c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])
table27$rels.pure.premium <- with(table27, rels.frequency * rels.severity)
print(table27, digits = 2)

The resultant display is as follows:

       rating.factor class duration n.claims rels.frequency rels.severity
    1  Vehicle class     1     9833      391           1.00          1.00
    2  Vehicle class     2     8825      395           0.78          0.55
    11   Vehicle age     1     1918      141           1.55          1.79
    21   Vehicle age     2    16740      645           1.00          1.00
    12          Zone     1     1451      206           7.10          1.21
    22          Zone     2     2486      209           4.17          1.07
    3           Zone     3     2889      132           2.23          1.07
    4           Zone     4    10069      207           1.00          1.00
    5           Zone     5      246        6           1.20          1.21
    6           Zone     6     1369       23           0.79          0.98
    7           Zone     7      148        3           1.00          1.20
       rels.pure.premium
    1               1.00
    2               0.42
    11              2.78
    21              1.00
    12              8.62
    22              4.48
    3               2.38
    4               1.00
    5               1.46
    6               0.78
    7               1.20

Here, we can see that some levels (rows 2 and 6) are priced very low in comparison to the statistics for that level, whereas others are overpriced (rows 12 and 22).
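
The rels columns are relativities: each level's fitted value divided by that of the base level. With a log link, the script's expression exp(rels[1] + rels[-1]) / exp(rels[1]) reduces to the exponential of each coefficient, because the intercept cancels. A small sketch of that arithmetic follows; the coefficient values and level names are invented purely for illustration and are not taken from the moped data.

```python
import math

# Hypothetical log-link GLM coefficients: an intercept plus one
# coefficient per non-base level (values invented for illustration).
intercept = -1.5
coefficients = {"level_a": 0.44, "level_b": -0.25}

def relativities(intercept, coefficients):
    """Relativity of each level against the base level (base = 1.0)."""
    base = math.exp(intercept)
    return {name: math.exp(intercept + beta) / base
            for name, beta in coefficients.items()}

rels = relativities(intercept, coefficients)
# The intercept cancels, so each relativity is simply exp(coefficient).
for name, beta in coefficients.items():
    assert math.isclose(rels[name], math.exp(beta))
print(rels)
```

A relativity above 1 means the level is riskier than the base level, and the pure premium relativity is the product of the frequency and severity relativities, as in the last line of the R script.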

Consumer products, R - marketing effectiveness

We take the example from a presentation I made at www.dantoomeysoftware.com/Using_R_for_Marketing_Research.pptx looking at the effectiveness of different ad campaigns for grape fruit juice.

The code is as follows:

#library(s20x)
library(car)

#read the dataset from an existing .csv file
df <- read.csv("C:/Users/Dan/grapeJuice.csv",header=T)

#list the name of each variable (data column) and the first six rows of the dataset
head(df)

# basic statistics of the variables
summary(df)

#set the 1 by 2 layout plot window
par(mfrow = c(1,2))

# boxplot to check if there are outliers
boxplot(df$sales,horizontal = TRUE, xlab="sales")

# histogram to explore the data distribution shape
hist(df$sales,main="",xlab="sales",prob=T)
lines(density(df$sales),lty="dashed",lwd=2.5,col="red")

#divide the dataset into two sub dataset by ad_type
sales_ad_nature = subset(df,ad_type==0)
sales_ad_family = subset(df,ad_type==1)

#calculate the mean of sales with different ad_type
mean(sales_ad_nature$sales)
mean(sales_ad_family$sales)

#set the 1 by 2 layout plot window
par(mfrow = c(1,2))

# histogram to explore the data distribution shapes
hist(sales_ad_nature$sales,main="",xlab="sales with nature production theme ad",prob=T)
lines(density(sales_ad_nature$sales),lty="dashed",lwd=2.5,col="red")

hist(sales_ad_family$sales,main="",xlab="sales with family health caring theme ad",prob=T)
lines(density(sales_ad_family$sales),lty="dashed",lwd=2.5,col="red")  

With output (several sections):

(raw data from file, first six rows):

      sales price ad_type price_apple price_cookies
    1   222  9.83       0        7.36           8.8
    2   201  9.72       1        7.43          9.62
    3   247 10.15       1        7.66           8.9
    4   169 10.04       0        7.57         10.26
    5   317  8.38       1        7.33          9.54
    6   227  9.74       0        7.51          9.49


Statistics on the data are as follows:

         sales           price           ad_type     price_apple   
     Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300  
     1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438  
     Median :204.5   Median : 9.855   Median :0.5   Median :7.580  
     Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659  
     3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805  
     Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290  
     price_cookies   
     Min.   : 8.790  
     1st Qu.: 9.190  
     Median : 9.515  
     Mean   : 9.622  
     3rd Qu.:10.140  
     Max.   :10.580  

The data shows the effectiveness of each campaign. Family sales are more effective:

  • 186.666666666667 (mean of nature sales)
  • 246.666666666667 (mean of family sales)
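
The same grouping logic can be sketched in a few lines. This example uses only the six rows displayed by head(df), so its means differ from the full-data figures listed here; the function name mean_sales_by_ad_type is illustrative.

```python
from statistics import mean

# (sales, ad_type) pairs copied from the six rows displayed by head(df);
# the full grapeJuice.csv file has more rows, so these means differ
# from the means computed over the whole file.
rows = [(222, 0), (201, 1), (247, 1), (169, 0), (317, 1), (227, 0)]

def mean_sales_by_ad_type(rows):
    """Average sales for each ad_type value."""
    groups = {}
    for sales, ad_type in rows:
        groups.setdefault(ad_type, []).append(sales)
    return {ad_type: mean(values) for ad_type, values in groups.items()}

print(mean_sales_by_ad_type(rows))
```

Even on this small sample, the family-themed ad (ad_type 1) shows the higher average, in line with the full-data result.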

The difference is more pronounced on the histogram displays:

Using Docker with Jupyter

Docker is a mechanism that allows you to have many complete virtual instances of an application in one machine. Docker is used by many software firms to provide a fully scalable implementation of their services, and support as many concurrent users as needed.

Prior mechanisms for dealing with multiple instances shared common resources (such as disk address space). Under Docker, each instance is a complete entity separate from all others.

Implementing Jupyter on a Docker environment allows multiple users to access their own Jupyter instance, without having to worry about interfering with someone else's calculations.

The key feature of Docker is that a variable number of instances of your notebook can be in use at any time. The Docker control system can be set up to create a new instance for every user that accesses your notebook. All of this is built into Docker without programming; you use the user interface to decide how instances are created.

There are two ways you can use Docker:

  • From a public service
  • Installing Docker on your machine

Using a public Docker service

Several services are available, and they work in much the same way: sign up for the service, upload your notebook, and monitor usage (the Docker control program tracks usage automatically). For example, if we use https://hub.docker.com/ we are really using a version repository for our notebook. Versioning is used in software development to track changes made over time, and the repository also lets you assign access privileges to multiple users:

  1. First, sign up. This provides authentication to the service vendor.
  2. Create a repository, where you will keep your version of the notebook.
  3. You will need Docker installed on your machine to pull/push notebooks from/to your repository. Installing Docker is operating system dependent. Go to the https://www.docker.com/ home page for instructions for your machine.
  4. Upload (push) your Jupyter image to your repository.
  5. Access your notebook in the repository. You can share the address (URL) of your notebook with others under control of Docker, granting specific access rights to different users.
  6. From then on, it will work just as if it were running locally.
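
The upload step in the list above can be sketched as a short Python script that assembles the standard docker login, docker tag, and docker push calls. The image name, account name, and repository name are placeholders rather than values from this chapter, and the script only prints the commands unless dry_run is switched off.

```python
import subprocess


def docker_publish_commands(local_image, hub_user, repository, tag="latest"):
    """Return the docker CLI calls that tag a local image and push it to a registry."""
    remote = f"{hub_user}/{repository}:{tag}"
    return [
        ["docker", "login"],                     # authenticate with the service
        ["docker", "tag", local_image, remote],  # name the image for the repository
        ["docker", "push", remote],              # upload the image
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # requires Docker to be installed


# Placeholder names, chosen only for illustration
commands = docker_publish_commands("my-jupyter-image", "example-user", "my-notebooks")
run_all(commands)  # dry run: nothing is executed
```

Keeping the command list separate from its execution makes it easy to review what would be pushed before touching the repository.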

Installing Docker on your machine

Another option is to install Docker on your own machine. It works exactly like the previous case, except that you manage the Docker image space yourself.

Unless the machine is accessible by others, a local Docker installation would only be a precursor to posting on a public Docker service.

How to share notebooks with others

There are several ways to share Jupyter Notebooks with others:

  • Email
  • Place onto Google Drive
  • Share on GitHub
  • Store as HTML on a web server
  • Install Jupyter on a web server

Can you email a notebook?

A notebook file (.ipynb) is a JSON text document. To email it, the notebook is encoded for transport, sent to the recipient as an attachment, and then decoded by the recipient back into a notebook file.

Email attachments are normally converted to a well-defined MIME (Multipurpose Internet Mail Extensions) format. The nb2mail program converts a notebook to a MIME message; it is available at https://github.com/nfultz/nb2mail.

Usage is as follows:

  • Install nb2mail using pip command (see website)
  • Convert your selected notebook to MIME format
  • Send to recipient
  • The recipient MIME conversion process will store the file in the correct fashion (assuming they have also installed nb2mail)
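
To illustrate what MIME packaging of a notebook looks like, here is a minimal sketch that uses Python's standard email library rather than nb2mail. The addresses, the file name, and the empty stand-in notebook are all hypothetical, and the message is only built, not sent.

```python
import json
from email.message import EmailMessage


def notebook_email(notebook_json, filename, sender, recipient):
    """Build a MIME message that carries a notebook as an attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Notebook: {filename}"
    msg.set_content("The notebook is attached.")
    # Notebooks are JSON text; attach the bytes under the Jupyter MIME type
    msg.add_attachment(
        notebook_json.encode("utf-8"),
        maintype="application",
        subtype="x-ipynb+json",
        filename=filename,
    )
    return msg


# A minimal, empty nbformat 4 notebook used as stand-in content
empty_notebook = json.dumps(
    {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
)
message = notebook_email(
    empty_notebook, "analysis.ipynb", "me@example.com", "you@example.com"
)
attachment = next(message.iter_attachments())
print(message.get_content_type(), attachment.get_filename())
```

A message built this way could be handed to smtplib for delivery; the recipient then saves the attachment and opens it in Jupyter as usual.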

Sharing a notebook on Google Drive

Google Drive can be used to store your notebook profile information. This might be used when combined with the previous emailing of a notebook to another user. The recipient could use a Google Drive profile that would preclude anyone without the profile information from interacting with the notebook.

You install the Python extension (from https://github.com/jupyter/jupyter-drive) using pip and then python -m. From then on, you access the notebooks with the Google Drive profiles, as ipython notebook --profile <profilename>.

Sharing on GitHub

GitHub (and others) allows you to place a notebook on their servers; once there, it can be accessed directly using nbviewer. The server has the Python (and other language) support installed that is needed to render your notebook. nbviewer provides a read-only view of your notebook and is not interactive.

The nbviewer is available at https://github.com/jupyter/nbviewer. The site includes specific parameters which need to be added to the ipython notebook command, such as the command to start the viewer.

Store as HTML on a web server

A built-in feature of notebooks is to export the notebook into different formats. One of those is HTML. In this manner, you could export the notebook into HTML and copy the file(s) onto your web server as changes are made.

The command is jupyter nbconvert <notebook name>.ipynb --to html.

Again, this would be a non-interactive, read-only version of your notebook.
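
The export-and-copy routine can be scripted so that it is easy to repeat whenever the notebook changes. The sketch below wraps the nbconvert command in Python; the notebook name and the web server directory are placeholders, and the command is only printed unless dry_run is switched off.

```python
import subprocess
from pathlib import Path


def html_export_command(notebook, web_root):
    """Return the nbconvert call that writes the notebook as HTML into web_root."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--output-dir", str(web_root),  # write the HTML file straight to this folder
        str(notebook),
    ]


def publish(notebook, web_root, dry_run=True):
    """Print the export command; run it only when dry_run is False."""
    cmd = html_export_command(Path(notebook), Path(web_root))
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # requires jupyter and nbconvert
    return cmd


# Placeholder notebook name and web server directory
cmd = publish("analysis.ipynb", "/var/www/html")
```

Writing directly into the web server directory with --output-dir avoids a separate copy step.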

Install Jupyter on a web server

Jupyter is deployed as a web application. If you have direct access to a web server, you could install Jupyter on it and create notebooks there; those notebooks would then be available to others and completely dynamic.

Because you control access to the web server, you can also control who can access your notebook.

This is an advanced interaction that would require working with your webmaster to determine the correct approach.

How can you secure a notebook?

There are two aspects to security in Jupyter Notebooks:

  • Making sure only specific users can access your notebook
  • Making sure your notebook is not used to host malicious coding

Access control

While many of the uses of Jupyter are solely for educating others, there are instances where the information being accessed is and should remain confidential. Jupyter allows you to put up barriers to entry to your notebook in several manners.

When we identify the user, we are authenticating that user. This is normally done by presenting a login challenge before allowing entry, where the user has to enter a username and password.

If the instance of Jupyter hosting your notebook is installed on a web server, you can use the web server's access control to limit access to your notebook. Further, most of the vendors that support notebook hosting provide a mechanism to limit access to specific users.
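
To make the idea of a login challenge concrete, here is a minimal sketch of username and password checking with salted hashes from Python's standard library. It is not Jupyter's or any web server's own login mechanism; the user name, password, and iteration count are illustrative assumptions.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Derive a salted PBKDF2 hash so the plain password is never stored."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def check_login(users, username, password):
    """Return True only when the user exists and the password matches."""
    record = users.get(username)
    if record is None:
        return False
    salt, iterations, expected = record
    _, _, candidate = hash_password(password, salt, iterations)
    # Constant-time comparison avoids leaking how many bytes matched
    return hmac.compare_digest(candidate, expected)


# Hypothetical user store with a single account
users = {"analyst": hash_password("correct horse battery staple")}
print(check_login(users, "analyst", "correct horse battery staple"))
print(check_login(users, "analyst", "wrong password"))
```

In practice this check is delegated to the web server or hosting vendor, as described above, rather than written by hand.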

Malicious content

The other aspect of security is to make sure the contents of your notebooks are not malicious. You should make sure your notebook is safe, as follows:

  • Ensure that HTML is sanitized (looking for malicious HTML coding and subverting it)
  • Do not allow your notebook to execute external JavaScript
  • Check that cell contents that may be malicious are challenged in a server environment
  • Sanitize output of cells so as not to produce unwanted effects on user machines
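
As one hedged illustration of output sanitizing, the sketch below removes HTML, JavaScript, and SVG output data from a notebook loaded as a Python dictionary in the nbformat 4 layout. The list of output types treated as active is an assumption chosen for illustration and is not exhaustive; this is not Jupyter's own sanitizer.

```python
import copy

# Output MIME types that can carry active content (illustrative, not exhaustive)
ACTIVE_MIME_TYPES = {"text/html", "application/javascript", "image/svg+xml"}


def strip_active_outputs(notebook):
    """Return a copy of an nbformat 4 notebook dict without active output data."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # only code cells have outputs
        for output in cell.get("outputs", []):
            data = output.get("data", {})  # stream and error outputs have no data
            for mime in ACTIVE_MIME_TYPES.intersection(data):
                del data[mime]
    return cleaned


# A tiny hand-written notebook with one suspicious HTML output
notebook = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [{
        "cell_type": "code", "execution_count": 1, "metadata": {},
        "source": "display(widget)",
        "outputs": [{
            "output_type": "display_data", "metadata": {},
            "data": {
                "text/plain": "<widget>",
                "text/html": "<script>alert('x')</script>",
            },
        }],
    }],
}
safe = strip_active_outputs(notebook)
print(sorted(safe["cells"][0]["outputs"][0]["data"]))
```

The plain-text representation is kept, so the cleaned notebook still shows something for every output while the original dictionary is left untouched.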

Summary

In this chapter, we looked into the details of the Jupyter user interface: what objects it works with, what actions can be taken by Jupyter, what the display tells us about the data, and what tools are available. Next, we looked at some real-life examples showing R and Python coding from several industries. Then we saw some of the ways to share our notebook with other users and, correspondingly, how to protect our notebook with different security mechanisms.

In the next chapter, we will see how far we can go using Python in a Jupyter Notebook.


Key benefits

  • Get the most out of your Jupyter notebook to complete the trickiest of tasks in Data Science
  • Learn all the tasks in the data science pipeline—from data acquisition to visualization—and implement them using Jupyter
  • Get ahead of the curve by mastering all the applications of Jupyter for data science with this unique and intuitive guide

Description

Jupyter Notebook is a web-based environment that enables interactive computing in notebook documents. It allows you to create documents that contain live code, equations, and visualizations. This book is a comprehensive guide to getting started with data science using the popular Jupyter notebook. If you are familiar with Jupyter notebook and want to learn how to use its capabilities to perform various data science tasks, this is the book for you! From data exploration to visualization, this book will take you through every step of the way in implementing an effective data science pipeline using Jupyter. You will also see how you can utilize Jupyter's features to share your documents and codes with your colleagues. The book also explains how Python 3, R, and Julia can be integrated with Jupyter for various data science tasks. By the end of this book, you will comfortably leverage the power of Jupyter to perform various tasks in data science successfully.

Who is this book for?

This book targets students and professionals who wish to master the use of Jupyter to perform a variety of data science tasks. Some programming experience with R or Python, and some basic understanding of Jupyter, is all you need to get started with this book.

What you will learn

  • Understand why Jupyter notebooks are a perfect fit for your data science tasks
  • Perform scientific computing and data analysis tasks with Jupyter
  • Interpret and explore different kinds of data visually with charts, histograms, and more
  • Extend SQL's capabilities with Jupyter notebooks
  • Combine the power of R and Python 3 with Jupyter to create dynamic notebooks
  • Create interactive dashboards and dynamic presentations
  • Master the best coding practices and deploy your Jupyter notebooks efficiently

Product Details

Publication date: Oct 20, 2017
Length: 242 pages
Edition: 1st
Language: English
ISBN-13: 9781785880070


Table of Contents

10 Chapters

  1. Jupyter and Data Science
  2. Working with Analytical Data on Jupyter
  3. Data Visualization and Prediction
  4. Data Mining and SQL Queries
  5. R with Jupyter
  6. Data Wrangling
  7. Jupyter Dashboards
  8. Statistical Modeling
  9. Machine Learning Using Jupyter
  10. Optimizing Jupyter Notebooks

Customer reviews

Rating distribution
Average rating: 1 out of 5 (2 ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 0%
1 star 100%
Steve Gailey Nov 18, 2017
Rating: 1 out of 5
I don't think I have ever read a book so devoid of actual content. The book starts with an overly simplistic introduction to Jupiter which consists of nothing more than a brief explanation of each menu option. No discussion of distributions, installation or configuration options. No discussion of loading alternate kernels etc. So it is clearly for beginners. Except the book then launches into and example of using the Black Scholes model for option call pricing. No real explanation of what that is or how it works. So this is a finance book for experts? No because two pages further on and we are doing gamblin analysis in R. Listing after listing on apparently random subjects with no introduction, no explanation of either the code or the concepts. They are irrelevant to Jupiter for the most part and I can't see what they are trying to teach you. Truly a dreadful book - It is the first and only time I wish I had never waster my money on a book. I hope I never meet the author at a party - He must be real hard work to talk to!
Amazon Verified review
Jesse Lethe Dec 07, 2017
Rating: 1 out of 5
Text mixes Python and R on the examples. This makes it hard for those knowing only one of the two languages. Furthermore, author jumps into different subjects, expanding too much time on topics that are irrelevant to either Jupyter or Data Sciences.
Amazon Verified review