Machine Learning with R

You're reading from Machine Learning with R. R gives you access to the cutting-edge software you need to prepare data for machine learning. No previous knowledge required – this book will take you methodically through every stage of applying machine learning.

Product type Paperback
Published in Oct 2013
Publisher Packt
ISBN-13 9781782162148
Length 396 pages
Edition 1st Edition
Author: Brett Lantz

Table of Contents (19 chapters)

Machine Learning with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Introducing Machine Learning
2. Managing and Understanding Data
3. Lazy Learning – Classification Using Nearest Neighbors
4. Probabilistic Learning – Classification Using Naive Bayes
5. Divide and Conquer – Classification Using Decision Trees and Rules
6. Forecasting Numeric Data – Regression Methods
7. Black Box Methods – Neural Networks and Support Vector Machines
8. Finding Patterns – Market Basket Analysis Using Association Rules
9. Finding Groups of Data – Clustering with k-means
10. Evaluating Model Performance
11. Improving Model Performance
12. Specialized Machine Learning Topics
Index

Index

A

  • = assignment operator / Vectors
  • abline() function / ROC curves
  • abstraction process
    • about / Abstraction and knowledge representation
  • actionable associations / Step 4 – evaluating model performance
  • activation function
    • about / From biological to artificial neurons, Activation functions
    • threshold activation function / Activation functions
    • unit step activation function / Activation functions
    • sigmoid activation function / Activation functions
  • AdaBoost
    • about / Boosting
  • adaptive boosting
    • about / Boosting the accuracy of decision trees
  • aggregate() function / Step 5 – improving model performance
  • aggregate function / Bagging
  • apply() function / Data preparation – creating indicator features for frequent words
  • appropriate k
    • selecting / Choosing an appropriate k
  • Apriori
    • about / The Apriori algorithm for association rule learning
  • apriori() function / Step 3 – training a model on the data
  • Apriori algorithm
    • for association rule learning / The Apriori algorithm for association rule learning
    • strengths / The Apriori algorithm for association rule learning
    • weaknesses / The Apriori algorithm for association rule learning
  • Apriori principle
    • used, for building set of rules / Building a set of rules with the Apriori principle
  • Apriori property
    • about / The Apriori algorithm for association rule learning
  • array
    • about / R data structures, Matrixes and arrays
  • Artificial Neural Network (ANN)
    • about / Understanding neural networks
    • applications / Understanding neural networks
  • association rules
    • about / Understanding association rules
    • potential applications / Understanding association rules
    • rule interest, measuring / Measuring rule interest – support and confidence
    • set of rules, building with Apriori principle / Building a set of rules with the Apriori principle
    • frequently purchased groceries, identifying with / Example – identifying frequently purchased groceries with association rules
  • automated parameter tuning
    • caret package used / Using caret for automated parameter tuning
    • requisites / Using caret for automated parameter tuning
  • axon
    • about / From biological to artificial neurons

B

  • 0.632 bootstrap / Bootstrap sampling
  • backpropagation
    • neural networks, training with / Training neural networks with backpropagation
    • about / Training neural networks with backpropagation
  • backpropagation algorithm
    • strengths / Training neural networks with backpropagation
    • weaknesses / Training neural networks with backpropagation
  • bag() function / Bagging
  • bag-of-words / Step 2 – exploring and preparing the data
  • bagging
    • about / Bagging
  • bagging() function
    • about / Bagging
  • bank loans example, with C5.0 decision trees
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data
    • random training datasets, creating / Data preparation – creating random training and test datasets
    • test datasets, creating / Data preparation – creating random training and test datasets
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • basic concepts, Bayesian methods
    • about / Basic concepts of Bayesian methods
    • probability / Probability
    • joint probability / Joint probability
    • conditional probability / Conditional probability with Bayes' theorem
  • Bayesian classifiers
    • uses / Understanding naive Bayes
  • Bayesian methods
    • about / Understanding naive Bayes
    • basic concepts / Basic concepts of Bayesian methods
  • benefits, machine learning / Uses and abuses of machine learning
  • bias
    • about / Generalization
  • bias-variance tradeoff
    • about / Choosing an appropriate k
  • biganalytics package / Using massive matrices with bigmemory
  • bigkmeans() function / Using massive matrices with bigmemory
  • biglm() function
    • about / Building bigger regression models with biglm
  • biglm package
    • about / Building bigger regression models with biglm
    • regression model, building / Building bigger regression models with biglm
  • bigmemory package
    • about / Using massive matrices with bigmemory
    • URL, for documentation / Using massive matrices with bigmemory
    • massive matrices, using with / Using massive matrices with bigmemory
  • bigrf package
    • about / Growing bigger and faster random forests with bigrf
    • random forests, building / Growing bigger and faster random forests with bigrf
  • bimodal / Measuring the central tendency – the mode
  • binning
    • about / Using numeric features with naive Bayes
  • bins
    • about / Visualizing numeric variables – histograms
  • Bioconductor project
    • about / Working with bioinformatics data
    • URL / Working with bioinformatics data
  • bioinformatics data
    • working with / Working with bioinformatics data
  • bivariate relationships
    • about / Exploring relationships between variables
  • blind tasting experience example / The kNN algorithm
  • body mass index (BMI) / Step 1 – collecting data
  • boosting
    • about / Boosting
  • bootstrap aggregating
    • about / Bagging
  • bootstrap sampling / Bootstrap sampling
  • box-and-whiskers plot
    • about / Visualizing numeric variables – boxplots
  • boxplot
    • about / Visualizing numeric variables – boxplots
  • boxplot() function / Visualizing numeric variables – boxplots
  • branches
    • about / Understanding decision trees
  • breast cancer
    • diagnosing, with kNN algorithm / Diagnosing breast cancer with the kNN algorithm
  • breast cancer example
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance, Transformation – z-score standardization

C

  • c() function / Vectors
  • C5.0 algorithm
    • about / The C5.0 decision tree algorithm
    • strengths / The C5.0 decision tree algorithm
    • weaknesses / The C5.0 decision tree algorithm
    • split, selecting / Choosing the best split
    • decision tree, pruning / Pruning the decision tree
  • caret character / Ordinary least squares estimation
  • caret package
    • using, for automated parameter tuning / Using caret for automated parameter tuning
    • about / Using caret for automated parameter tuning, Training and evaluating models in parallel with caret
  • categorical variables
    • about / Exploring categorical variables
    • exploring / Exploring categorical variables
    • central tendency, measuring / Measuring the central tendency – the mode
  • cbind() function / Multiple linear regression
  • central tendency
    • measuring / Measuring the central tendency – mean and median
  • centroid / Using distance to assign and update clusters
  • characteristics, neural networks
    • activation function / From biological to artificial neurons, Activation functions
    • network topology / From biological to artificial neurons, Network topology, The number of layers, The direction of information travel, The number of nodes in each layer
    • training algorithm / From biological to artificial neurons, Training neural networks with backpropagation
  • character vectors
    • about / Factors
  • Chi-Squared statistic
    • about / Choosing the best split
  • classification
    • about / Thinking about types of machine learning algorithms
    • nearest neighbors used / Understanding classification using nearest neighbors
  • classification performance
    • measuring / Measuring performance for classification
  • classification prediction data
    • working with / Working with classification prediction data in R
  • classification rules
    • about / Understanding classification rules
    • separate-and-conquer / Separate and conquer
    • One Rule algorithm / The One Rule algorithm
    • RIPPER algorithm / The RIPPER algorithm
    • obtaining, from decision trees / Rules from decision trees
  • cluster
    • about / Understanding clustering
  • clustering
    • about / Thinking about types of machine learning algorithms, Understanding clustering
    • applications / Understanding clustering
    • as machine learning task / Clustering as a machine learning task
  • clustering, k-means algorithm
    • about / The k-means algorithm for clustering
    • distance, used for assigning cluster / Using distance to assign and update clusters
    • distance, used for updating cluster / Using distance to assign and update clusters
    • appropriate number of clusters, selecting / Choosing the appropriate number of clusters
  • clusters / Learning faster with parallel computing
  • column-major order
    • about / Matrixes and arrays
  • combine function / Vectors
  • components, machine learning
    • generalization / Generalization
    • success of learning, assessing / Assessing the success of learning
  • components, machine learning
    • data input / How do machines learn?
    • abstraction / How do machines learn?, Abstraction and knowledge representation
    • generalization / How do machines learn?
    • knowledge representation / Abstraction and knowledge representation
  • Comprehensive R Archive Network (CRAN)
    • about / Using R for machine learning, Step 3 – training a model on the data
  • concrete strength, modeling with ANNs
    • about / Modeling the strength of concrete with ANNs
    • data, collecting / Step 1 – collecting data
    • data, preparing / Step 2 – exploring and preparing the data
    • data, exploring / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • conditional probability
    • about / Conditional probability with Bayes' theorem
  • confusion matrix
    • about / A closer look at confusion matrices
    • used, for measuring performance / Using confusion matrices to measure performance
  • contingency table
    • about / Examining relationships – two-way cross-tabulations
  • convex hull / The case of linearly separable data
  • cor() command / Exploring relationships among features – the correlation matrix
  • cor() function / Correlations
  • corpus
    • about / Data preparation – processing text data for analysis
  • Corpus() function / Data preparation – processing text data for analysis
  • correlation / Visualizing relationships – scatterplots
    • about / Correlations
  • correlation ellipse / Visualizing relationships among features – the scatterplot matrix
  • correlation matrix / Exploring relationships among features – the correlation matrix
  • cov() function / Ordinary least squares estimation, Correlations
  • covariance
    • about / Ordinary least squares estimation
  • createDataPartition() function / The holdout method
  • cross-validation / Cross-validation
  • crosstab
    • about / Examining relationships – two-way cross-tabulations
  • CrossTable() function / Examining relationships – two-way cross-tabulations, Using confusion matrices to measure performance
  • CSV files
    • data, importing from / Importing and saving data from CSV files
    • about / Importing and saving data from CSV files
    • loading, into R / Importing and saving data from CSV files
  • CUDA
    • about / GPU computing
  • curve() function / Choosing the best split
  • cut points
    • about / Using numeric features with naive Bayes

D

  • data
    • machine learning algorithm, applying to / Steps to apply machine learning to your data
    • managing, with R / Managing data with R
    • importing, from CSV files / Importing and saving data from CSV files
    • importing, from SQL databases / Importing data from SQL databases
    • about / Working with specialized data
    • obtaining, from web / Getting data from the Web with the RCurl package
  • data.frame() function / Data frames
  • data.table package
    • about / Making data frames faster with data.table
  • data dictionary
    • about / Exploring the structure of data
  • data exploration
    • about / Exploring and understanding data
  • data frame
    • about / R data structures, Data frames
    • making faster, with data.table package / Making data frames faster with data.table
  • data mining
    • about / The origins of machine learning
  • data munging
    • about / Working with specialized data
  • data preparation, breast cancer example
    • training datasets, creating / Data preparation – creating training and test datasets
    • test datasets, creating / Data preparation – creating training and test datasets
  • data structures, R
    • about / R data structures
    • vector / Vectors
    • factor / Factors, Lists
    • data frame / Data frames
    • matrix / Matrixes and arrays
    • array / Matrixes and arrays
    • saving / Saving and loading R data structures
    • loading / Saving and loading R data structures
    • exploring / Exploring the structure of data
  • DBMS
    • about / Importing data from SQL databases
  • decision nodes
    • about / Understanding decision trees
  • decision tree
    • about / Understanding decision trees, Example – identifying risky bank loans using C5.0 decision trees
    • potential uses / Understanding decision trees
    • divide-and-conquer / Divide and conquer
    • pruning / Pruning the decision tree
    • used, for identifying risky bank loans / Example – identifying risky bank loans using C5.0 decision trees, Step 1 – collecting data
    • accuracy, boosting / Boosting the accuracy of decision trees
  • decision tree forests
    • about / Random forests
  • decision trees
    • classification rules, obtaining from / Rules from decision trees
  • deep learning / The direction of information travel
  • delimiter
    • about / Importing and saving data from CSV files
  • dendrites
    • about / From biological to artificial neurons
  • dependent events
    • about / Joint probability
  • dependent variable / Visualizing relationships – scatterplots
    • about / Understanding regression
  • descriptive model
    • about / Thinking about types of machine learning algorithms
  • diff() function / Measuring spread – quartiles and the five-number summary
  • disk-based data frames
    • creating, with ff package / Creating disk-based data frames with ff
  • distance function
    • about / Calculating distance
  • divide-and-conquer
    • about / Divide and conquer
  • DSN
    • about / Importing data from SQL databases
  • dummy coding
    • about / Preparing data for use with kNN, Step 3 – training a model on the data

E

  • e1071 package
    • naive Bayes classification, with naiveBayes() function / Step 3 – training a model on the data
  • elbow method / Choosing the appropriate number of clusters
  • elbow point / Choosing the appropriate number of clusters
  • elements
    • about / Vectors
  • ensemble methods
    • bagging / Bagging
    • boosting / Boosting
    • random forests / Random forests
  • ensembles
    • about / Understanding ensembles
    • advantages / Understanding ensembles
  • entropy
    • about / Choosing the best split
  • epoch / Training neural networks with backpropagation
  • ethical considerations, machine learning / Ethical considerations
  • Euclidean distance
    • about / Calculating distance
  • Euclidean norm / The case of linearly separable data
  • events
    • about / Basic concepts of Bayesian methods
  • example
    • about / Thinking about the input data

F

  • 10-fold cross-validation
    • about / Cross-validation
  • F-measure
    • about / The F-measure
  • Facebook / Finding teen market segments using k-means clustering
  • factor
    • about / R data structures, Factors, Lists
    • creating, from character vector / Factors
  • factor() function / Factors
  • feature
    • about / Thinking about the input data
  • feedforward networks
    • about / The direction of information travel
  • ff package
    • about / Creating disk-based data frames with ff
    • used, for creating disk-based data frames / Creating disk-based data frames with ff
  • five-number summary
    • about / Measuring spread – quartiles and the five-number summary
  • foreach package
    • about / Working in parallel with foreach, Training and evaluating models in parallel with caret
  • frequently purchased groceries
    • identifying, with association rules / Example – identifying frequently purchased groceries with association rules
  • future performance
    • estimating / Estimating future performance
  • future performance estimation
    • holdout method / The holdout method
    • cross-validation / Cross-validation
    • bootstrap sampling / Bootstrap sampling

G

  • gain ratio
    • about / Choosing the best split
  • Gaussian Radial Basis Function (RBF) kernel / Using kernels for non-linear spaces
  • generalization
    • about / Generalization
  • generalized linear models (GLM)
    • about / Understanding regression
  • Gini index
    • about / Choosing the best split
  • GPU computing
    • about / GPU computing
  • gputools package
    • about / GPU computing
  • gradient descent
    • about / Training neural networks with backpropagation
  • graph data
    • working with / Working with social network data and graph data
  • greedy learners
    • about / Separate and conquer
  • grid
    • about / Using caret for automated parameter tuning

H

  • Hadoop
    • parallel computing / Parallel cloud computing with MapReduce and Hadoop
  • header line
    • about / Importing and saving data from CSV files
  • heuristics
    • about / Generalization
  • hidden layers
    • about / The number of layers
  • hist() function / Visualizing numeric variables – histograms
  • histogram
    • about / Visualizing numeric variables – histograms
  • holdout method / The holdout method
  • human brain / Understanding neural networks
  • hyperplane / Understanding Support Vector Machines

I

  • imputation / Data preparation – imputing missing values
  • Incremental Reduced Error Pruning algorithm (IREP) / The RIPPER algorithm
  • independent events
    • about / Joint probability
  • independent variables
    • about / Understanding regression
  • information gain / Choosing the best split
  • Input Nodes / The number of layers
  • installation, R package / Installing an R package
  • instance-based learning
    • about / Why is the kNN algorithm lazy?
  • interaction
    • about / Model specification – adding interaction effects
  • intercept
    • about / Understanding regression
  • interquartile range (IQR) / Measuring spread – quartiles and the five-number summary
  • ipred package
    • about / Bagging
  • IQR() function / Measuring spread – quartiles and the five-number summary
  • itemFrequencyPlot() function / Visualizing item support – item frequency plots
  • itemset
    • about / Understanding association rules

J

  • joint probability
    • about / Joint probability
  • JRip() classifier / Step 5 – improving model performance
  • JSON
    • about / Reading and writing JSON with the rjson package
    • reading, rjson package used / Reading and writing JSON with the rjson package
    • writing, rjson package used / Reading and writing JSON with the rjson package
    • converting, to R / Reading and writing JSON with the rjson package
  • JSON format
    • URL / Reading and writing JSON with the rjson package

K

  • k-means algorithm
    • about / The k-means algorithm for clustering
    • strengths / The k-means algorithm for clustering
    • weaknesses / The k-means algorithm for clustering
  • kappa statistic
    • about / The kappa statistic
  • kernels
    • using, for non-linear spaces / Using kernels for non-linear spaces
  • kernel trick / Using kernels for non-linear spaces
  • kernlab package / Bagging
  • kmeans() function / Step 3 – training a model on the data
  • knn() function / Step 3 – training a model on the data
  • kNN algorithm
    • about / The kNN algorithm, Step 3 – training a model on the data
    • strengths / The kNN algorithm
    • weaknesses / The kNN algorithm
    • distance, calculating / Calculating distance
    • appropriate k, selecting / Choosing an appropriate k
    • data, preparing / Preparing data for use with kNN
    • used, for diagnosing breast cancer / Diagnosing breast cancer with the kNN algorithm
  • knowledge representation
    • about / Abstraction and knowledge representation
  • ksvm() function / Bagging

L

  • Laplace estimator
    • about / The Laplace estimator
  • lapply() function / Transformation – normalizing numeric data, Transformation – z-score standardization
  • large datasets
    • managing / Managing very large datasets
  • large datasets management
    • about / Managing very large datasets
    • data frames, making faster with data.table package / Making data frames faster with data.table
    • disk-based data frames, creating with ff package / Creating disk-based data frames with ff
    • massive matrices, using with bigmemory package / Using massive matrices with bigmemory
  • layers
    • about / The number of layers
  • lazy learning algorithms / Why is the kNN algorithm lazy?
  • leaf nodes
    • about / Understanding decision trees
  • learning, with parallel computing
    • about / Learning faster with parallel computing
    • execution time, measuring / Measuring execution time
    • working, in parallel with foreach / Working in parallel with foreach
    • multitasking operating system, using with multicore package / Using a multitasking operating system with multicore
    • multiple workstations, networking / Networking multiple workstations with snow and snowfall
    • parallel cloud computing, with MapReduce / Parallel cloud computing with MapReduce and Hadoop
    • parallel cloud computing, with Hadoop / Parallel cloud computing with MapReduce and Hadoop
  • learning rate / Training neural networks with backpropagation
  • left hand side (LHS) / Step 4 – evaluating model performance
  • levels
    • about / Thinking about types of machine learning algorithms
  • likelihood
    • about / Conditional probability with Bayes' theorem
  • likelihood table
    • about / Conditional probability with Bayes' theorem
  • linear kernel / Using kernels for non-linear spaces
  • linearly separable / Classification with hyperplanes
  • linear regression
    • about / Understanding regression
  • link function
    • about / Understanding regression
  • list() function / Lists
  • lists
    • about / R data structures
  • lm() function
    • about / Building bigger regression models with biglm
  • load() command / Saving and loading R data structures
  • loess smooth / Visualizing relationships among features – the scatterplot matrix
  • logistic regression
    • about / Understanding regression

M

  • M5' algorithm (M5-prime) / Step 5 – improving model performance
  • machine learning
    • origins / The origins of machine learning
    • benefits / Uses and abuses of machine learning
    • ethical considerations / Ethical considerations
    • about / How do machines learn?
    • applying, to data / Steps to apply machine learning to your data
    • R, using / Using R for machine learning
  • machine learning algorithm
    • about / The origins of machine learning, Uses and abuses of machine learning
    • selecting / Choosing a machine learning algorithm
    • data, matching / Matching your data to an appropriate algorithm
  • machine learning algorithms
    • input training data / Thinking about the input data
    • types / Thinking about types of machine learning algorithms
  • Manhattan distance
    • about / Calculating distance
  • MapReduce programming model
    • about / Parallel cloud computing with MapReduce and Hadoop
    • parallel computing / Parallel cloud computing with MapReduce and Hadoop
  • marginal likelihood
    • about / Conditional probability with Bayes' theorem
  • market basket analysis example
    • data, collecting / Step 1 – collecting data
    • data, preparing / Step 2 – exploring and preparing the data
    • data, exploring / Step 2 – exploring and preparing the data
    • sparse matrix, creating for transaction data / Data preparation – creating a sparse matrix for transaction data
    • item support, visualizing / Visualizing item support – item frequency plots
    • transaction data, visualizing / Visualizing transaction data – plotting the sparse matrix
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
    • set of association rules, sorting / Sorting the set of association rules, Taking subsets of association rules
    • association rules, saving to file / Saving association rules to a file or data frame
  • massive matrices
    • using, with bigmemory package / Using massive matrices with bigmemory
  • matrix
    • about / Matrixes and arrays
  • matrix() function / Matrixes and arrays
  • Maximum Margin Hyperplane (MMH)
    • about / Finding the maximum margin
    • case, of linearly separable data / The case of linearly separable data
    • case, of non-linearly separable data / The case of non-linearly separable data
  • mclapply() function / Using a multitasking operating system with multicore
  • mean
    • about / Measuring the central tendency – mean and median
  • mean() function / Measuring the central tendency – mean and median, Ordinary least squares estimation
  • mean absolute error (MAE) / Measuring performance with mean absolute error
  • median
    • about / Measuring the central tendency – mean and median
  • median() function / Measuring the central tendency – mean and median
  • medical expenses, predicting with linear regression
    • about / Example – predicting medical expenses using linear regression
    • data, collecting / Step 1 – collecting data
    • data, preparing / Step 2 – exploring and preparing the data
    • data, exploring / Step 2 – exploring and preparing the data
    • correlation matrix / Exploring relationships among features – the correlation matrix
    • relationships, exploring among features / Exploring relationships among features – the correlation matrix
    • relationships, visualizing among features / Visualizing relationships among features – the scatterplot matrix
    • scatterplot matrix / Visualizing relationships among features – the scatterplot matrix
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance, Transformation – converting a numeric variable to a binary indicator, Putting it all together – an improved regression model
  • meta-learning methods
    • about / Improving model performance with meta-learning
    • used, for improving model performance / Improving model performance with meta-learning
  • Microsoft Excel / Importing and saving data from CSV files
  • Microsoft Excel spreadsheets
    • reading, xlsx package used / Reading and writing Microsoft Excel spreadsheets using xlsx
    • writing, xlsx package used / Reading and writing Microsoft Excel spreadsheets using xlsx
  • Microsoft SQL
    • about / Importing data from SQL databases
  • min-max normalization
    • about / Preparing data for use with kNN
  • Mobile Phone Spam
    • filtering, with naive Bayes algorithm / Example – filtering mobile phone spam with the naive Bayes algorithm
  • Mobile Phone Spam example
    • data, collecting / Step 1 – collecting data
    • data, preparing / Step 2 – exploring and preparing the data
    • data, exploring / Step 2 – exploring and preparing the data
    • text data, processing for analysis / Data preparation – processing text data for analysis
    • training datasets, creating / Data preparation – creating training and test datasets
    • test datasets, creating / Data preparation – creating training and test datasets
    • text data, visualizing / Visualizing text data – word clouds
    • indicator features, creating for frequent words / Data preparation – creating indicator features for frequent words
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • mode
    • about / Measuring the central tendency – the mode
  • mode() function / Measuring the central tendency – the mode
  • model
    • about / Abstraction and knowledge representation
  • model performance
    • improving, with meta-learning / Improving model performance with meta-learning
  • model performance, breast cancer example
    • z-score standardization / Transformation – z-score standardization
    • alternative values of k, testing / Testing alternative values of k
  • model trees
    • about / Understanding regression trees and model trees
  • multicore package
    • about / Using a multitasking operating system with multicore
  • multidimensional feature space / The kNN algorithm
  • multilayer network
    • about / The number of layers
  • Multilayer Perceptron (MLP)
    • about / The direction of information travel
  • multimodal / Measuring the central tendency – the mode
  • multiple linear regression
    • about / Multiple linear regression
    • strengths / Multiple linear regression
    • weaknesses / Multiple linear regression
  • multiple workstations
    • networking, with snow package / Networking multiple workstations with snow and snowfall
    • networking, with snowfall package / Networking multiple workstations with snow and snowfall
  • multitasking operating system
    • using, with multicore package / Using a multitasking operating system with multicore
  • multivariate relationships
    • about / Exploring relationships between variables
  • MySpace / Finding teen market segments using k-means clustering
  • MySQL
    • about / Importing data from SQL databases

N

  • naive Bayes
    • numeric features, using with / Using numeric features with naive Bayes
  • naive Bayes algorithm
    • about / Understanding naive Bayes, The naive Bayes algorithm
    • strengths / The naive Bayes algorithm
    • weaknesses / The naive Bayes algorithm
    • naive Bayes classification / The naive Bayes classification
    • Laplace estimator / The Laplace estimator
    • used, for filtering mobile phone spam / Example – filtering mobile phone spam with the naive Bayes algorithm
  • naive Bayes classification
    • about / The naive Bayes classification
    • naiveBayes() function, using in e1071 package / Step 3 – training a model on the data
  • nearest neighbor classifiers
    • about / Understanding classification using nearest neighbors
  • network package
    • about / Working with social network data and graph data
    • URL, for info / Working with social network data and graph data
  • network topology
    • about / Network topology
    • number of layers / The number of layers
    • direction, of information travel / The direction of information travel
    • number of nodes, in each layer / The number of nodes in each layer
  • neural networks
    • about / Understanding neural networks
    • biological, to artificial neurons / From biological to artificial neurons
    • characteristics / From biological to artificial neurons
    • training, with backpropagation / Training neural networks with backpropagation
  • neurons
    • about / Understanding neural networks
  • No Free Lunch theorem
    • about / Choosing a machine learning algorithm
  • nominal variables
    • about / Factors
  • non-linearly separable data / The case of non-linearly separable data
  • non-linear spaces
    • kernels, using for / Using kernels for non-linear spaces
  • normal distributions / Understanding numeric data – uniform and normal distributions
  • normalize() function / Transformation – normalizing numeric data
  • numeric data
    • about / Understanding numeric data – uniform and normal distributions
    • normalizing / Transformation – normalizing numeric data
  • numeric features
    • using, with naive Bayes / Using numeric features with naive Bayes
  • numeric prediction
    • about / Thinking about types of machine learning algorithms
  • numeric variables
    • about / Exploring numeric variables
    • exploring / Exploring numeric variables
    • central tendency, measuring / Measuring the central tendency – mean and median
    • spread, measuring / Measuring spread – quartiles and the five-number summary
    • visualizing / Visualizing numeric variables – boxplots, Visualizing numeric variables – histograms

O

  • <- operator / Vectors
  • OCR, performing with SVMs
    • about / Performing OCR with SVMs
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • ODBC
    • about / Importing data from SQL databases
  • odbcConnect() function / Importing data from SQL databases
  • one-way table / Exploring categorical variables
  • One Rule algorithm
    • about / The One Rule algorithm
    • strengths / The One Rule algorithm
    • weaknesses / The One Rule algorithm
  • optimized learning algorithms
    • deploying / Deploying optimized learning algorithms
  • optimized learning algorithms deployment
    • regression models, building with biglm package / Building bigger regression models with biglm
    • random forests, building with bigrf package / Growing bigger and faster random forests with bigrf
    • caret package, used for evaluating models in parallel / Training and evaluating models in parallel with caret
  • Oracle
    • about / Importing data from SQL databases
  • order() function / Data preparation – creating random training and test datasets
  • ordinary least squares (OLS) / Ordinary least squares estimation
  • ordinary least squares estimation
    • about / Ordinary least squares estimation
  • out-of-bag error rate
    • about / Training random forests
  • Output Node / The number of layers
  • overfitting
    • about / Assessing the success of learning

P

  • pairs() function / Visualizing relationships among features – the scatterplot matrix
  • parallel computing methods
    • about / Learning faster with parallel computing
  • parameter estimates
    • about / Simple linear regression
  • parameter tuning
    • about / Tuning stock models for better performance
  • pattern discovery
    • about / Thinking about types of machine learning algorithms
  • Pearson's Chi-squared test / Examining relationships – two-way cross-tabulations
  • Pearson's correlation
    • about / Correlations
  • performance
    • measuring, confusion matrices used / Using confusion matrices to measure performance
    • improving, of R / Improving the performance of R
  • performance() function / ROC curves
  • performance measures
    • about / Beyond accuracy – other measures of performance
    • kappa statistic / The kappa statistic
    • sensitivity / Sensitivity and specificity
    • specificity / Sensitivity and specificity
    • precision / Precision and recall
    • recall / Precision and recall
    • F-measure / The F-measure
  • performance tradeoffs
    • visualizing / Visualizing performance tradeoffs
  • plot() command / ROC curves
  • plot() function / Visualizing relationships – scatterplots
  • point-and-click interface
    • used, for installing R package / Installing a package using the point-and-click interface
  • poisonous mushrooms
    • identifying, with rule learners / Example – identifying poisonous mushrooms with rule learners
  • poisonous mushrooms example, with rule learners
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • Poisson regression
    • about / Understanding regression
  • polynomial kernel / Using kernels for non-linear spaces
  • posPredValue() function / Precision and recall
  • posterior probability
    • about / Conditional probability with Bayes' theorem
  • PostgreSQL
    • about / Importing data from SQL databases
  • postpruning
    • about / Pruning the decision tree
  • precision
    • about / Precision and recall
  • pred function / Bagging
  • predict() function / Working with classification prediction data in R, Creating a simple tuned model
    • about / Bagging
  • predictive model
    • about / Thinking about types of machine learning algorithms
  • prepruning
    • about / Pruning the decision tree
  • prior probability
    • about / Conditional probability with Bayes' theorem
  • probability
    • about / Probability

Q

  • quadratic optimization / The case of linearly separable data
  • quantile() function / Measuring spread – quartiles and the five-number summary
  • quartiles
    • about / Measuring spread – quartiles and the five-number summary

R

  • 68-95-99.7 rule / Measuring spread – variance and standard deviation
  • R
    • using, for machine learning / Using R for machine learning
    • data structures / R data structures
    • used, for managing data / Managing data with R
    • CSV file, loading into / Importing and saving data from CSV files
    • working with classification prediction data / Working with classification prediction data in R
    • JSON, converting to / Reading and writing JSON with the rjson package
    • performance, improving / Improving the performance of R
  • Radial Basis Function (RBF) network
    • about / Activation functions
  • randomForest() function / Evaluating random forest performance
    • about / Training random forests
  • randomForest package / Training random forests
  • random forests
    • about / Random forests
    • strengths / Random forests
    • weaknesses / Random forests
    • training / Training random forests
    • performance, evaluating / Evaluating random forest performance
  • range
    • about / Measuring spread – quartiles and the five-number summary
  • range() function / Measuring spread – quartiles and the five-number summary
  • RCurl package
    • about / Getting data from the Web with the RCurl package
    • used, for obtaining data from web / Getting data from the Web with the RCurl package
    • URL, for documentation / Getting data from the Web with the RCurl package
  • real-world data
    • about / Working with specialized data
  • recall / Precision and recall
  • recurrent network
    • about / The direction of information travel
  • recursive partitioning
    • about / Divide and conquer
  • reg() function / Multiple linear regression
  • regression
    • about / Understanding regression
    • simple linear regression / Simple linear regression
    • ordinary least squares estimation / Ordinary least squares estimation
    • correlation / Correlations
    • multiple linear regression / Multiple linear regression
    • adding, to trees / Adding regression to trees
  • regression analysis
    • use cases / Understanding regression
  • regression equations
    • about / Understanding regression
  • regression models
    • building, with biglm package / Building bigger regression models with biglm
  • regression trees
    • about / Understanding regression trees and model trees
    • strengths / Adding regression to trees
    • weaknesses / Adding regression to trees
  • relationships
    • exploring, between variables / Exploring relationships between variables
    • visualizing / Visualizing relationships – scatterplots
    • examining / Examining relationships – two-way cross-tabulations
  • residuals
    • about / Ordinary least squares estimation
  • resubstitution error / Estimating future performance
  • RHIPE package / Parallel cloud computing with MapReduce and Hadoop
  • right hand side (RHS) / Step 4 – evaluating model performance
  • RIPPER algorithm
    • about / The RIPPER algorithm
    • strengths / The RIPPER algorithm
    • weaknesses / The RIPPER algorithm
  • risky bank loans
    • identifying, C5.0 decision trees used / Example – identifying risky bank loans using C5.0 decision trees, Step 1 – collecting data
  • rjson package
    • about / Reading and writing JSON with the rjson package
    • used, for reading JSON / Reading and writing JSON with the rjson package
    • used, for writing JSON / Reading and writing JSON with the rjson package
  • rmr package
    • about / Parallel cloud computing with MapReduce and Hadoop
  • ROC curve
    • about / ROC curves
    • creating / ROC curves
  • ROCR package
    • about / Visualizing performance tradeoffs
  • RODBC package
    • about / Importing data from SQL databases
  • rote learning
    • about / Why is the kNN algorithm lazy?
  • round() function / Exploring categorical variables
  • R package
    • installing / Installing an R package
    • installing, point-and-click interface used / Installing a package using the point-and-click interface
    • loading / Loading an R package
  • R performance
    • large datasets, managing / Managing very large datasets
    • learning, with parallel computing / Learning faster with parallel computing
    • GPU computing / GPU computing
    • optimized learning algorithms, deploying / Deploying optimized learning algorithms
  • rudimentary ANNs / Understanding neural networks
  • runif() function / Data preparation – creating random training and test datasets
  • RWeka package / The C5.0 decision tree algorithm
    • using / Installing and loading R packages
    • loading / Loading an R package

S

  • save() function / Saving and loading R data structures
  • scale() function / Transformation – z-score standardization
  • scatterplot
    • about / Visualizing relationships – scatterplots
  • Scoville scale
    • about / Preparing data for use with kNN
  • sd() function / Measuring spread – variance and standard deviation, Correlations
  • semi-supervised learning
    • about / Clustering as a machine learning task
  • sensitivity() function / Precision and recall
  • sensor / The origins of machine learning
  • separate-and-conquer
    • about / Separate and conquer
  • seq() function / Measuring spread – quartiles and the five-number summary
  • Short Message Service (SMS) / Example – filtering mobile phone spam with the naive Bayes algorithm
  • sigmoid activation function
    • about / Activation functions
  • sigmoid kernel / Using kernels for non-linear spaces
  • simple linear regression
    • about / Simple linear regression
  • simple tuned model
    • creating / Creating a simple tuned model
  • single-layer network
    • about / The number of layers
  • skew / Visualizing numeric variables – histograms
  • slack variable / The case of non-linearly separable data
  • slope
    • about / Understanding regression
  • sna package
    • URL, for info / Working with social network data and graph data
  • snowfall package
    • multiple workstations, networking / Networking multiple workstations with snow and snowfall
  • snow package
    • about / Networking multiple workstations with snow and snowfall
    • multiple workstations, networking / Networking multiple workstations with snow and snowfall
  • social network data
    • working with / Working with social network data and graph data
  • Social Networking Service (SNS) / Finding teen market segments using k-means clustering
  • sparse matrix
    • about / Data preparation – processing text data for analysis, Data preparation – creating a sparse matrix for transaction data
    • creating, for transaction data / Data preparation – creating a sparse matrix for transaction data
  • specialized data
    • working with / Working with specialized data
  • SQL databases
    • data, importing from / Importing data from SQL databases
  • SQLite
    • about / Importing data from SQL databases
  • sqlQuery() function / Importing data from SQL databases
  • stacking
    • about / Understanding ensembles
  • standard deviation
    • about / Measuring spread – variance and standard deviation
  • standard deviation reduction (SDR) / Adding regression to trees
  • stock models
    • tuning, for better performance / Tuning stock models for better performance
  • stop words
    • about / Data preparation – processing text data for analysis
  • str() function / Step 2 – exploring and preparing the data
    • about / Exploring the structure of data
  • stringsAsFactors option / Data frames
  • subset() function / Working with classification prediction data in R
  • summary() function / Exploring numeric variables
  • summary statistics
    • about / Exploring numeric variables
  • Sum of Squared Errors (SSE) / Step 3 – training a model on the data
  • supervised learning
    • about / Thinking about types of machine learning algorithms
  • support vector machine (SVM)
    • about / Bagging
  • Support Vector Machine (SVM)
    • about / Understanding Support Vector Machines
    • applications / Understanding Support Vector Machines
    • classifications, with hyperplanes / Classification with hyperplanes
    • maximum margin, finding / Finding the maximum margin
    • OCR, performing with / Performing OCR with SVMs
  • support vectors / Finding the maximum margin
  • synapse
    • about / From biological to artificial neurons

T

  • Tab-Separated Value (TSV)
    • about / Importing and saving data from CSV files
  • table() function / Exploring categorical variables, Using confusion matrices to measure performance
  • target feature
    • about / Thinking about types of machine learning algorithms
  • teen market segments search, with k-means clustering
    • about / Finding teen market segments using k-means clustering
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data, Data preparation – dummy coding missing values, Data preparation – imputing missing values
    • model, training on data / Step 3 – training a model on the data
    • model performance, evaluating / Step 4 – evaluating model performance
    • model performance, improving / Step 5 – improving model performance
  • threshold activation function
    • about / Activation functions
  • tm package / Data preparation – processing text data for analysis
  • token
    • about / Data preparation – processing text data for analysis
  • tokenization
    • about / Data preparation – processing text data for analysis
  • topology
    • about / Network topology
  • train() function / Using caret for automated parameter tuning, Creating a simple tuned model
  • trainControl() function / Customizing the tuning process
  • training
    • about / Abstraction and knowledge representation
  • transaction data
    • sparse matrix, creating for / Data preparation – creating a sparse matrix for transaction data
  • transpose
    • about / Multiple linear regression
  • trees
    • regression, adding to / Adding regression to trees
  • tree structure
    • about / Understanding decision trees
  • trial
    • about / Basic concepts of Bayesian methods
  • trivial rules / Step 4 – evaluating model performance
  • tuning process
    • customizing / Customizing the tuning process
  • Turing test
    • about / Understanding neural networks
  • two-way cross-tabulation
    • about / Examining relationships – two-way cross-tabulations

U

  • UCI Machine Learning Data Repository
    • URL / Step 1 – collecting data, Step 1 – collecting data
    • about / Step 1 – collecting data
  • uniform distribution / Understanding numeric data – uniform and normal distributions
  • unimodal / Measuring the central tendency – the mode
  • unit of observation phrase / Thinking about the input data
  • unit step activation function
    • about / Activation functions
  • univariate statistics
    • about / Exploring relationships between variables
  • universal function approximator
    • about / The number of nodes in each layer
  • unsupervised classification
    • about / Clustering as a machine learning task
  • unsupervised learning
    • about / Thinking about types of machine learning algorithms
  • usedcars.csv dataset
    • about / Exploring and understanding data

V

  • var() function / Measuring spread – variance and standard deviation, Ordinary least squares estimation
  • variables
    • relationships, exploring between / Exploring relationships between variables
  • variance
    • about / Measuring spread – variance and standard deviation
  • vector
    • about / R data structures, Vectors
  • vector types
    • integer / Vectors
    • numeric / Vectors
    • character / Vectors
    • logical / Vectors
  • Venn diagram
    • about / Joint probability
  • Voronoi diagram / Using distance to assign and update clusters

W

  • web
    • data, obtaining from / Getting data from the Web with the RCurl package
  • weighted voting process
    • about / Choosing an appropriate k
  • wine quality estimation, with regression trees
    • about / Example – estimating the quality of wines with regression trees and model trees
    • data, collecting / Step 1 – collecting data
    • data, exploring / Step 2 – exploring and preparing the data
    • data, preparing / Step 2 – exploring and preparing the data
    • model, training on data / Step 3 – training a model on the data
    • decision trees, visualizing / Visualizing decision trees
    • model performance, evaluating / Step 4 – evaluating model performance
    • performance, measuring with mean absolute error / Measuring performance with mean absolute error
    • model performance, improving / Step 5 – improving model performance
  • word cloud
    • about / Visualizing text data – word clouds

X

  • xlsx package
    • about / Reading and writing Microsoft Excel spreadsheets using xlsx
    • used, for reading Microsoft Excel spreadsheets / Reading and writing Microsoft Excel spreadsheets using xlsx
    • used, for writing Microsoft Excel spreadsheets / Reading and writing Microsoft Excel spreadsheets using xlsx
    • URL / Reading and writing Microsoft Excel spreadsheets using xlsx
  • XML
    • about / Reading and writing XML with the XML package
    • reading, XML package used / Reading and writing XML with the XML package
    • writing, XML package used / Reading and writing XML with the XML package
  • XML package
    • about / Reading and writing XML with the XML package
    • used, for reading XML / Reading and writing XML with the XML package
    • used, for writing XML / Reading and writing XML with the XML package
    • URL, for info / Reading and writing XML with the XML package

Z

  • z-score standardization
    • about / Preparing data for use with kNN
  • ZeroR
    • about / The One Rule algorithm