Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Mastering Machine Learning with R, Second Edition
Mastering Machine Learning with R, Second Edition

Mastering Machine Learning with R, Second Edition: Advanced prediction, algorithms, and learning methods with R 3.x , Second Edition

eBook
£22.99 £32.99
Paperback
£41.99
Subscription
Free Trial
Renews at £16.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. £16.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Mastering Machine Learning with R, Second Edition

A Process for Success

"If you don't know where you are going, any road will get you there."
                                                                                                             - Robert Carrol
"If you can't describe what you are doing as a process, you don't know what you're doing."
                                                                                                             - W. Edwards Deming

At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with machine learning (specifically, its implementation and making change happen). The smartest people, best software, and best algorithms do not guarantee success, no matter how well it is defined.

In most, if not all, projects, the key to successfully solving problems or improving decision-making is not the algorithm, but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is around these skills. It is probably safe to say that many of us ended up in this position because of a desire to avoid it. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. Therefore, the goal of this chapter is to set you up for success. The intent is to provide a process, a flexible process no less, where you can become a change agent: a person who can influence and turn their insights into action without positional power. We will focus on Cross-Industry Standard Process for Data Mining (CRISP-DM). It is probably the most well-known and respected of all processes for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.

I will not hesitate to say that this all is easier said than done; without question, I'm guilty of every sin (both commission and omission) that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last 12 years.

Finally, we will also have a look at a flow chart (a cheat sheet) that you can use to help you identify what methodologies to apply to the problem at hand.

The process

The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough to be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the numerous lists of tasks as you can apply your judgment to the process and adapt it for any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops that make it so flexible:

Figure 1: CRISP-DM 1.0, Step-by-step data mining guide

The process has the following six phases:

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.

I will discuss each of the steps in the process, covering the important tasks. However, it will not be in as detailed as the guide, but more high-level. We will not skip any of the critical details but focus more on the techniques that one can apply to the tasks. Keep in mind that these process steps will be used in later chapters as a framework in the actual application of the machine-learning methods in general and the R code, in particular.

Business understanding

One cannot underestimate how important this first step in the process is in achieving success. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:

  1. Identifying the business objective.
  2. Assessing the situation.
  3. Determining analytical goals.
  4. Producing a project plan.

Identifying the business objective

The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, "What are we going to do different?" This may seem like a benign question, but it can really challenge people to work out what they need from an analytical perspective and it can get to the root of the decision that needs to be made. It can also prevent you from going out and doing a lot of unnecessary work on some kind of "fishing expedition." As such, the key for you is to identify the decision. A working definition of a decision can be put forward to the team as the irrevocable choice to commit or not commit the resources. Additionally, remember that the choice to do nothing different is indeed a decision.

This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not, or cannot be, well defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known-unknowns. Indeed, there will probably be many times when the problem is ill defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, unknown-unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one could go forward with an understanding of what will happen next in terms of resource commitment based on the various outcomes from hypothesis exploration.

Another thing to consider in this task is the management of expectations. There is no such thing as perfect data, no matter what its depth and breadth are. This is not the time to make guarantees but to communicate what is possible, given your expertise.

I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that, in the absence of clear direction or guidance, the mission statement, or whatever you want to call it, becomes the unifying statement for all stakeholders and can help prevent scope creep. It consists of the following points:

  • Who: This is yourself or the team or project name; everyone likes a cool project name, for example, Project Viper, Project Fusion, and so on
  • What: This is the task that you will perform, for example, conducting machine learning
  • When: This is the deadline
  • Where: This could be geographical, by function, department, initiative, and so on
  • Why: This is the purpose behind implementing the project, that is, the business goal

The second task is to have as clear a definition of success as possible. Literally, ask "What does success look like?" Help the team/sponsor paint a picture of success that you can understand. Your job then is to translate this into modeling requirements.

Assessing the situation

This task helps you in project planning by gathering information on the resources available, constraints, and assumptions; identifying the risks; and building contingency plans. I would further add that this is also the time to identify the key stakeholders that will be impacted by the decision(s) to be made.

A couple of points here. When examining the resources that are available, do not neglect to scour the records of past and current projects. Odds are someone in the organization has worked, or is working on the same problem and it may be essential to synchronize your work with theirs. Don't forget to enumerate the risks considering time, people, and money. Do everything in your power to create a list of stakeholders, both those that impact your project and those that could be impacted by your project. Identify who these people are and how they can influence/be impacted by the decision. Once this is done, work with the project sponsor to formulate a communication plan with these stakeholders.

Determining the analytical goals

Here, you are looking to translate the business goal into technical requirements. This includes turning the success criterion from the task of creating a business objective to technical success. This might be things such as RMSE or a level of predictive accuracy.

Producing a project plan

The task here is to build an effective project plan with all the information gathered up to this point. Regardless of what technique you use, whether it be a Gantt chart or some other graphic, produce it and make it a part of your communication plan. Make this plan widely available to the stakeholders and update it on a regular basis and as circumstances dictate.

Data understanding

After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:

  1. Collecting the data.
  2. Describing the data.
  3. Exploring the data.
  4. Verifying the data quality.

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data may be missing. This may drive the learning method that you use and/or determine whether the imputation of the missing data is necessary and feasible.

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. It is likely that you may stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is critical in time series where often business rules on how the data is classified change over time. Finally, it is a good idea to begin documenting any code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.

Data preparation

Almost there! This step has the following five tasks:

  1. Selecting the data.
  2. Cleaning the data.
  3. Constructing the data.
  4. Integrating the data.
  5. Formatting the data.

These tasks are relatively self-explanatory. The goal is to get the data ready to input in the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, then it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is Yes/No, it may not work in some packages and will require a transformed or no variable with 1/0. At this point, you should also break your data into the various test sets if applicable: train, test, or validate. This step can be an unmitigated burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the payoff, where you earn your money.

Modeling

This is where all the work that you've done up to this point can lead to fist-pumping exuberance or fist-pounding exasperation. But hey, if it was that easy, everyone would be doing it. The tasks are as follows:

  1. Selecting a modeling technique.
  2. Generating a test design.
  3. Building a model.
  4. Assessing a model.

Oddly, this process step includes the considerations that you have already thought of and prepared for. In the first step, you will need at least some idea about how you will be modeling. Remember that this is a flexible, iterative process and not some strict linear flowchart such as an aircrew checklist.

The cheat sheet included in this chapter should help guide you in the right direction for the modeling techniques. Test design refers to the creation of your test and train datasets and/or the use of cross-validation and this should have been thought of and accounted for in the data preparation.

Model assessment involves comparing the models with the criteria/criterion that you developed in the business understanding, for example, RMSE, Lift, ROC, and so on.

Evaluation

With the evaluation process, the main goal is to confirm that the model selected at this point meets the business objective. Ask yourself and others, "Have we achieved our definition of success?". Let the Netflix prize serve as a cautionary tale here. I'm sure you are aware that Netflix awarded a $1-million prize to the team that could produce the best recommendation algorithm as defined by the lowest RMSE. However, Netflix did not implement it because the incremental accuracy gained was not worth the engineering effort! Always apply Occam's razor. At any rate, here are the tasks:

  1. Evaluating the results.
  2. Reviewing the process.
  3. Determining the next steps.

In reviewing the process, it may be necessary, as you no doubt determined earlier in the process, to take the results through governance and communicate with the other stakeholders in order to gain their buy-in. As for the next steps, if you want to be a change agent, make sure that you answer the what, so what, and now what in the stakeholders' minds. If you can tie their now what into the decision that you made earlier, you have earned your money.

Deployment

If everything is done according to the plan up to this point, it might just come down to flipping a switch and your model goes live. Assuming that this is not the case, here are the tasks for this step:

  1. Deploying the plan.
  2. Monitoring and maintaining the plan.
  3. Producing the final report.
  4. Reviewing the project.

After the deployment and monitoring/maintenance and underway, it is crucial for you and those who will walk in your steps to produce a well-written final report. This report should include a white paper and briefing slide. I have to say that I resisted the drive to put my findings in a white paper as I was an indentured servant to the military's passion for PowerPoint slides. However, slides can and will be used against you, cherry-picked or misrepresented by various parties for their benefit. Trust me, that just doesn't happen with a white paper as it becomes an extension of your findings and beliefs. Use PowerPoint to brief stakeholders, but use that the white paper as the document of record and as a preread, should your organization insist on one. It is my standard procedure to create this white paper in R using knitr and LaTex.

Now for the all-important process review, you may have your own proprietary way of conducting it; but here is what it should cover, whether you conduct it in a formal or informal way:

  • What was the plan?
  • What actually happened?
  • Why did it happen or not happen?
  • What should be sustained in future projects?
  • What should be improved upon in future projects?
  • Create an action plan to ensure sustainment and improvement happen

That concludes the review of the CRISP-DM process, which provides a comprehensive and flexible framework to guarantee the success of your project and make you an agent of change.

Algorithm flowchart

The purpose of this section is to create a tool that will help you not just select possible modeling techniques but also think deeper about the problem. The residual benefit is that it may help you frame the problem with the project sponsor/team. The techniques in the flowchart are certainly not comprehensive but are exhaustive enough to get you started. It also includes techniques not discussed in this book.

The following figure starts the flow of selecting the potential modeling techniques. As you answer the question(s), it will take you to one of the four additional charts:

Figure 2

If the data is text or in the time series format, then you will follow the flow in the following figure:

Figure 3

In this branch of the algorithm, you do not have text or time series data. You also do not want to predict a category, so you are looking to make recommendations, understand associations, or predict a quantity:

Figure 4

To get to this section, you will have data that is not text or time series. You want to categorize the data, but it does not have an outcome label, which brings us to clustering methods, as follows:

Figure 5

This brings us to the situation where we want to categorize the data and it is labeled, that is, classification:

Figure 6

Summary

This chapter was about how to set up you and your team for success in any project that you tackle. The CRISP-DM process is put forward as a flexible and comprehensive framework in order to facilitate the softer skills of communication and influence. Each step of the process and the tasks in each step were enumerated. More than that, the commentary provides some techniques and considerations to with the process execution. By taking heed of the process, you can indeed become an agent of positive change to any organization.

The other item put forth in this chapter was an algorithm flowchart; a cheat sheet to help identify some of the proper techniques to apply in order to solve the business problem. With this foundation in place, we can now move on to applying these techniques to real-world problems.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Understand and apply machine learning methods using an extensive set of R packages such as XGBOOST
  • Understand the benefits and potential pitfalls of using machine learning methods such as Multi-Class Classification and Unsupervised Learning
  • Implement advanced concepts in machine learning with this example-rich guide

Description

This book will teach you advanced techniques in machine learning with the latest code in R 3.3.2. You will delve into statistical learning theory and supervised learning; design efficient algorithms; learn about creating Recommendation Engines; use multi-class classification and deep learning; and more. You will explore, in depth, topics such as data mining, classification, clustering, regression, predictive modeling, anomaly detection, boosted trees with XGBOOST, and more. More than just knowing the outcome, you’ll understand how these concepts work and what they do. With a slow learning curve on topics such as neural networks, you will explore deep learning, and more. By the end of this book, you will be able to perform machine learning with R in the cloud using AWS in various scenarios with different datasets.

Who is this book for?

This book is for data science professionals, data analysts, or anyone with a working knowledge of machine learning, with R who now want to take their skills to the next level and become an expert in the field.

What you will learn

  • Gain deep insights into the application of machine learning tools in the industry
  • Manipulate data in R efficiently to prepare it for analysis
  • Master the skill of recognizing techniques for effective visualization of data
  • Understand why and how to create test and training data sets for analysis
  • Master fundamental learning methods such as linear and logistic regression
  • Comprehend advanced learning methods such as support vector machines
  • Learn how to use R in a cloud service such as Amazon

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Apr 24, 2017
Length: 420 pages
Edition : 2nd
Language : English
ISBN-13 : 9781787287471
Category :
Languages :

What do you get with a Packt Subscription?

Free for first 7 days. £16.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Apr 24, 2017
Length: 420 pages
Edition : 2nd
Language : English
ISBN-13 : 9781787287471
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
£16.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
£169.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just £5 each
Feature tick icon Exclusive print discounts
£234.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just £5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total £ 151.97
R: Predictive Analysis
£67.99
Mastering Machine Learning with R, Second Edition
£41.99
Machine Learning with R Cookbook, Second Edition
£41.99
Total £ 151.97 Stars icon
Banner background image

Table of Contents

16 Chapters
A Process for Success Chevron down icon Chevron up icon
Linear Regression - The Blocking and Tackling of Machine Learning Chevron down icon Chevron up icon
Logistic Regression and Discriminant Analysis Chevron down icon Chevron up icon
Advanced Feature Selection in Linear Models Chevron down icon Chevron up icon
More Classification Techniques - K-Nearest Neighbors and Support Vector Machines Chevron down icon Chevron up icon
Classification and Regression Trees Chevron down icon Chevron up icon
Neural Networks and Deep Learning Chevron down icon Chevron up icon
Cluster Analysis Chevron down icon Chevron up icon
Principal Components Analysis Chevron down icon Chevron up icon
Market Basket Analysis, Recommendation Engines, and Sequential Analysis Chevron down icon Chevron up icon
Creating Ensembles and Multiclass Classification Chevron down icon Chevron up icon
Time Series and Causality Chevron down icon Chevron up icon
Text Mining Chevron down icon Chevron up icon
R on the Cloud Chevron down icon Chevron up icon
R Fundamentals Chevron down icon Chevron up icon
Sources Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.8
(4 Ratings)
5 star 0%
4 star 50%
3 star 0%
2 star 25%
1 star 25%
Bluebird Sep 03, 2017
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
Yes the book is worth reading. The only con is black and white pic.. Which is not too bad.apart, the book is not for starters but for the people who what deep understanding of ML. It do not contain a to z but what ever it cover it is good
Amazon Verified review Amazon
Nick P Jan 31, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
Highly recommend it to any student taking a finance or statics class.
Amazon Verified review Amazon
Amazon Customer May 24, 2017
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
The book was not up to the expectations. They used very cheap paper printing is not good and also some pages are not visible to read.Content wise is also not no real time data sets used all they used toy data sets no clear explanation also as the name says "Mastering" but its notI bought "Machine learning with r from Packt" publications.That was very good.setting the expectaions from packt i bought but it is not worth.
Amazon Verified review Amazon
Jose Luis Oct 05, 2017
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
Sometimes, the code in the book doesn't work (R shows error messages and stop running the code) because the data doesn't meet the function requirements.Charts are duplicated / missed what makes not possible to follow the examples properly.This book can be used to get notions about the process but not not master ML with R unless you have R and stats knowledge that enables you to rewrite some code.I have been sending the erratas to the publisher without any respond so far.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.