Principles of Strategic Data Science: Creating value from data, big and small

Product type: Paperback
Published: Jun 2019
ISBN-13: 9781838985295
Length: 104 pages
Edition: 1st Edition
Author: Peter Prevos

The Elements of Data Science

Now that we have defined data science within the context of managing a business, we can start describing the elements of data science. The best way to unpack the art and craft of data science is through Drew Conway's often-cited Venn diagram, as shown in Figure 1.3 (Conway, D. (2010). The data science Venn diagram (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Downloaded 27 January 2019).

Conway defines three competencies that a data scientist, or a data science team as a collective, needs to possess. The diagram positions data science as an interdisciplinary activity with three dimensions: domain knowledge, mathematics, and computer science. A data scientist is somebody who understands the subject matter under consideration in mathematical terms and writes computer code to solve problems.

Figure 1.3: Conway’s data science Venn diagram

Domain Knowledge

The most significant skill within a data science function is domain knowledge. While the results of advanced applied mathematics such as machine learning are impressive, without understanding the reality that these models describe, they are devoid of meaning and can cause more harm than good. Anyone analyzing a problem needs to understand the context of the issues and the potential solutions. The subject of data science is not the data itself, but the reality this data describes. Data science is about things and people in the real world, not about numbers and algorithms.

A domain expert understands the impact of any confounding variables on the outcomes. An experienced subject-matter expert can quickly perform a sanity check on the process and results of the analysis. Domain knowledge is essential because each area of expertise uses a different paradigm to understand the world.
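
As a hedged illustration of why confounding variables matter, the following Python sketch uses invented counts (they are not data from this book) for two hypothetical treatments, A and B, applied to easy and hard cases. The names outcomes, rate, and overall_rate are assumptions made for this example only. Case difficulty plays the role of the confounding variable that a domain expert would know to check.

```python
# Invented counts for illustration only: (successes, attempts) per treatment
# and per case difficulty. Difficulty is the confounding variable.
outcomes = {
    "A": {"easy": (90, 100), "hard": (600, 1000)},
    "B": {"easy": (800, 1000), "hard": (50, 100)},
}


def rate(successes, attempts):
    """Success rate as a fraction between 0 and 1."""
    return successes / attempts


def overall_rate(groups):
    """Pooled success rate that ignores the confounding variable."""
    successes = sum(s for s, _ in groups.values())
    attempts = sum(n for _, n in groups.values())
    return successes / attempts


for treatment, groups in outcomes.items():
    per_group = {name: round(rate(*counts), 2) for name, counts in groups.items()}
    print(treatment, per_group, "overall:", round(overall_rate(groups), 2))
```

In these made-up numbers, treatment A performs better within both the easy and the hard cases, yet treatment B looks better when the groups are pooled, because B was mostly applied to easy cases. An analyst without subject-matter knowledge could easily draw the wrong conclusion from the pooled figures.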

Each domain of human enquiry or activity has different methodologies to collect and analyze data. Analyzing objective engineering data follows a different approach from analyzing subjective data about people or unstructured data in a corpus of text. The analyst needs to be familiar with the tools of the trade within the problem domain. The example of a graduate professional beating a team of machine learning experts with a linear regression shows the importance of domain knowledge.

Domain expertise can also become a source of bias and prevent innovative ways of looking at information. Solutions developed through systematic research can contradict long-held beliefs about a specific topic that are sometimes hard to shift. Implementing data science is thus as much a cultural process as it is a scientific one, which is the topic of Chapter 4, The Data-Driven Organization.

Mathematical Knowledge

The analyst uses mathematical skills to convert data into actionable insights. Mathematics consists of pure mathematics as a science, and applied mathematics that helps us to solve problems. The scope of applied mathematics is broad, and data science is opportunistic in choosing the most suitable method. Various types of regression models, graph theory, k-means clustering, decision trees, and so on, are some of the favorite tools of a data scientist. The creative application of complex applied mathematics is one of the two distinguishing factors between traditional business analysis and data science.
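
To make one of these tools concrete, here is a minimal sketch of a simple linear regression fitted by ordinary least squares, written in plain Python. The function name fit_line and the sample values are assumptions chosen for illustration; a real analysis would more likely rely on an established statistics library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predicted y for a new observation x = 6
print(slope, intercept, prediction)
```

Because the sample points lie exactly on a straight line, the fitted model recovers a slope of 2 and an intercept of 1. Real data is noisier, which is where judgment about the suitability of the method comes in.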

Combining subject-matter expertise with mathematical skills is the domain of traditional research and analysis. The notion of conventional research is, however, evolving toward the principles of data science through the use of reproducible computer code and the sharing of source data on websites such as FigShare (https://figshare.com/).

Numbers are the foundations of mathematics, and the craft of quantitative science is to describe our analogue reality in a model that we can manipulate to predict the future. Not all mathematical skills are about numbers, however; they can also revolve around logical relationships between words and concepts. Contemporary numerical methods help us to understand relationships between people, the logical structure of a text, and many other aspects beyond the realm of traditional numeric analysis.

Computer Science

Not that long ago, most of the information collected by an organization was stored on paper and archived in copious volumes of lever arch files. Analyzing this information was an arduous task that involved many hours of transcribing information into a format that was useful for analysis.

In the twenty-first century, almost all data is an electronic resource. To create value from this resource, data engineers extract it from a database, combine it with other sources, and clean the data before analysts can make sense of it. This requirement implies that a data scientist needs to have computing skills. Conway uses the term hacking skills, which many people interpret as negative. Conway is, however, not referring to a hacker in the sense of somebody who nefariously uses computers, but in the original meaning of the word as a developer with creative computing skills. The core competency of a hacker, developer, coder, or whatever other term might be preferable, is algorithmic thinking and understanding the logic of data structures. These competencies are vital in extracting and cleaning data to prepare it for the next step of the data science process.
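
The extract, combine, and clean steps described above can be sketched in a few lines of Python using the standard sqlite3 module. The table names, column names, and values are hypothetical and stand in for whatever systems an organization actually uses; this is an illustration under those assumptions, not a prescribed workflow.

```python
import sqlite3

# Hypothetical in-memory database standing in for an organizational data store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE readings (customer_id INTEGER, litres TEXT);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO readings VALUES (1, '120.5'), (1, ' 98 '), (2, 'n/a'), (2, '210');
""")

# Extract and combine: join the two sources on the customer identifier.
rows = conn.execute("""
    SELECT c.name, r.litres
    FROM readings AS r
    JOIN customers AS c ON c.id = r.customer_id
    ORDER BY r.rowid
""").fetchall()
conn.close()


def to_float(text):
    """Return the reading as a float, or None when it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


# Clean: convert readings to numbers and drop entries that cannot be parsed.
parsed = [(name, to_float(litres)) for name, litres in rows]
clean = [(name, value) for name, value in parsed if value is not None]
print(clean)
```

The unparseable reading is discarded and the remaining values are converted to numbers, which leaves a small, tidy data set ready for analysis. Writing these steps as code also documents exactly how the raw data was transformed.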

The importance of hacking skills for a data scientist implies that we should move away from point-and-click systems and spreadsheets and instead write code in a suitable programming language. The flexibility and power of a programming language far exceed the capabilities of graphical user interfaces and lead to reproducible analysis, as discussed in Chapter 2, Good Data Science.

The mathematical interpretation of reality needs to be translated into computer code. One of the factors that spearheaded data science into popularity is that the available toolkit has grown substantially in the past ten years. Open source computing languages such as R and Python can implement complex algorithms that were previously the domain of specialized software and supercomputers. Open source software has accelerated innovation in how we analyze data and has placed complex machine learning within reach of anyone who is willing to try to learn the skills.

Conway defines the danger zone as the area where domain knowledge and computing skills combine, without a good grounding in mathematics. Somebody might have enough computing skills to be pushing buttons on a business intelligence platform or spreadsheet. The user-friendliness of some analysis platforms can be detrimental to the outcomes of the analysis because they create the illusion of accuracy. Point-and-click analysis hides the inner workings from the user, creating a black-box result. Although the data might be perfectly structured, valid and reliable, a wrongly applied analytical method leads to useless outcomes.
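
A small, hedged example shows how a wrongly applied method can mislead even when the data is clean. The sketch below computes a Pearson correlation coefficient, which measures only linear association, on illustrative data where y depends entirely on x through a quadratic relationship. The function name pearson and the sample values are assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, a measure of linear association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Illustrative data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
print(r)
```

The coefficient comes out as zero, which a button-pushing analyst might read as "no relationship", even though y is fully determined by x. Knowing what a method assumes is what separates a valid analysis from a black-box result.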

The Unicorn Data Scientist?

Conway's diagram is often cited in the literature on data science. His simple model helped to define the craft of data science. Other data scientists have proposed more complex models, but they all originate with Conway's basic idea.

The diagram illustrates that the difference between data science and traditional research or business analytics lies in the ability to understand and write code. A data scientist understands the problem they seek to resolve, has the mathematical expertise to analyze it, and possesses the computing skills to convert this knowledge into outcomes.

It could be argued that the so-called soft skills are missing from this picture. However, communication, managing people, facilitating change, and so on are competencies that belong to every professional who works in a complex environment, not just the data scientist.

Some critics of this idea point out that these people are unicorns: data scientists who possess all these skills are mythical employees who don't exist in the real world. Most data scientists start from either mathematics or computer science, after which it is hard to become a domain expert. This book is written from the point of view that we can breed unicorns by teaching domain experts how to write code and, where required, enhancing their mathematical skills.
