Practical Data Analysis Using Jupyter Notebook

Practical Data Analysis Using Jupyter Notebook: Learn how to speak the language of data by extracting useful and actionable insights using Python

Marc Wintjen
Paperback Jun 2020 322 pages 1st Edition

Fundamentals of Data Analysis

Welcome and thank you for reading my book. I'm excited to share my passion for data and I hope to provide the resources and insights to fast-track your journey into data analysis. My goal is to educate, mentor, and coach you throughout this book on the techniques used to become a top-notch data analyst. During this process, you will get hands-on experience using the latest open source technologies available such as Jupyter Notebook and Python. We will stay within that technology ecosystem throughout this book to avoid confusion. However, you can be confident the concepts and skills learned are transferable across open source and vendor solutions with a focus on all things data.

In this chapter, we will cover the following:

  • The evolution of data analysis and why it is important
  • What makes a good data analyst?
  • Understanding data types and why they are important
  • Data classifications and data attributes explained
  • Understanding data literacy

The evolution of data analysis and why it is important

To begin, we should define what data is. You will find varying definitions but I would define data as the digital persistence of facts, knowledge, and information consolidated for reference or analysis. The focus of my definition should be the word persistence because digital facts remain even after the computers used to create them are powered down and they are retrievable for future use. Rather than focus on the formal definition, let's discuss the world of data and how it impacts our daily lives. Whether you are reading a review to decide which product to buy or viewing the price of a stock, consuming information has become significantly easier to allow you to make informed data-driven decisions.

Data has been entangled into products and services across every industry from farming to smartphones. For example, America's Grow-a-Row, a New Jersey farm to food bank charity, donates over 1.5 million pounds of fresh produce each year to feed people in need throughout the region, according to their annual report. America's Grow-a-Row has thousands of volunteers and uses data to maximize production yields during the harvest season.

As the demand for being a consumer of data has increased, so has the supply side, which is characterized as the producer of data. Producing data has increased in scale as the technology innovations have evolved. I'll discuss this in more detail shortly, but this large scale consumption and production can be summarized as big data. A National Institute of Standards and Technology report defined big data as consisting of extensive datasets—primarily in the characteristics of volume, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.

This explosion of big data is characterized by the 3Vs, which are Volume, Velocity, and Variety, and has become a widely accepted concept among data professionals:

  • Volume is based on the quantity of data that is stored in any format such as image files, movies, and database transactions, which are measured in gigabytes, terabytes, or even zettabytes. To give context, you can store hundreds of thousands of songs or pictures on one terabyte of storage space. Even more amazing than the figures is how little it costs you. Google Drive, for example, offers 15 GB (gigabytes) of storage for free according to their support site.
  • Velocity is the speed at which data is generated. This process covers how data is both produced and consumed. For example, batch processing is how data feeds are sent between systems where blocks of records or bundles of files are sent and received. Modern velocity approaches are real time, streams of data where the data flow is in a constant state of movement.
  • Variety is all of the different formats that data can be stored in, including text, image, database tables, and files. This variety has created both challenges and opportunities for analysis because of the different technologies and techniques required to work with the data.

Understanding the 3Vs is important for data analysis because you must become good at being both a consumer and producer of data. Simple questions such as how your data is stored, when a file was produced, where a database table is located, and in what format you should store the output of your analysis can all be addressed by understanding the 3Vs.

Some argue that the 3Vs should be expanded to include Value, Visualization, and Veracity, although I disagree. No worries, we will cover these concepts throughout this book.

This leads us to a formal definition of data analysis: a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making, as stated in Review of business intelligence through data analysis.

Xia, B. S., & Gong, P. (2015). Review of business intelligence through data analysis. Benchmarking, 21(2), 300-311. doi:10.1108/BIJ-08-2012-0050

What I like about this definition is the focus on solving problems using data without the focus on which technologies are used. To make this possible there have been some significant technological milestones, the introduction of new concepts, and people who have broken down the barriers.

To showcase the evolution of data analysis, I compiled a few tables of key events from 1945 until 2018 that I feel are the most influential. The events range from innovators such as Dr. E.F. Codd, who created the concept of a relational database, to the launch of the iPhone, which spawned the mobile analytics industry.

The data behind the following diagram was collected from multiple sources, centralized in one place as a table of columns and rows, and then visualized using a dendrogram chart. I posted the CSV file in the GitHub repository for reference: https://github.com/PacktPublishing/python-data-analysis-beginners-guide. Organizing the information and conforming the data in one place made the data visualization easier to produce and enables further analysis:

That process of collecting, formatting, and storing data in this readable format demonstrates the first step of becoming a producer of data. To make this information easier to consume, I summarize these events by decades in the following table:

Decade | Count of Milestones
1940s | 2
1950s | 2
1960s | 1
1970s | 2
1980s | 5
1990s | 9
2000s | 14
2010s | 7

From the preceding summary table, you can see that the majority of these milestone events occurred in the 1990s and 2000s. What is insightful about this analysis is that recent innovations have removed the barriers of entry for individuals to work with data. Before the 1990s, the high purchasing costs of hardware and software restricted the field of data analysis to a relatively limited number of careers. The costs associated with access to the underlying data for analysis were also high. Working with data typically required higher education and a specialized career, such as software programmer or actuary.
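To check the claim that most milestones fall in the 1990s and 2000s, the counts from the summary table can be added up directly. This is only a minimal sketch; the variable names are illustrative and are not taken from the book's repository.

```python
# Milestone counts per decade, copied from the summary table above.
milestones_by_decade = {
    "1940s": 2, "1950s": 2, "1960s": 1, "1970s": 2,
    "1980s": 5, "1990s": 9, "2000s": 14, "2010s": 7,
}

total = sum(milestones_by_decade.values())
peak = milestones_by_decade["1990s"] + milestones_by_decade["2000s"]
peak_share = peak / total  # fraction of all milestones in the two peak decades

print(f"Total milestones: {total}")
print(f"1990s and 2000s combined: {peak} ({peak_share:.0%})")
```

The two decades together account for 23 of the 42 events, which is a little over half.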

A visual way to look at this same data would be a trend bar chart, as shown in the following diagram. In this example, the height of the bars represents the same information as in the preceding table and the Count of Milestone events is on the left or the y axis. What is nice about this visual representation of the data is that it is a faster way for the consumer to see the upward pattern of where most events occur without scanning through the results found in the preceding diagram or table:
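A rough text version of such a bar chart can be produced with a few lines of Python. This sketch assumes nothing beyond the decade counts from the table; the helper name text_bar_chart is hypothetical, and the bars run horizontally here rather than vertically as in the diagram.

```python
# Milestone counts per decade, copied from the summary table.
milestones_by_decade = {
    "1940s": 2, "1950s": 2, "1960s": 1, "1970s": 2,
    "1980s": 5, "1990s": 9, "2000s": 14, "2010s": 7,
}

def text_bar_chart(counts, symbol="#"):
    """Return one line per category; the bar length equals the count."""
    width = max(len(label) for label in counts)
    return [
        f"{label:<{width}} | {symbol * value} {value}"
        for label, value in counts.items()
    ]

chart_lines = text_bar_chart(milestones_by_decade)
print("\n".join(chart_lines))
```

Even in this crude form, the longer bars for the 1990s and 2000s stand out faster than the same numbers do in a table.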

The evolution of data analysis is important to understand because it introduces some of the pioneers who opened doors to opportunities and careers working with data, along with the key technology breakthroughs that significantly reduced the time needed to make decisions with data, both as consumers and producers.

What makes a good data analyst?

I will now break down the contributing factors that make up a good data analyst. From my experience, a good data analyst must be eager to learn and continue to ask questions throughout the process of working with data. The focus of those questions will vary based on the audience who are consuming the results. To be an expert in the field of data analysis, excellent communication skills are required so you can understand how to translate raw data into insights that can impact change in a positive way. To make it easier to remember, use the following acronyms to help to improve your data analyst skills.

Know Your Data (KYD)

Knowing your data is all about understanding the source technology that was used to create the data along with the business requirements and rules used to store it. Do research ahead of time to understand what the business is all about and how the data is used. For example, if you are working with a sales team, learn what drives their team's success. Do they have daily, monthly, or quarterly sales quotas? Do they do reporting for month-end/quarter-end that goes to senior management and has to be accurate because it has financial impacts on the company? Learning more about the source data by asking questions about how it will be consumed will help focus your analysis when you have to deliver results.


KYD is also about data lineage, which is understanding how the data was originally sourced including the technologies used along with the transformations that occurred before, during, and afterward. Refer back to the 3Vs so you can effectively communicate the responses from common questions about the data such as where this data is sourced from or who is responsible for maintaining the data source.

Voice of the Customer (VOC)

The concept of VOC is nothing new and has been taught at universities for years as a well-known concept applied in sales, marketing, and many other business operations. VOC is the concept of understanding customer needs by learning from or listening to their needs before, during, and after they use a company's product or service. The relevance of this concept remains important today and should be applied to every data project that you participate in. This process is where you should interview the consumers of the data analysis results before even looking at the data. If you are working with business users, listen to what their needs are and write down the specific business questions they are trying to answer.

Schedule a working session with them where you can engage in a dialog. Make sure you focus on their current pain points such as the time to curate all of the data used to make decisions. Does it take three days to complete the process every month? If you can deliver an automated data product or a dashboard that can reduce that time down to a few mouse clicks, your data analysis skills will make you look like a hero to your business users.

During a tech talk at a local university, I was asked the difference between KYD and VOC. I explained that both are important and focused on communicating and learning more about the subject area or business. The key differences are prepared versus present. KYD is all about doing your homework ahead of time to be prepared before talking to experts. VOC is all about listening to the needs of your business or consumers regarding the data.

Always Be Agile (ABA)

The agile methodology has become commonplace in the industry across the Software Development Life Cycle (SDLC) for application, web, and mobile development. One of the reasons the agile project management process is successful is that it creates an interactive line of communication between the business and technical teams to iteratively deliver business value through the use of data and usable features.

The agile process involves creating stories with a common theme where a development team completes tasks in 2-3 week sprints. In that process, it is important to understand the what and the why for each story including the business value/the problem you are trying to solve.

The agile approach has ceremonies where the developers and business sponsors come together to capture requirements and then deliver incremental value. That improvement in value could be anything from a new dataset available for access to a new feature added to an app.

See the following diagram for a nice visual representation of these concepts. Notice how these concepts are not linear and should require multiple iterations, which help to improve the communication between all people involved in the data analysis before, during, and after delivery of results:

Finally, I believe the most important trait of a good data analyst is a passion for working with data. If your passion can be fueled by continuously learning about all things data, it becomes a lifelong and fulfilling journey.

Understanding data types and their significance

As we have uncovered with the 3Vs, data comes in all shapes and sizes, so let's break down some key data types and better understand why they are important. To begin, let's classify data in general terms of unstructured, semi-structured, and structured.

Unstructured data

The concept behind unstructured data, which is textual in nature, has been around since the 1990s and includes the following examples: the body of an email message, tweets, books, health records, and images. A simple example of unstructured data would be an email message body that is classified as free text. Free text may have some obvious structure that a human can identify such as free space to break up paragraphs, dates, and phone numbers, but having a computer identify those elements would require programming to classify any data elements as such. What makes free text challenging for data analysis is its inconsistent nature, especially when trying to work with multiple examples.

When working with unstructured data, there will be inconsistencies because of the nature of free text including misspellings, the different classification of dates, and so on. Always have a peer review of the workflow or code used to curate the data.
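To illustrate why free text needs programming before a computer can identify its elements, here is a minimal sketch that pulls dates and phone numbers out of a made-up email body with regular expressions. The sample text and both patterns are assumptions for illustration; they only cover two date spellings and one US phone layout, and real messages would need many more rules.

```python
import re

# Hypothetical email body containing dates and phone numbers as free text.
email_body = (
    "Hi team, the review meeting moved from 8/19/2000 to 2000-08-21.\n"
    "Call me at 212-555-1212 or 212 555-4567 if that does not work."
)

# Two date spellings: M/D/YYYY and YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")
# US-style phone numbers with a hyphen or a space after the area code.
PHONE_PATTERN = re.compile(r"\b\d{3}[- ]\d{3}-\d{4}\b")

dates = DATE_PATTERN.findall(email_body)
phones = PHONE_PATTERN.findall(email_body)

print(dates)
print(phones)
```

Note that the same kind of value appears in two different spellings, which is exactly the inconsistency that makes a peer review of the curation code worthwhile.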

Semi-structured data

Next, we have semi-structured data, which is similar to unstructured data. The key difference is the addition of tags, which are keywords or any classification used to create a natural hierarchy. Examples of semi-structured data are XML and JSON files, as shown in the following code:

{
  "First_Name": "John",
  "Last_Name": "Doe",
  "Age": 42,
  "Home_Address": {
    "Address_1": "123 Main Street",
    "Address_2": [],
    "City": "New York",
    "State": "NY",
    "Zip_Code": "10021"
  },
  "Phone_Number": [
    {
      "Type": "cell",
      "Number": "212-555-1212"
    },
    {
      "Type": "home",
      "Number": "212 555-4567"
    }
  ],
  "Children": [],
  "Spouse": "yes"
}

This JSON formatted code allows for free text elements such as a street address, a phone number, and age, but now has tags created to identify those fields and values, which is a concept called key-value pairs. This key-value pair concept allows for the classification of data with a structure for analysis such as filtering, but still has the flexibility to change the elements as necessary to support the unstructured/free text. The biggest advantage of semi-structured data is the flexibility to change the underlying schema of how the data is stored. The schema is a foundational concept of traditional database systems that defines how the data must be persisted (that is, stored on disk).

The disadvantage of semi-structured data is that you may still find inconsistencies with data values depending on how the data was captured. Ideally, the burden of consistency is moved to the User Interface (UI), which would have coded standards and business rules such as required fields to increase the quality. As a data analyst who practices KYD, however, you should validate that during the project.
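As a minimal sketch of working with key-value pairs, the following Python reads a shortened copy of the JSON record shown earlier, filters on a tag, and then checks the phone numbers against one expected layout. The expected layout (NNN-NNN-NNNN) is an assumption chosen for illustration, not a rule stated by the source system.

```python
import json
import re

# A shortened copy of the JSON record shown earlier in this section.
record_json = """
{
  "First_Name": "John",
  "Home_Address": {"City": "New York", "State": "NY"},
  "Phone_Number": [
    {"Type": "cell", "Number": "212-555-1212"},
    {"Type": "home", "Number": "212 555-4567"}
  ]
}
"""

record = json.loads(record_json)

# Keys (tags) let us navigate the hierarchy and filter values.
city = record["Home_Address"]["City"]
cell_numbers = [
    phone["Number"] for phone in record["Phone_Number"] if phone["Type"] == "cell"
]

# Assumed business rule: phone numbers should look like 212-555-1212.
EXPECTED_PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
inconsistent = [
    phone["Number"]
    for phone in record["Phone_Number"]
    if not EXPECTED_PHONE.match(phone["Number"])
]

print(city, cell_numbers, inconsistent)
```

The tags make filtering easy, but the home number still fails the format check, which shows how inconsistencies survive in semi-structured data.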

Structured data

Finally, we have structured data, which is the most common type found in databases and data created from applications (apps or software) and code. The biggest benefit with structured data is consistency and relatively high quality between each record, especially when stored in the same database table. The conformity of data and structure is the foundation for analysis, which allows both the producers and consumers of structured data to come to the same results. The topic of databases, or Database Management Systems (DBMS) and Relational Database Management Systems (RDBMS), is vast and will not be covered here, but having some understanding will help you to become a better data analyst.

The following diagram is a basic Entity-Relationship (ER) diagram of three tables that would be found in a database:

In this example, each entity would represent physical tables stored in the database, named car, part, and car_part_bridge. The relationship between the car and part is defined by the table called car_part_bridge, which can be classified by multiple names such as bridge, junction, mapping, or link table. The name of each field in the table would be on the left such as part_id, name, or description found in the part table.

The pk label next to the car_id and part_id field names helps to identify the primary keys for each table. This allows for one field to uniquely identify each record found in the table. If a primary key in one table exists in another table, it would be called a foreign key, which is the foundation of how the relationship between the tables is defined and ultimately joined together.

Finally, the text aligned on the right side next to the field name labeled as int or text is the data type for each field. We will cover that concept next and you should now feel comfortable with the concepts for identifying and classifying data.
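The three tables can be sketched with Python's built-in sqlite3 module to show how primary and foreign keys join records together. The table names and the part fields come from the text above; the name column on car and all sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE car (car_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE part (part_id INTEGER PRIMARY KEY, name TEXT, description TEXT);
-- The bridge table holds foreign keys that point at both primary keys.
CREATE TABLE car_part_bridge (
    car_id INTEGER REFERENCES car (car_id),
    part_id INTEGER REFERENCES part (part_id),
    PRIMARY KEY (car_id, part_id)
);
INSERT INTO car VALUES (1, 'Sedan'), (2, 'Pickup');
INSERT INTO part VALUES
    (10, 'Wheel', 'Standard wheel'),
    (20, 'Tow hitch', 'Rear hitch');
INSERT INTO car_part_bridge VALUES (1, 10), (2, 10), (2, 20);
""")

# Join car to part through the bridge table.
rows = conn.execute("""
    SELECT car.name, part.name
    FROM car
    JOIN car_part_bridge AS bridge ON bridge.car_id = car.car_id
    JOIN part ON part.part_id = bridge.part_id
    ORDER BY car.name, part.name
""").fetchall()
conn.close()

for car_name, part_name in rows:
    print(car_name, "uses", part_name)
```

Because the bridge table stores only key pairs, one part can belong to many cars and one car can have many parts without duplicating either record.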

Common data types

Data types are a well-known concept in programming languages and are found in many different technologies. I have simplified the definition to the details of the data that is stored and its intended usage. A data type also creates consistency for each data value as it's stored on disk or in memory.

Data types will vary depending on the software and/or database used to create the structure. Hence, we won't be covering all the different types across all of the different coding languages but let's walk through a few examples:

Common data type | Common short name | Sample value | Example usage
Integers | int | 1235 | Counting occurrences, summing values, or the average of values such as sum(hits)
Booleans | bit | TRUE | Conditional testing such as if sales > 1,000, true else false
Geospatial | float or spatial | 40.229290, -74.936707 | Geo analytics based on latitude and longitude
Characters/string | char | A | Tagging, binning, or grouping data
Floating-point numbers | float or double | 2.1234 | Sales, cost analysis, or stock price
Alphanumeric strings | blob or varchar | United States | Tagging, binning, encoding, or grouping data
Time | time, timestamp, date | 8/19/2000 | Time-series analysis or year-over-year comparison

Technologies change, and legacy systems will offer opportunities to see data types that may not be common. The best advice when dealing with new data types is to validate them against the source system by speaking to an SME (Subject Matter Expert) or system administrator, or to ask for documentation that includes the active version used to persist the data.

In the preceding table, I've created a summary of some common data types. Getting comfortable understanding the differences between data types is important because it determines what type of analysis can be performed on each data value. Numeric data types such as integer (int), floating-point numbers (float), or double are used for mathematical calculations of values such as the sum of sales, count of apples, or the average price of a stock. Ideally, the source system of record should enforce the data type but there can be and usually are exceptions.
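As a small illustration of what happens when a source system does not enforce a numeric type, the following sketch converts a made-up list of sales values that arrived as text. The sample values and the helper name to_float are hypothetical; how to treat rejected values is a decision to make with the data owner.

```python
# Hypothetical sales values that arrived as text from a source system.
raw_sales = ["1200", "950.50", "N/A", "300"]

def to_float(value):
    """Return the value as a float, or None when it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None

parsed = [to_float(value) for value in raw_sales]
valid = [value for value in parsed if value is not None]
rejected = [raw for raw, value in zip(raw_sales, parsed) if value is None]

total_sales = sum(valid)  # math only works once the type is numeric

print(f"Total sales: {total_sales}")
print(f"Rejected values: {rejected}")
```

Keeping the rejected values visible, rather than silently dropping them, gives you something concrete to raise with the team that maintains the source system.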

As you evolve your data analysis skills, helping to resolve data type issues or offer suggestions to improve them will make the quality and accuracy of reporting better throughout the organization.

String data types that are defined in the preceding table as characters (char) and alphanumeric strings (varchar or blob) can be represented as text such as a word or full sentence. Time is a special data type that can be represented and stored in multiple ways such as 12 PM EST or a date such as 08/19/2000. Consider geospatial coordinates such as latitude and longitude, which can be stored in multiple data types depending on the source system.
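Because the same date can be stored in several ways, a common first step is to convert every spelling to one date type. The sketch below tries a short list of assumed formats in order; the list is illustrative only and would need to match whatever the source system actually produces.

```python
from datetime import datetime

# The same calendar date written in several ways.
raw_dates = ["08/19/2000", "8/19/2000", "2000-08-19", "08/19/00"]

# Assumed formats, tried in order: MM/DD/YYYY, YYYY-MM-DD, MM/DD/YY.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%m/%d/%y"]

def parse_date(text):
    """Return a date for the first format that matches the text."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

parsed_dates = [parse_date(text) for text in raw_dates]
distinct_dates = set(parsed_dates)

print(distinct_dates)
```

Four different strings collapse to a single date value, which is what makes time-series comparisons possible.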

The goal of this chapter is to introduce you to the concept of data types and future chapters will give direct, hands-on experience of working with them. The reason why data types are important is to avoid incomplete or inaccurate information when presenting facts and insights from analysis. Invalid or inconsistent data types also restrict the ability to create accurate charts or data visualizations. Finally, good data analysis is about having confidence and trust that your conclusions are complete with defined data types that support your analysis.

Data classifications and data attributes explained

Now that we understand more about data types and why they are important, let's break down the different classifications of data and understand the different data attribute types. To begin with a visual, let's summarize all of the possible combinations in the following summary diagram:

In the preceding diagram, the boxes directly below data have the three methods to classify data, which are continuous, categorical, or discrete.

Continuous data is measurable, quantified with a numeric data type, and has a continuous range with infinite possibilities. The bottom boxes in this diagram are examples so you can easily find them for reference. Continuous data examples include a stock price, weight in pounds, and time.

Categorical (descriptive) data will have values as a string data type. Categorical data is qualitative, so it describes something specific such as a person, place, or thing. Some examples include a country of origin, a month of the year, the different types of trees, and your family designation.

Just because data is defined as categorical, don't assume the values are all alike or consistent. A month can be stored as 1, 2, 3; Jan, Feb, Mar; or January, February, March, or in any combination. You will learn more about how to clean and conform your data for consistent analysis in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.

A discrete data type can be either continuous or categorical depending on how it's used for analysis. Examples include the number of employees in a company. You must have an integer/whole number representing the count of employees, because you can never have partial results such as half an employee. Discrete data is continuous in nature because of its numeric properties but also has limits that make it similar to categorical. Another example would be the numbers on a roulette wheel. The wheel has a limited set of whole numbers that a player can bet on (1 to 36, plus 0 and 00), and the numbers can also be categorized as red, black, or green depending on the value.

If only two discrete values exist, such as yes/no or true/false or 1/0, it can also be classified as binary.
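The classifications above can be turned into a rough first-pass check on a column of values. The following is a heuristic sketch only: the function name classify_values and its rules are assumptions, and real data will need judgment about how a field is actually used for analysis.

```python
def classify_values(values):
    """Guess a classification for a list of values (heuristic only)."""
    distinct = set(values)
    has_float = any(isinstance(value, float) for value in values)
    # Exactly two distinct non-float values, such as yes/no or 1/0.
    if len(distinct) == 2 and not has_float:
        return "binary"
    if all(isinstance(value, str) for value in values):
        return "categorical"
    if all(isinstance(value, int) for value in values):
        return "discrete"
    if all(isinstance(value, (int, float)) for value in values):
        return "continuous"
    return "mixed"

stock_prices = classify_values([101.25, 99.8, 100.1])
countries = classify_values(["USA", "Canada", "Mexico"])
employee_counts = classify_values([12, 250, 31])
has_spouse = classify_values(["yes", "no", "yes"])

print(stock_prices, countries, employee_counts, has_spouse)
```

A result of "mixed" is a useful signal in itself, since it usually means the field needs cleaning before any classification is meaningful.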

Data attributes

Now that we understand how to classify data, let's break down the attribute types available to better understand how you can use them for analysis. The easiest method to break down types is to start with how you plan on using the data values for analysis:

  • Nominal data is defined as data where you can distinguish between different values but not necessarily order them. It is qualitative in nature, so think of nominal data as labels or names, such as stocks or bonds, where math cannot be performed on them because they are string values. With nominal values, you cannot determine whether stocks are better or worse than bonds without additional information.
  • Ordinal data is ordered data where a ranking exists, but the distance or range between values cannot be defined. Ordinal data is qualitative using labels or names but now the values will have a natural or defined sequence. Similar to nominal data, ordinal data can be counted but not calculated with all statistical methods.

An example is assigning 1 = low, 2 = medium, and 3 = high values. This has a natural sequence but the difference between low and high cannot be quantified by itself. The data assigned to low and high values could be arbitrary or have additional business rules behind it.

Another common example of ordinal data is natural hierarchies such as state, county, and city, or grandfather, father, and son. The relationships between these values are well defined and commonly understood without any additional information to support them. So, a son will have a father, but a father cannot be the son of his own son.

  • Interval data is like ordinal data, but the distance between data points is uniform. Weight on a scale in pounds is a good example because the difference between the values from 5 to 10, 10 to 15, and 20 to 25 are all the same. Note that not every arithmetic operation can be performed on interval data so understanding the context of the data and how it should be used becomes important.

Temperature is a good example to demonstrate this paradigm. You can record hourly values and even provide a daily average, but summing the values per day or week would not provide accurate information for analysis. See the following diagram, which provides an hourly temperature for a specific day. Notice the x axis breaks out the hours and the y axis provides the average, which is labeled Avg Temperature, in Fahrenheit. The values between each hour must be an average or mean because an accumulation of temperature would provide misleading results and inaccurate analysis:

  • Ratio data allows for all arithmetic operations including sum, average, median, mode, multiplication, and division. The data types of integer and float discussed earlier are classified as ratio data attributes, which in turn are also numeric/quantitative. Time could also be classified as ratio data; however, I decided to further break down this attribute because of how often it is used for data analysis.
Note that there are advanced statistical details about ratio data attributes that are not covered in this book, such as having an absolute or true zero, so I encourage you to learn more about the subject.
  • Time data attributes are a rich subject that you will come across regularly during your data analysis journey. Time data covers both date and time or any combination, for example, the time as HH:MM AM/PM, such as 12:03 AM; the year as YYYY, such as 1980; a timestamp represented as YYYY-MM-DD hh:mm:ss, such as 2000-08-19 14:32:22; or even a date as MM/DD/YY, such as 08/19/00. What's important when dealing with time data is to identify the intervals between each value so you can accurately measure the difference between them.
It is common during many data analysis projects to find gaps in the sequence of time data values. For example, you are given a dataset with a range from 08/01/2019 to 08/31/2019, but only 25 distinct date values exist versus the expected 31 days of data. There are various reasons for this occurrence including system outages where log data was lost. How to handle those data gaps will vary depending on the type of analysis you have to perform, including the need to fill in missing results. We will cover some examples in Chapter 7, Exploring Cleaning, Refining, and Blending Datasets.
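Finding gaps in a date sequence can be sketched by comparing the dates you expect against the dates you actually observe. In this example the outage from August 10 to August 15 is invented to reproduce the 25-of-31 situation described above; the variable names are illustrative.

```python
from datetime import date, timedelta

start, end = date(2019, 8, 1), date(2019, 8, 31)

# Every calendar day in the range, inclusive of both ends.
expected = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Hypothetical outage: pretend no log data exists for August 10-15.
outage = {date(2019, 8, day) for day in range(10, 16)}
observed = {day for day in expected if day not in outage}

missing = [day for day in expected if day not in observed]

print(f"Expected {len(expected)} days, observed {len(observed)}")
print("Missing:", [day.isoformat() for day in missing])
```

Listing the missing dates explicitly makes it easier to decide whether to fill them in, exclude them, or ask the data owner what happened.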

Understanding data literacy

Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and argue with data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework to communicate about data between different audiences becomes an important skill to master.

Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, having them fluent in the language of data will save time. Time is saved by asking direct questions about the data types and data attributes with the engineering team versus searching for those details aimlessly.

Let's break down the concepts of data literacy to help to identify how it can be applied to your personal and professional life.

Reading data

What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.

Reading data may not necessarily provide the consumer with all of the answers to their questions. Having domain expertise may be required to understand how, when, and why a dataset was created to allow the consumer to fully interpret the underlying dataset.

For example, you are a data analyst and your colleague sends a file attachment to your email with the subject line as FYI and no additional information in the body of the message. We now know from the What makes a good data analyst? section that we should start asking questions about the file attachment:

  • What methods were used to create the file (human or machine)?
  • What system(s) and workflow were used to create the file?
  • Who created the file and when was it created?
  • How often does this file refresh and is it manual or automated?

Asking these questions helps you to understand the concept of data lineage, which can identify the process of how a dataset was created. This will ensure reading the data will result in understanding all aspects to focus on making decisions from it confidently.

Working with data

I define working with data as a person or system creating a dataset using any technology. The technologies used to create data vary widely, from someone typing rows and columns into a spreadsheet to a software developer using loops and functions in Python code to create a pipe-delimited file.

Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document how, when, and where the data was created, including how often it is refreshed. Publishing this information democratizes the metadata (data about the data) and improves communication between everyone reading and working with the data.
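
One lightweight way to publish this kind of metadata is a small record stored next to the dataset. The sketch below is only an illustration: the file name, the key names, the values, and the missing_metadata helper are all assumptions, not a standard.

```python
import json

# Hypothetical metadata for an illustrative file; every value is an assumption.
metadata = {
    "dataset": "sales_by_quarter.csv",
    "created_by": "sales operations team",
    "created_with": "manual export from a spreadsheet",
    "created_on": "2020-01-15",
    "refresh_frequency": "quarterly",
    "refresh_method": "manual",
}

# The how, when, and where details a consumer would expect to find.
REQUIRED_KEYS = {"created_by", "created_with", "created_on", "refresh_frequency"}


def missing_metadata(record):
    """Return the required metadata keys that are absent or empty, sorted."""
    return sorted(key for key in REQUIRED_KEYS if not record.get(key))


print(json.dumps(metadata, indent=2))
print(missing_metadata(metadata))
```

A check like missing_metadata gives the producer a quick way to confirm that nothing a consumer needs has been left out before the file is shared.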

For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By documenting assumptions and the reasons why the data is stored in a specific format, the person or team working with the data becomes a good data citizen and improves communication for analysis.
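
To see why the time zone matters, here is a small sketch that reads the same timestamp text under both assumptions. The timestamp value is hypothetical, and EST is modeled as a fixed offset of five hours behind UTC.

```python
from datetime import datetime, timedelta, timezone

# EST is a fixed offset of five hours behind UTC.
EST = timezone(timedelta(hours=-5), name="EST")

# A hypothetical timestamp as it might appear in a dataset, with no zone recorded.
raw = "2020-01-15 09:30:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# The same text read under two different assumptions.
as_utc = naive.replace(tzinfo=timezone.utc)
as_est = naive.replace(tzinfo=EST)

# Normalizing both to UTC shows that they are different moments in time.
est_in_utc = as_est.astimezone(timezone.utc)
print(as_utc.isoformat())
print(est_in_utc.isoformat())
print(est_in_utc - as_utc)
```

The two readings differ by five hours, which is enough to move an event to a different day and change the outcome of a daily aggregation.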

Analyzing data

Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic but for data literacy purposes, it can be boiled down to dimensions and measures. Dimensions are distinct nouns such as a person, place, or thing, and measures are verbs based on actions and then aggregated (sum, count, min, max, and average).
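
The following sketch shows the idea with plain Python: a measure is grouped by a dimension and then aggregated. The records are hypothetical, with customer as the dimension and amount as the measure.

```python
from collections import defaultdict

# Hypothetical sales records: "customer" is a dimension, "amount" is a measure.
sales = [
    {"customer": "Customer A", "amount": 1000.0},
    {"customer": "Customer B", "amount": 500.0},
    {"customer": "Customer A", "amount": 2000.0},
]

# Group the measure values by each distinct value of the dimension.
amounts_by_customer = defaultdict(list)
for row in sales:
    amounts_by_customer[row["customer"]].append(row["amount"])

# Apply the aggregations named in the text to each group.
summary = {
    customer: {
        "sum": sum(values),
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "average": sum(values) / len(values),
    }
    for customer, values in amounts_by_customer.items()
}

for customer, stats in summary.items():
    print(customer, stats)
```

Dimensions answer the question "by what?", while measures supply the numbers being summarized.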

Any data visualization or chart is rooted in data modeling, and most modern tech solutions have it built in, so you may already be modeling data without even realizing it.

One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.

You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.

Creating a data dictionary also helps to validate assumptions and is an aid to frame questions about the data when communicating with others. The easiest method to create a data dictionary would be to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then the first row turns into a list of all fields available. Let's walk through an example of how to create your own data dictionary from data. Here, we have a source Sales table representing Product and Customer sales by quarter:

Product

Customer

Quarter 1

Quarter 2

Quarter 3

Quarter 4

Product 1

Customer A

$ 1,000.00

$ 2,000.00

$ 6,000.00

Product 1

Customer B

$ 1,000.00

$ 500.00

Product 2

Customer A

$ 1,000.00

Product 2

Customer C

$ 2,000.00

$ 2,500.00

$ 5,000.00

Product 3

Customer A

$ 1,000.00

$ 2,000.00

Product 4

Customer B

$ 1,000.00

$ 3,000.00

Product 5

Customer A

$ 1,000.00

In the following table, I have transposed the preceding source table to create a new table for analysis, which creates an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table. As you can see from the fields, Record 1 to Record 3 in the header row now become sample rows of data but retain the integrity of each row from the source table. The last two columns on the right in the following table, labeled Estimated Data Type and Dimension or Measure, were added to help to define the use of this data for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:

Field Name | Record 1   | Record 2   | Record 3   | Estimated Data Type | Dimension or Measure
Product    | Product 1  | Product 1  | Product 2  | varchar             | Dimension
Customer   | Customer A | Customer B | Customer A | varchar             | Dimension
Quarter 1  | $ 1,000.00 |            |            | float               | Measure
Quarter 2  | $ 2,000.00 | $ 1,000.00 | $ 1,000.00 | float               | Measure
Quarter 3  | $ 6,000.00 | $ 500.00   |            | float               | Measure
Quarter 4  |            |            |            | float               | Measure
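
The transposition described above can be sketched in a few lines of Python. The three records mirror the sample columns of the data dictionary table, with None standing in for an empty cell. The estimate_type function is a deliberately naive, hypothetical heuristic and not a method from this chapter.

```python
# First three records of the Sales table; None marks an empty cell.
header = ["Product", "Customer", "Quarter 1", "Quarter 2", "Quarter 3", "Quarter 4"]
records = [
    ["Product 1", "Customer A", 1000.0, 2000.0, 6000.0, None],
    ["Product 1", "Customer B", None, 1000.0, 500.0, None],
    ["Product 2", "Customer A", None, 1000.0, None, None],
]


def estimate_type(values):
    """Guess a data type from sample values (a naive heuristic)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, (int, float)) for v in present):
        return "float"
    return "varchar"


# Map each estimated type to its role in analysis.
ROLES = {"float": "Measure", "varchar": "Dimension"}

# zip(*records) transposes rows into columns, one tuple of samples per field.
data_dictionary = []
for field, samples in zip(header, zip(*records)):
    data_type = estimate_type(samples)
    data_dictionary.append((field, samples, data_type, ROLES.get(data_type, "unknown")))

for entry in data_dictionary:
    print(entry)
```

Quarter 4 has no sample values in these three records, so the heuristic reports it as unknown. Classifying it as a float measure, as the table does, is a judgment made by analogy with the other quarters and is exactly the kind of assumption worth confirming with the producer of the data.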

Using this technique can help you to ask the following questions about the data to ensure you understand the results:

  • What year does this dataset represent or is it an accumulation of multiple years?
  • Do the quarters follow a calendar year or a fiscal year?
  • Was Product 5 first introduced in Quarter 4, because there are no prior sales for that product by any customer in Quarter 1 to Quarter 3?

Arguing about the data

Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility and is actually done in many cases behind the scenes. For example, most data engineering teams put in various checks and balances such as alerts during ingestion to avoid missing information. Additional checks would also include rules to look into log files for anomalies or errors in the processing of data.
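
As an illustration of such a check, the sketch below scans an incoming batch for missing required values and prints an alert for each one. The find_missing function, the field names, and the sample batch are hypothetical; a real pipeline would typically route alerts to logging or monitoring rather than print them.

```python
def find_missing(rows, required_fields):
    """Return (row_number, field) pairs where a required field is missing or blank."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                problems.append((number, field))
    return problems


# Hypothetical incoming batch; the second row lacks a customer.
batch = [
    {"product": "Product 1", "customer": "Customer A", "amount": "1000.00"},
    {"product": "Product 2", "customer": "", "amount": "2000.00"},
]

alerts = find_missing(batch, ["product", "customer", "amount"])
for number, field in alerts:
    print(f"ALERT: row {number} is missing '{field}'")
```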

From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate but you should also verify the accuracy of the source data. The first thing to ask would be: does the chart cite a source dataset that is publicly available? The website fivethirtyeight.com is really good at providing access to the raw data and details of the methodologies used to create the analysis and charts found in its news stories. Exposing the underlying dataset and the process used to collect it to the public opens up conversations about the how, what, and why behind the data and is a good method to disprove misinformation.

As a data analyst and creator of data outputs, you should welcome the opportunity to defend your work. Having documentation such as a data dictionary and a GitHub repository, and documenting the methodology used to produce the data, will build trust with the audience and reduce the time it takes them to make data-driven decisions.

Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.

Here are a few to get started:

Let's begin with the Kaggle site, which was created to help companies host data science competitions to solve complex problems using data. Improve your skills in reading and working with data by exploring these datasets and walking through the concepts learned in this chapter, such as identifying the data type for each field and confirming that a data dictionary exists.

Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the offer of transparency behind the news stories published by exposing open GitHub links to their source data and discussions about their methodology behind the data.

Another important open data source is The World Bank, which offers a plethora of options to consume or produce data across the world to help improve life through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when data can be used, but making them freely available opens up opportunities to blend public and private data together with significant time savings.

Summary

Let's look back at what we learned in this chapter and the skills obtained before we move forward. First, we covered a brief history of data analysis and the technological evolution of data by paying homage to the people and milestone events that made working with data possible using modern tools and techniques. We walked through an example of how to summarize these events using a data visual trend chart that showed how recent technology innovations have transformed the data industry.

We focused on why data has become important to make decisions from both a consumer and producer perspective by discussing the concepts for identifying and classifying data using structured, semi-structured, and unstructured examples and the 3Vs of big data: Volume, Velocity, and Variety.

We answered the question of what makes a good data analyst using the techniques of KYD, VOC, and ABA.

Then, we went deeper into understanding data types by walking through the differences between numbers (integer and float) versus strings (text, time, dates, and coordinates). This included breaking down data classifications (continuous, categorical, and discrete) and understanding data attribute types.

We wrapped up this chapter by introducing the concept of data literacy and its importance to the consumers and producers of data by improving communication between them.

In our next chapter, we will get more hands-on by installing and setting up an environment for data analysis and begin the journey of applying the concepts learned about data.

Key benefits

  • Find out how to use Python code to extract insights from data using real-world examples
  • Work with structured data and free text sources to answer questions and add value using data
  • Perform data analysis from scratch with the help of clear explanations for cleaning, transforming, and visualizing data

Description

Data literacy is the ability to read, analyze, work with, and argue using data. Data analysis is the process of cleaning and modeling your data to discover useful information. This book combines these two concepts by sharing proven techniques and hands-on examples so that you can learn how to communicate effectively using data. After introducing you to the basics of data analysis using Jupyter Notebook and Python, the book will take you through the fundamentals of data. Packed with practical examples, this guide will teach you how to clean, wrangle, analyze, and visualize data to gain useful insights, and you'll discover how to answer questions using data with easy-to-follow steps. Later chapters teach you about storytelling with data using charts, such as histograms and scatter plots. As you advance, you'll understand how to work with unstructured data using natural language processing (NLP) techniques to perform sentiment analysis. All the knowledge you gain will help you discover key patterns and trends in data using real-world examples. In addition to this, you will learn how to handle data of varying complexity to perform efficient data analysis using modern Python libraries. By the end of this book, you'll have gained the practical skills you need to analyze data with confidence.

Who is this book for?

This book is for aspiring data analysts and data scientists looking for hands-on tutorials and real-world examples to understand data analysis concepts using SQL, Python, and Jupyter Notebook. Anyone looking to evolve their skills to become data-driven personally and professionally will also find this book useful. No prior knowledge of data analysis or programming is required to get started with this book.

What you will learn

  • Understand the importance of data literacy and how to communicate effectively using data
  • Find out how to use Python packages such as NumPy, pandas, Matplotlib, and the Natural Language Toolkit (NLTK) for data analysis
  • Wrangle data and create DataFrames using pandas
  • Produce charts and data visualizations using time-series datasets
  • Discover relationships and how to join data together using SQL
  • Use NLP techniques to work with unstructured data to create sentiment analysis models
  • Discover patterns in real-world datasets that provide accurate insights

Product Details

Publication date : Jun 19, 2020
Length: 322 pages
Edition : 1st
Language : English
ISBN-13 : 9781838826031

Table of Contents

17 Chapters
Section 1: Data Analysis Essentials
Fundamentals of Data Analysis
Overview of Python and Installing Jupyter Notebook
Getting Started with NumPy
Creating Your First pandas DataFrame
Gathering and Loading Data in Python
Section 2: Solutions for Data Discovery
Visualizing and Working with Time Series Data
Exploring, Cleaning, Refining, and Blending Datasets
Understanding Joins, Relationships, and Aggregates
Plotting, Visualization, and Storytelling
Section 3: Working with Unstructured Big Data
Exploring Text Data and Unstructured Data
Practical Sentiment Analysis
Bringing It All Together
Works Cited
Other Books You May Enjoy

Customer reviews

Rating distribution
3.9 (9 Ratings)
5 star 55.6%
4 star 11.1%
3 star 11.1%
2 star 11.1%
1 star 11.1%

Top Reviews

Amazon Customer Nov 27, 2021
Rating: 5
Being a novice, I found this book very well written, informative and above all, helpful. The author knows how to explain concepts in easy to understand language - I didn’t have to google every other word.
Amazon Verified review

JuliaA Sep 16, 2020
Rating: 5
great book and very fast delivery
Amazon Verified review

Amazon Customer - EF Sep 14, 2020
Rating: 5
Do NOT listen to the bad reviews, they must have something against the author of the book. This book was VERY well-written, easy to understand, and useful. Highly recommend!!
Amazon Verified review

Richard Lyons Mar 15, 2021
Rating: 5
Step by step guides are essential!
Amazon Verified review

Josef Bauer Aug 19, 2021
Rating: 5
For me, exactly the right introduction to the world of data analysis and Jupyter Notebook. Some knowledge of Python is a big advantage, though.
Amazon Verified review
