Mastering Tableau 2021: Implement advanced business intelligence techniques and analytics with Tableau, Third Edition

By Marleen Meier and David Baldwin

4.2 (29 Ratings) | Paperback | May 2021 | 792 pages | 3rd Edition

eBook: S$74.99 (list price S$107.99)
Paperback: S$134.99

Mastering Tableau 2021

All About Data – Getting Your Data Ready

Ever asked yourself whether your data is clean enough to be analyzed? It's likely that everyone who works with data has, which is why this chapter is dedicated to getting your data ready for analysis, otherwise known as data cleaning.

The first part of this chapter is theory-oriented and does not include exercises. A careful reading of this information is encouraged, since it provides a foundation for greater insight. The latter portion of the chapter provides various exercises specifically focused on data preparation.

Now let's dive into this fascinating topic with the goal of enriching our understanding and becoming ever-better data stewards.

In this chapter, we will discuss the following topics:

  • Understanding Hyper
  • Data mining and knowledge discovery process models
  • CRISP-DM
  • Focusing on data preparation
  • Surveying data
  • Cleaning messy data

Since Tableau Desktop 10.5 has been on the market for some time, you may already have heard of Hyper. Regardless of whether you have or not, continue reading for a primer on this useful tool!

Understanding Hyper

In this section, we will explore Tableau's data-handling engine and how it enables structured yet organic data mining processes in enterprises. Since the release of Tableau 10.5, we have been able to use Hyper, a high-performance database that lets us query source data faster than ever before. Hyper is Tableau's data-handling engine, and even advanced developers often understand it poorly because it is not an overt part of day-to-day activities; however, if you want to truly grasp how to prepare data for Tableau, this understanding is crucial.

Hyper originally started as a research project at the University of Munich in 2008. Tableau acquired it in 2016, and the team became Tableau's dedicated data engine group, keeping its base and employees in Munich. Initially, in 10.5, Hyper replaced the earlier data-handling engine for extracts only. Live connections are still untouched by Hyper, but Tableau Prep Builder now runs on the Hyper engine too, with more use cases to follow.

What makes Hyper so fast? Let's have a look under the hood!

The Tableau data-handling engine

The vision shared by the founders of Hyper was to create a high-performing, next-generation database; one system, one state, no trade-offs, and no delays. And it worked—today, Hyper can serve general database purposes, data ingestion, and analytics at the same time.

Memory prices have decreased exponentially. Back in 1996, 1 GB of memory cost around $45,000 to produce. Today, every phone, and even a smartwatch, carries far more than that at a production cost as low as $2. The same goes for CPUs: transistor counts have increased according to Moore's Law, while other features have stagnated. Memory is cheap, but processing still needs to improve.

Moore's Law is the observation made by Intel co-founder Gordon Moore in 1965 that the number of transistors per square inch on integrated circuits had doubled every two years since their invention, while costs halved. More information on Moore's Law can be found on Investopedia at https://www.investopedia.com/terms/m/mooreslaw.asp.

While experimenting with Hyper, the founders observed that handwritten C code outperformed any existing database engine, so they hit on the idea of translating Tableau queries into LLVM code and optimizing that code on the fly, all behind the scenes, so that the Tableau user never notices. This translation and optimization comes at a cost: traditional database engines can start executing queries immediately, whereas Hyper must first translate queries into code, optimize that code, and compile it into machine code before execution can begin. So the big question is: is it still faster? As many tests on Tableau Public and other workbooks have proven, the answer is yes!

Furthermore, if a query is estimated to run faster without compilation to machine code, Tableau executes it right away in its own virtual machine (VM). On top of this, Hyper can utilize 99% of available CPU power, whereas other parallelized processes can only utilize 29% of available CPU power. This is due to a unique and innovative technique called morsel-driven parallelization.

For those of you who want to know more about morsel-driven parallelization, the paper that later served as a baseline for the Hyper engine can be found at https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf.

If you want to know more about the Hyper engine, I highly recommend the following video: https://youtu.be/h2av4CX0k6s.

Hyper parallelizes three steps of traditional data warehousing operations:

  • Transactions and Continuous Data Ingestion (Online Transaction Processing, or OLTP)
  • Analytics (Online Analytical Processing, or OLAP)
  • Beyond Relational (Online Beyond Relational Processing, or OBRP)

Executing these steps simultaneously makes Hyper more efficient and more performant than traditional systems, in which the three steps are separated and executed one after the other.

To sum up, Hyper is a highly specialized database engine that allows us, as users, to get the best out of our queries. If you recall, in Chapter 1, Getting Up to Speed – A Review of the Basics, we saw that every change on a sheet or dashboard, including dragging and dropping pills, filters, and calculated fields, is translated into queries. Those queries look much like SQL; in Tableau, however, the querying engine is called VizQL.

VizQL, another hidden gem in Tableau Desktop, is responsible for rendering data as visualizations and is fully executed in memory. The advantage is that no additional space is required on the database side. VizQL is generated when a user places a field on a shelf. It is then translated into SQL, MDX, or Tableau Query Language (TQL) and passed to the backend data source via a driver. Two aspects of the VizQL module are of primary importance:

  • VizQL allows you to change field attributions on the fly
  • VizQL enables table calculations

We'll discuss these two aspects in more detail in the following sections.

Changing field attribution

In this section, we'll demonstrate how changing a worksheet's field attribution will allow you more flexibility in your dashboard creation.

Let's look at the World Happiness Report. Navigate to the workbook associated with this chapter at https://public.tableau.com/profile/marleen.meier and open the Score per country sheet. The worksheet was created by placing AVG(Happiness Score) and Country on the Columns and Rows shelves, respectively; AVG(Happiness Score) is, of course, treated as a measure in this case. Lastly, the countries were sorted by happiness score from highest to lowest.

Let's take a look at this in the following screenshot:

Figure 2.1: Happiness score by country

Next, please create a second worksheet called Score/Rank to analyze the scores relative to the ranks by using Happiness Score on Rows and Happiness Rank on Columns. Both pills should be continuous, hence green-colored.

In order to accomplish this, the user defines Happiness Rank as a Dimension, as shown in the following screenshot:

Figure 2.2: Ranking score per country

Please note that Columns and Rows have been moved to the left for better readability. This can be achieved by dragging and dropping the shelves.

In order to add steps to your visualization, click on Path in the Marks Card and select the second option, Step.

You can view the code generated by Tableau that is passed to the data source with the performance recorder, which is accessible through Help, then Settings and Performance, and then Start Performance Recording. See Chapter 13, Improving Performance, for additional details.

Studying the SQL generated by VizQL to create the preceding visualization is particularly insightful:

SELECT ['Happiness Report$'].[Happiness.Rank] AS [Happiness.Rank],
  AVG(['Happiness Report$'].[Happiness.Score]) AS [avg:Happiness.Score:ok]
FROM [dbo].['Happiness Report$'] ['Happiness Report$']
GROUP BY ['Happiness Report$'].[Happiness.Rank]

The GROUP BY clause clearly communicates that Happiness Rank is treated as a dimension, because grouping is only possible on dimensions. The takeaway is that VizQL enables the analyst to change the generated SQL by switching a field from measure to dimension, without modifying the source metadata. This on-the-fly ability enables creative exploration of the data that's not possible with other tools, and avoids lengthy exercises attempting to define all possible uses for each field.
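
If you want the dimension treatment to be permanent rather than a drag-and-drop change of the default role, a calculated field can make the conversion explicit. Here is a minimal sketch using this chapter's field names:

    STR([Happiness Rank])   // casting the rank to a string makes Tableau treat the new field as a discrete dimension by default

A cast like this creates a new field whose role is fixed, whereas dragging the original field between the Dimensions and Measures panes changes only its default role.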

The previous section taught us how we can change a field's role in Tableau itself, without touching the data source or its metadata. In the next section, we will take a closer look at table calculations.

Table calculation

In this section, we will explore how VizQL's table calculations can be used to add data to a dashboard without adding any data to the data source.

In the following example, which can be viewed by opening Sheet 4 in this chapter's workbook, note that Freedom on the vertical axis has the Moving Average quick table calculation applied. Calculating a moving average, running total, or other such comparison calculations can be quite challenging to accomplish in a data source. Not only must a data architect consider which comparison calculations to include in the data source, but they must also determine the dimensions for which these calculations are relevant.

VizQL greatly simplifies such challenges using table calculations, as seen in the following screenshot:

Figure 2.3: Moving average

Taking a look at the relevant portion of SQL generated by the preceding worksheet shows that the table calculation is not performed by the data source. Instead, it is performed in Tableau by the VizQL module.

The following is the SQL query:

SELECT SUM([Happiness Report$].[Freedom]) AS [sum:Freedom:ok],
  [Happiness Report$].[Happiness.Rank] AS [Happiness.Rank]
FROM [dbo].[Happiness Report$] [Happiness Report$]
GROUP BY [Happiness Report$].[Happiness.Rank]

To reiterate, nothing in the preceding call to the data source generates the moving average. Only an aggregated total is returned, and Tableau calculates the moving average with VizQL.
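
Under the hood, a quick table calculation is simply a regular table calculation that Tableau writes for you. A manual equivalent of the moving average might look like the following sketch; the window bounds of -2 and 0 are an assumption, as the quick table calculation dialog lets you choose them:

    WINDOW_AVG(SUM([Freedom]), -2, 0)   // averages the current mark and the two preceding marks in the partition

Writing the calculation yourself gives you full control over the window bounds as well as the addressing and partitioning settings.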

Hyper takeaways

This overview of the Tableau data-handling engine demonstrates a flexible approach to interfacing with data. Knowledge of the data-handling engine is helpful if you want to understand the parameters for Tableau data readiness. Two major takeaways from this section are as follows:

  • It is not necessary to explicitly define data types and roles for optimal Tableau usage.
  • Comparison calculations such as moving averages and running totals can be addressed by table calculations in Tableau and thus do not need to be calculated in advance.

Knowing these two takeaways can reduce data preparation and data modeling efforts, and thus helps streamline the overall data mining life cycle. Don't worry too much about data types or about data that can be calculated from the fields already in your database; Tableau can do that work for you. In the next section, we will discuss what you should consider from a data source perspective.

Data mining and knowledge discovery process models

Data modeling, data preparation, database design, data architecture—the question that arises is, how do these and other similar terms fit together? This is no easy question to answer! Terms may be used interchangeably in some contexts and be quite distinct in others. Also, understanding the interconnectivity of any technical jargon can be challenging.

In the data world, data mining and knowledge discovery process models attempt to consistently define terms and contextually position and define the various data sub-disciplines. Since the early 1990s, various models have been proposed.

Survey of the process models

The following table compares three data processing models, each a blueprint for conducting a data mining project; all of them are used to discover patterns and relationships in data in order to support better business decisions.

The following table is adapted from A Survey of Knowledge Discovery and Data Mining Process Models by Lukasz A. Kurgan and Petr Musilek, published in The Knowledge Engineering Review, Volume 21, Issue 1, March 2006.

Later on, we will see how Tableau comes into play and makes this process easier and faster for us.

Phase      | KDD                       | CRISP-DM               | SEMMA
Phase I    | Selection                 | Business understanding | Sample
Phase II   | Pre-processing            | Data understanding     | Explore
Phase III  | Transformation            | Data preparation       | Modify
Phase IV   | Data mining               | Modeling               | Model
Phase V    | Interpretation/evaluation | Evaluation             | Assess
Phase VI   | Consolidate knowledge     | Deployment             | -

Since CRISP-DM is used by four to five times as many people as the closest competing model (SEMMA), it is the model we will consider in this chapter. For more information, see http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html.

The important takeaway is that each of these models grapples with the same problems, particularly concerning the understanding, preparing, modeling, and interpreting of data.

CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM) was created between 1996 and 2000 as a result of a consortium including SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA. It divides the process of data mining into six major phases, as shown in the CRISP-DM reference model in the preceding comparison table.

This model provides a bird's-eye view of a data mining project's life cycle. The sequence of the phases is not rigid; jumping back and forth between phases is allowed and expected. Data mining does not cease upon the completion of a particular project. Instead, it continues for as long as the business exists and should be constantly revisited to answer new questions as they arise.

In the next section, we will consider each of the six phases that comprise CRISP-DM and explore how Tableau can be used throughout the life cycle. We will particularly focus on the data preparation phase, as that is the phase encompassing data cleaning, the focus of this chapter. By considering the following steps, you will be able to understand in more detail what a full data mining cycle looks like under CRISP-DM. This framework can be used to make your workflow in Tableau more efficient by working according to an established model.

CRISP-DM phases

In the following sections, we will briefly define each of the six CRISP-DM phases and include high-level information on how Tableau might be used.

Phase I – business understanding:

  • This phase determines the business objectives and corresponding data mining goals. It also assesses risks, costs, and contingencies, and culminates in a project plan.
  • Tableau is a natural fit for presenting information to enhance business understanding.

Phase II – data understanding:

  • This phase begins with an initial data collection exercise. The data is then explored to discover early insights and identify data quality issues.
  • Once the data is collected into one or more relational data sources, Tableau can be used to effectively explore the data and enhance data understanding.

Phase III – data preparation:

  • This phase includes data selection, cleaning, construction, merging, and formatting.
  • Tableau can be effectively used to identify the preparation tasks that need to occur; that is, Tableau can be used to quickly identify the data selection, cleaning, merging, and so on, that should be addressed. Additionally, Tableau can sometimes be used to do actual data preparation. We will walk through examples in the next section.

As Tableau has evolved, functionality has been introduced to do more and more of the actual data preparation work as well as the visualization. For example, Tableau Prep Builder is a standalone product that ships with Tableau Desktop and is dedicated to data prep tasks. We will cover Tableau Prep Builder in Chapter 3, Tableau Prep Builder.

Phase IV – modeling:

  • In this phase, data modeling methods and techniques are considered and implemented in one or more data sources. It is important to choose an approach that works well with Tableau; for example, as discussed in Chapter 6, All About Data – Data Densification, Cubes, and Big Data, Tableau works better with relational data sources than with cubes.
  • Tableau has some limited data modeling capabilities, such as pivoting datasets through the data source page.

Phase V – evaluation:

  • The evaluation phase considers the results; do they meet the business goals with which we started the data mining process? Test the model on another dataset, for example, from another day or on a production dataset, and determine whether it works as well in the workplace as it did in your tests.
  • Tableau is an excellent fit for considering the results during this phase, as it is easy to change the input dataset as long as the metadata layer remains the same—for example, the column header stays the same.

Phase VI – deployment:

  • This phase should begin with a carefully considered plan to ensure a smooth rollout. The plan should include ongoing monitoring and maintenance to ensure continued streamlined access to quality data. Although the phase officially ends with a final report and accompanying review, the data mining process, as stated earlier, continues for the life of the business. Therefore, this phase will always lead back into the previous five phases.
  • Tableau should certainly be considered a part of the deployment phase. Not only is it an excellent vehicle for delivering end-user reporting; it can also be used to report on the data mining process itself. For instance, Tableau can be used to report on the performance of the overall data delivery system and thus be an asset for ongoing monitoring and maintenance.
  • Tableau Server is the best fit for Phase VI. We will discuss this separate Tableau product in Chapter 14, Interacting with Tableau Server/Online.

Now that we have learned what a full data mining cycle looks like (and what it looked like pre-Tableau), and understood that every step can be executed in Tableau, it makes sense that data people celebrate Tableau Software products.

The phrase "data people" is especially memorable after listening to the song written for the 2019 Las Vegas Tableau Conference, at https://youtu.be/UBrH7MXf-Q4.

Tableau makes data mining much easier and more efficient, and replicating steps is also simpler than it was before Tableau. In the next section, we will explore the concepts we've just covered with some hands-on examples.

Focusing on data preparation

As discussed earlier, Tableau can be used effectively throughout the CRISP-DM phases. Unfortunately, a single chapter is not sufficient to thoroughly explore how Tableau can be used in each phase. Indeed, such a thorough exploration may be worthy of an entire book! Our focus, therefore, will be directed to data preparation, since that phase has historically accounted for up to 60% of the data mining effort. Our goal will be to learn how Tableau can be used to streamline that effort.

Surveying data

Tableau can be a very effective tool for simply surveying data. Sometimes in the survey process, you may discover ways to clean the data or populate incomplete data based on existing fields. Sometimes, regretfully, there are simply not enough pieces of the puzzle to put together an entire dataset. In such cases, Tableau can be useful to communicate exactly what the gaps are, and this, in turn, may incentivize the organization to more fully populate the underlying data.

In this exercise, we will explore how to use Tableau to quickly discover the percentage of null values for each field in a dataset. Next, we'll explore how the data might be extrapolated from existing fields to fill in the gaps.

Establishing null values

The following are the steps for surveying the data:

  1. If you haven't done so just yet, navigate to https://public.tableau.com/profile/marleen.meier to locate and download the workbook associated with this chapter.
  2. Navigate to the worksheet entitled Surveying & Exploring Data.
  3. Drag Region and Country to the Rows shelf. Observe that the Region field has Null values for some countries:

    Figure 2.4: Null regions

  4. Right-click the parameter entitled Select Field and choose Edit. Note that the Data Type is set to Integer and that the list of values contains an entry for each field name in the dataset:

    Figure 2.5: Editing a parameter

  5. In the Data pane, right-click on the parameter we just created and select Show Parameter Control.
  6. Create a calculated field entitled % Populated and write the following calculation:
    SUM([Number of Records]) / TOTAL(SUM([Number of Records]))
    
  7. In the Data pane, right-click on % Populated and select Default Properties | Number Format…:

    Figure 2.6: Adjusting default properties

  8. In the resulting dialog box, choose Percentage.
  9. Create a calculated field entitled Null & Populated and add the following code. Note that the complete case statement is fairly lengthy but is also repetitive.

    In cases requiring a lengthy but repetitive calculation, consider using Excel to more quickly and accurately write the code. By using Excel's CONCATENATE function, you may be able to save time and avoid typos.

    In the following code block, the code lines represent only a percentage of the total but should be sufficient to enable you to produce the whole:

    CASE [Select Field]
    WHEN 1 THEN IF ISNULL([Country]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 2 THEN IF ISNULL([Region]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 3 THEN IF ISNULL([Economy (GDP per Capita)]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 4 THEN IF ISNULL([Family]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 5 THEN IF ISNULL([Freedom]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 6 THEN IF ISNULL([Happiness Rank]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 7 THEN IF ISNULL([Happiness Score]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 8 THEN IF ISNULL([Health (Life Expectancy)]) THEN 'Null Values' ELSE 'Populated Values' END
    WHEN 9 THEN IF ISNULL([Standard Error]) THEN 'Null Values' ELSE 'Populated Values' END
    END
    
  10. Remove Region and Country from the Rows shelf.
  11. Place Null & Populated on the Rows and Color shelves and % Populated on the Columns and Label shelves:

    Figure 2.7: Populated values

  12. Change the colors to red for Null Values and green for Populated Values if desired. You can do so by clicking on Color in the Marks card and selecting Edit Colors.
  13. Click on the arrow in the upper right corner of the Select Field parameter on your sheet and select Single Value List.
  14. Select various choices in the Select Field parameter and note that some fields have a high percentage of null values. For example, in the following diagram, 32.98% of records do not have a value for Region:

Figure 2.8: Comparing null and populated values

Building on this exercise, let's explore how we might clean and extrapolate data from existing data using the same dataset.

Extrapolating data

This exercise will expand on the previous exercise by cleaning existing data and populating some of the missing data from known information. We will assume that we know which country belongs to which region. We'll use that knowledge to fix errors in the Region field and also to fill in the gaps using Tableau:

  1. Starting from where the previous exercise ended, create a calculated field entitled Region Extrapolated with the following code block:
    CASE [Country]
    WHEN 'Afghanistan' THEN 'Southern Asia'
    WHEN 'Albania' THEN 'Central and Eastern Europe'
    WHEN 'Algeria' THEN 'Middle East and Northern Africa'
    WHEN 'Angola' THEN 'Sub-Saharan Africa'
    WHEN 'Argentina' THEN 'Latin America and Caribbean'
    WHEN 'Armenia' THEN 'Central and Eastern Europe'
    WHEN 'Australia' THEN 'Australia and New Zealand'
    WHEN 'Austria' THEN 'Western Europe'
    //complete the case statement with the remaining fields in the data set
    END
    

    To speed up the tedious creation of a long calculated field, you could download the data to an Excel file and create the calculated field by concatenating the separate parts, as shown here:

    Figure 2.9: Compiling a calculation in Excel

    You can then copy them from Excel into Tableau. However, for this exercise, I have created a backup field called Backup in the Tableau workbook associated with this chapter; it contains the full calculation needed for the Region Extrapolated field. Use it at your convenience. The Solutions dashboard also contains all of the countries, so you can copy the Region Extrapolated field from there too.

  2. Add a Region Extrapolated option to the Select Field parameter:

    Figure 2.10: Adding Region Extrapolated to parameter

  3. Add the following code to the Null & Populated calculated field:
    WHEN 10 THEN IF ISNULL([Region Extrapolated]) THEN 'Null Values' ELSE 'Populated Values' END
    
  4. Note that the Region Extrapolated field is now fully populated:

Figure 2.11: Fully populated Region Extrapolated field

Now let's consider some of the specifics from the previous exercises:

  • Let's look at the following code block:

Note that the complete case statement is several lines long. The following is a representative portion.

CASE [Select Field]
WHEN 1 THEN IF ISNULL([Country]) THEN 'Null Values' ELSE 'Populated Values' END
...

This case statement is a row-level calculation that considers each field in the dataset and determines which rows are populated and which are not. For example, in the representative line of the preceding code, every row of the Country field is evaluated for nulls. The reason for this is that a calculated field will add a new column to the existing data—only in Tableau, not in the data source itself—and every row will get a value. These values can be N/A or null values.

  • The following code is the equivalent of the quick table calculation Percent of Total:
    SUM([Number of Records]) / TOTAL(SUM([Number of Records]))
    

    In conjunction with the Null & Populated calculated field, it allows us to see what percentage of our fields are actually populated with values.

It's a good idea to get into the habit of writing table calculations from scratch, even if an equivalent quick table calculation is available. This will help you more clearly understand the table calculations.
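
For practice, here are from-scratch equivalents of two other quick table calculations, each written as its own calculated field (a sketch using this exercise's field names):

    RUNNING_SUM(SUM([Number of Records]))                            // the Running Total quick table calculation
    SUM([Number of Records]) - LOOKUP(SUM([Number of Records]), -1)  // the Difference quick table calculation

Comparing these against their quick table calculation counterparts is a good way to check your understanding of addressing and partitioning.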

  • The following CASE statement is an example of how you might use one or more fields to extrapolate what another field should be:
    CASE [Country]
    WHEN 'Afghanistan' THEN 'Southern Asia'
    ... END
    

    For example, the Region field in the dataset had a large percentage of null values, and even the existing data had errors. Based on our knowledge of the business (that is, which country belongs to which region) we were able to use the Country field to achieve 100% population of the dataset with accurate information.

Nulls are a part of almost every extensive real dataset. Understanding how many nulls are present in each field can be vital to ensuring that you provide accurate business intelligence. It may be acceptable to tolerate some null values when the final results will not be substantially impacted, but too many nulls may invalidate results. However, as demonstrated here, in some cases one or more fields can be used to extrapolate the values that should be entered into an underpopulated or erroneously populated field.
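
When you do choose to tolerate nulls rather than extrapolate them, it is usually better to make them explicit than to leave gaps in a view. A minimal sketch follows; the 'Unknown' label is an arbitrary choice, and each line is a separate calculated field:

    IFNULL([Region], 'Unknown')    // replaces a null region with an explicit label
    ZN(SUM([Happiness Score]))     // treats a null aggregate as zero

Labeled nulls appear as their own member in filters and color legends, keeping the gaps visible to your audience instead of silently dropping marks.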

As demonstrated in this section, Tableau gives you the ability to effectively communicate to your data team which values are missing and which are erroneous, and to suggest possible workarounds; this can be invaluable to the overall data mining effort. Next, we will look into data that is a bit messier and not in a nice column format. Don't worry, Tableau has us covered.

Cleaning messy data

The United States government provides helpful documentation for various bureaucratic processes. For example, the Department of Health and Human Services (HHS) provides lists of ICD-9 codes, otherwise known as International Statistical Classification of Diseases and Related Health Problems codes. Unfortunately, these codes are not always in easily accessible formats.

As an example, let's consider an actual HHS document known as R756OTN, which can be found at https://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/downloads/R756OTN.pdf.

Cleaning the data

Navigate to the Cleaning the Data worksheet in this workbook and execute the following steps:

  1. Within the Data pane, select the R756OTN Raw data source:

    Figure 2.12: Selecting the raw file

  2. Drag Diagnosis to the Rows shelf and choose Add all members. Note the junk data that occurs in some rows:

    Figure 2.13: Adding Diagnosis to Rows

  3. Create a calculated field named DX with the following code:
    SPLIT([Diagnosis], " ", 1)
    
  4. Create a calculated field named Null Hunting with the following code:
    INT(MID([DX],2,1))
    
  5. In the Data pane, drag Null Hunting from Measures to Dimensions.
  6. Drag Diagnosis, DX, and Null Hunting to the Rows shelf. Observe that Null is returned when the second character in the Diagnosis field is not numeric:

    Figure 2.14: Ordering fields on Rows

  7. Create a calculated field named Exclude from ICD Codes containing the following code:
    ISNULL([Null Hunting])
    
  8. Clear the sheet of all fields, as demonstrated in Chapter 1, Getting Up to Speed – a Review of the Basics, and set the Marks card to Shape.
  9. Place Exclude from ICD Codes on the Rows, Color, and Shape shelves, and then place DX on the Rows shelf. Observe the rows labeled as True:

    Figure 2.15: Excluding junk data

  10. In order to exclude the junk data (that is, those rows where Exclude from ICD Codes equates to TRUE), place Exclude from ICD Codes on the Filter shelf and deselect True.
  11. Create a calculated field named Diagnosis Text containing the following code:
    REPLACE([Diagnosis],[DX] + " ","")
    
  12. Place Diagnosis Text on the Rows shelf after DX. Also, remove Exclude from ICD Codes from the Rows shelf and the Marks Card, and set the mark type to Automatic:

Figure 2.16: Observing the cleaned data

Now that we've completed the exercise, let's take a moment to consider the code we have used:

  • The SPLIT function was introduced in Tableau 9.0:
    SPLIT([Diagnosis], " ", 1)
    

    As described in Tableau's help documentation about the function, the function does the following:

    Returns a substring from a string, as determined by the delimiter, extracting the characters from the beginning or end of the string.

    This function can also be called directly in the Data Source tab when clicking on a column header and selecting Split. To extract characters from the end of the string, the token number (that is, the number at the end of the function) must be negative.

  • Consider the following code, which we used to create the Null Hunting field:
    INT(MID([DX],2,1))
    

    The use of MID is quite straightforward, and is much the same as the corresponding function in Excel. The use of INT in this case, however, may be confusing. Casting an alpha character with an INT function will result in Tableau returning Null. This satisfactorily fulfills our purpose, since we simply need to discover those rows not starting with an integer by locating the nulls.

  • ISNULL is a Boolean function that simply returns TRUE in the case of Null:
    ISNULL([Null Hunting])
    
  • The REPLACE function was used while creating the Diagnosis Text field:
    REPLACE([Diagnosis],[DX] + "","")
    

    This calculated field uses the ICD-9 codes isolated in DX to remove those same codes from the Diagnosis field, thus providing a fairly clean description. Note the phrase fairly clean: some of the removed rows stemmed from longer descriptions that included a carriage return, and those extra rows are what we filtered out in this exercise. As a consequence, the longer descriptions are truncated in this solution by the replace calculation.
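
To see how these functions behave in isolation, consider the following one-line examples; each is a separate expression you could paste into its own calculated field. The literal strings are hypothetical stand-ins rather than rows from the R756OTN document:

    SPLIT('0010 Cholera due to vibrio cholerae', ' ', 1)  // returns '0010', the first space-delimited token
    SPLIT('a-b-c', '-', -1)                               // returns 'c'; a negative token counts from the end
    INT(MID('0010', 2, 1))                                // returns 0, an integer, so the row is kept
    INT(MID('Vibrio', 2, 1))                              // returns Null, flagging a row that does not start with a code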

The final output for this exercise could be to export the data from Tableau as an additional source of data. This data could then be used by Tableau and other tools for future reporting needs. For example, the DX field could be useful in data blending.

Does Tableau offer a better approach that might solve the issue of truncated data associated with the preceding solution? Yes! Let's turn our attention to the next exercise, where we will consider regular expression functions.

Extracting data

Although the SPLIT function can be useful for cleaning data, as shown in the previous exercise, Cleaning the data, regular expression functions are far more powerful. They represent a broadening of Tableau's scope from its traditional focus on visualization and analytics to include data cleaning capabilities. Let's look at an example that requires us to deal with some pretty messy data in Tableau. Our objective will be to extract phone numbers.

The following are the steps:

  1. If you have not already done so, please download the Chapter 2 workbook from https://public.tableau.com/profile/marleen.meier and open it in Tableau.
  2. Select the Extracting the Data tab.
  3. In the Data pane, select the String of Data data source and drag the String of Data field to the Rows shelf. Observe the challenges associated with extracting the phone numbers:

    Figure 2.17: Extracting data from a messy data format

  4. Access the underlying data by clicking the View data button and copy several rows:

    Figure 2.18: Accessing underlying data

  5. Navigate to http://regexpal.com/ and paste the data into the pane labeled Test String; that is, the second pane:

    Figure 2.19: Regexpal

  6. In the first pane (the one labeled Regular Expression), type the following:
    \([0-9]{3}\)-[0-9]{3}-[0-9]{4}
    
  7. Return to Tableau and create a calculated field called Phone Number with the following code block. Note the regular expression nested in the calculated field:
    REGEXP_EXTRACT([String of Data (String of Data)],'(\([0-9]{3}\)-[0-9]{3}-[0-9]{4})')
    
  8. Place Phone Number on the Rows shelf, and observe the result:

Figure 2.20: Extracting data final view

Now let's consider some of the specifics from the preceding exercise in more detail:

  • Consider the following code block:
    REGEXP_EXTRACT([String of Data],'()')
    

    The expression pattern is purposely excluded here as it will be covered in detail later. The '()' code acts as a placeholder for the expression pattern. The REGEXP_EXTRACT function used in this example is described in Tableau's help documentation as follows:

    Returns a substring of the given string that matches the capturing group within the regular expression pattern.
  • Note that as of the time of writing, the Tableau documentation does not communicate how to ensure that the pattern input section of the function is properly delimited. For this example, be sure to include '()' around the pattern input section to avoid a null output.
  • Nesting within a calculated field that is itself nested within a VizQL query can affect performance (if there are too many levels of nesting/aggregation).
  • There are numerous regular expression websites that allow you to enter your own code and help you out, so to speak, by providing immediate feedback based on sample data that you provide. http://regexpal.com/ is only one of those sites, so search as desired to find one that meets your needs!
  • Now, consider the expression:
    \([0-9]{3}\)-[0-9]{3}-[0-9]{4}
    

In this context, the \ indicates that the next character should not be treated as special but as literal. For our example, we are literally looking for an opening parenthesis. [0-9] declares that we are looking for a single digit; alternatively, \d achieves the same result. The {3} quantifier designates that we are looking for three consecutive digits.

    As with the opening parenthesis at the beginning of the pattern, the \ character designates the closing parentheses as a literal. The - is a literal that specifically looks for a hyphen. The rest of the expression pattern should be decipherable based on the preceding information.
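
As a quick check of the alternative notation, the following calculated field uses \d in place of [0-9] and should return the same phone numbers; apart from that substitution, the pattern is identical to the one used in the exercise:

    REGEXP_EXTRACT([String of Data (String of Data)], '(\(\d{3}\)-\d{3}-\d{4})')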

After reviewing this exercise, you may be curious about how to return just the email address. According to http://www.regular-expressions.info/email.html, the regular expression for email addresses adhering to the RFC 5322 standard is as follows:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Emails do not always adhere to RFC 5322 standards, so additional work may be required to truly clean email address data.

Although I won't attempt a detailed explanation of this code, you can read all about it at http://www.regular-expressions.info/email.html, which is a great resource for learning more about regular expressions. Also, YouTube has several helpful regular expression tutorials.
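
If you only need a pragmatic extraction rather than full RFC 5322 validation, a much simpler pattern often suffices. The following is a sketch using a common simplification of my own choosing; it will miss some exotic but technically valid addresses:

    REGEXP_EXTRACT([String of Data (String of Data)], '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})')

For most data cleaning work, a readable pattern like this, verified against your actual data, beats the full RFC expression.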

The final output for this exercise should probably be used to enhance existing source data. Data dumps such as this example do not belong in data warehouses; however, even important and necessary data can be hidden in such dumps, and Tableau can be effectively used to extract it.

Summary

We began this chapter with a discussion of the Tableau data-handling engine. This illustrated the flexibility Tableau provides in working with data. The data-handling engine is important to understand in order to ensure that your data mining efforts are intelligently focused. Otherwise, your effort may be wasted on activities not relevant to Tableau.

Next, we discussed data mining and knowledge discovery process models, with an emphasis on CRISP-DM. The purpose of this discussion was to get an appropriate bird's-eye view of the scope of the entire data mining effort. Tableau authors (and certainly end users) can become so focused on the reporting produced in the deployment phase that they end up forgetting or short-changing the other phases, particularly data preparation.

Our last focus in this chapter was on the phase that can be the most time-consuming and labor-intensive, namely data preparation. We considered using Tableau for surveying and also cleaning data. The data cleaning capabilities represented by the regular expression functions are particularly intriguing, and are worth further investigation.

Having completed our first data-centric discussion, we'll continue with Chapter 3, Tableau Prep Builder, looking at one of the newer features Tableau has brought to the market. Tableau Prep Builder is a dedicated data pre-processing interface that is able to reduce the amount of time you need for pre-processing even more. We'll take a look at cleaning, merging, filtering, joins, and the other functionality Tableau Prep Builder has to offer.


Key benefits

  • Master new features in Tableau 2021 to solve real-world analytics challenges
  • Perform geo-spatial, time series, and self-service analytics using real-life examples
  • Build and publish dashboards and explore storytelling using Python and R integration support

Description

Tableau is one of the leading business intelligence (BI) tools that can help you solve data analysis challenges. With this book, you will master Tableau's features and offerings in various paradigms of the BI domain. Updated with fresh topics including Quick Level of Detail expressions, the newest Tableau Server features, Einstein Discovery, and more, this book covers essential Tableau concepts and advanced functionalities. Leveraging Tableau Hyper files and using Prep Builder, you’ll be able to perform data preparation and handling easily. You’ll gear up to perform complex joins, spatial joins, unions, and data blending tasks using practical examples. Next, you’ll learn how to execute data densification and further explore expert-level examples to help you with calculations, mapping, and visual design using Tableau extensions. You’ll also learn about improving dashboard performance, connecting to Tableau Server and understanding data visualization with examples. Finally, you'll cover advanced use cases such as self-service analysis, time series analysis, and geo-spatial analysis, and connect Tableau to Python and R to implement programming functionalities within it. By the end of this Tableau book, you’ll have mastered the advanced offerings of Tableau 2021 and be able to tackle common and advanced challenges in the BI domain.

Who is this book for?

This book is designed for business analysts, business intelligence professionals and data analysts who want to master Tableau to solve a range of data science and business intelligence problems. The book is ideal if you have a good understanding of Tableau and want to take your skills to the next level.

What you will learn

  • Get up to speed with various Tableau components
  • Master data preparation techniques using Tableau Prep Builder
  • Discover how to use Tableau to create a PowerPoint-like presentation
  • Understand different Tableau visualization techniques and dashboard designs
  • Interact with the Tableau server to understand its architecture and functionalities
  • Study advanced visualizations and dashboard creation techniques
  • Brush up on powerful self-service analytics, time series analytics, and geo-spatial analytics

Product Details

Publication date: May 31, 2021
Length: 792 pages
Edition: 3rd
Language: English
ISBN-13: 9781800561649
Vendor: Salesforce

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!

  • Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
  • 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
  • Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
  • Thousands of reference materials covering every tech concept you need to stay up to date.

Subscribe now
View plans & pricing




Table of Contents

17 Chapters

1. Getting Up to Speed – A Review of the Basics
2. All About Data – Getting Your Data Ready
3. Tableau Prep Builder
4. All About Data – Joins, Blends, and Data Structures
5. Table Calculations
6. All About Data – Data Densification, Cubes, and Big Data
7. Level of Detail Calculations
8. Beyond the Basic Chart Types
9. Mapping
10. Tableau for Presentations
11. Visualization Best Practices and Dashboard Design
12. Advanced Analytics
13. Improving Performance
14. Interacting with Tableau Server/Online
15. Programming Tool Integration
16. Another Book You May Enjoy
17. Index

Customer reviews

Rating distribution: 4.2 (29 Ratings)

5 star: 65.5%
4 star: 10.3%
3 star: 6.9%
2 star: 10.3%
1 star: 6.9%

Top Reviews
paresh, Dec 13, 2021 (5 stars)
This book is informative, but for 2000 Rs I guess it does not deserve the price, based on page quality. If you have a Kindle, get the Kindle version.
Amazon Verified review

Kriti, Jul 01, 2021 (5 stars)
I started this book and I am very happy with the contents of this book. Every chapter is explained in a very detailed and very interactive way. There are visuals as to how to proceed with whatever topic we are learning. A great way to learn and easy to follow.
Amazon Verified review

Aditya Pant, Jul 01, 2021 (5 stars)
This book is indeed the best one when it comes to introducing the concepts of visual analytics with rich user experiences and dynamic dashboards with advanced BI techniques. By the end of this book, I feel confident in building my skill-set by implementing the powerful features pertaining to Tableau 2021.
Amazon Verified review

N, Aug 08, 2021 (5 stars)
It can be overwhelming trying to keep up with the rapid pace of innovation that Tableau puts out every year. This book is a great resource for someone who's trying to get back into the swing of things, but it's also excellent for beginners who have no context for previous editions of Tableau Desktop. There are some good core foundational concepts, like the difference between row, aggregate, and table-level calculations, that will always be part of the Tableau environment and that are critical to understanding how to build out a dashboard environment. Overall an excellent reference book, with some information for newer tools like Tableau Prep that are also useful to understand.
Amazon Verified review

S_van, Jun 13, 2021 (5 stars)
Starts with fundamentals and then takes a deep dive into complex topics like: 💥 Advanced Analytics 💥 Improving Performance 💥 Tableau for Presentations 💥 Data Densification, Cubes and Big Data, and many more. It's a fantastic book, well written and with a lot of examples. The best part of this book is that you can follow along with all the examples via the downloads; this way you can learn the concepts better. I have just started and am thoroughly enjoying it; cannot wait to read the book in its entirety.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online; this includes exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use towards owning content.

How can I cancel my subscription?

To cancel your subscription, simply go to the account page, found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From there you will see the 'cancel subscription' button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle (a month starting from the day of subscription payment). You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage, subscription.packtpub.com, by clicking on the 'my library' dropdown and selecting 'credits'.

What happens if an Early Access course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready. We created Early Access as a means of giving you the information you need as soon as it's available. As we go through the process of developing a course, 99% of it can be ready, but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.