Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Data Engineering with dbt

You're reading from   Data Engineering with dbt A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

Arrow left icon
Product type Paperback
Published in Jun 2023
Publisher Packt
ISBN-13 9781803246284
Length 578 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Roberto Zagni Roberto Zagni
Author Profile Icon Roberto Zagni
Roberto Zagni
Arrow right icon
View More author details
Toc

Table of Contents (21) Chapters Close

Preface 1. Part 1: The Foundations of Data Engineering
2. Chapter 1: The Basics of SQL to Transform Data FREE CHAPTER 3. Chapter 2: Setting Up Your dbt Cloud Development Environment 4. Chapter 3: Data Modeling for Data Engineering 5. Chapter 4: Analytics Engineering as the New Core of Data Engineering 6. Chapter 5: Transforming Data with dbt 7. Part 2: Agile Data Engineering with dbt
8. Chapter 6: Writing Maintainable Code 9. Chapter 7: Working with Dimensional Data 10. Chapter 8: Delivering Consistency in Your Data 11. Chapter 9: Delivering Reliability in Your Data 12. Chapter 10: Agile Development 13. Chapter 11: Team Collaboration 14. Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms
15. Chapter 12: Deployment, Execution, and Documentation Automation 16. Chapter 13: Moving Beyond the Basics 17. Chapter 14: Enhancing Software Quality 18. Chapter 15: Patterns for Frequent Use Cases 19. Index 20. Other Books You May Enjoy

Querying data in SQL – syntax and operators

In this section, you will learn how to query existing data using SQL and the Snowflake example database.

In this chapter, and more generally in this book, you will get familiar with Snowflake syntax, which is modern and very standard. We will also use some proprietary extensions that make your work easier.

Snowflake query syntax

Snowflake supports querying data with the standard SELECT statement, which has the following basic syntax:

WITH …
SELECT …
FROM …
JOIN …
WHERE …
GROUP BY …
HAVING …
QUALIFY …
ORDER BY …
LIMIT …

The only mandatory part is select, so SELECT 1 is a valid query that just returns the value 1.

If you are familiar with SQL from other database systems, you will wonder what the QUALIFY clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. It is described later in this section.

We often use the terms query, command, and statement interchangeably when referring to some piece of SQL that you can execute.

Properly speaking, a command is a generic command such as SELECT or CREATE <object>, while a statement is one specific and complete instance of a command that can be run, such as SELECT 1 or CREATE TABLE my_table …;.

The term query should really only refer to SELECT statements, as SELECT statements are used to query data from the database, but query is often used with any statement that has to do with data.

You will also often hear the term clause used, such as the FROM clause or GROUP BY clause. Informally, you can think about it as a piece of a statement that follows the syntax and rules of that specific keyword.

The WITH clause

The WITH clause is optional and can only precede the SELECT command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT statement, which can be used later in the main SELECT statement as any other table-like object.

Defining a CTE is useful for the following:

  • Clarity: You can provide an informative name to a piece of SQL
  • Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
  • Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
  • Break a complex transformation into smaller, simpler steps that are easier to code and understand and limit the scope of future maintenance

The simplified syntax is as follows:

WITH [RECURSIVE]
  <cte1_name> AS (SELECT …)
  [, <cte2_name> AS (SELECT …)]
SELECT …

We will check out some examples here:

  1. Check whether we still have high-priority orders pending:
    WITH
    high_prio_orders as (
      SELECT *
      FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
      WHERE O_ORDERPRIORITY IN ('1-URGENT', '2-HIGH')
    )
    SELECT count(*)
    FROM high_prio_orders
    WHERE O_ORDERDATE < '1998-01-01'
      and O_ORDERSTATUS = 'O';

Here, you can see that the first expression encapsulates the business definition of high-priority order to be any order with the priority set to urgent or high priority. Without the CTE, you have to mix the business definition and other filtering logic in the WITH clause. Then, it would be unclear whether the status is part of the definition or is just a filter that we are applying now.

  1. Calculate some metrics for customers in the auto industry:
      WITH
      auto_customer_key as (
        SELECT C_CUSTKEY 
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
      WHERE C_MKTSEGMENT = 'AUTOMOBILE'
    ),
    orders_by_auto_customer as (
      SELECT O_ORDERKEY
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
      WHERE O_CUSTKEY in (SELECT * FROM auto_customer_key)
    ),
    metrics as (
      SELECT 'customers' as metric, count(*) as value
      FROM auto_customer
      UNION ALL
      SELECT 'orders by these customers', count(*)
      FROM orders_by_auto_customer
    )
    SELECT * FROM metrics;

In this example, the CTEs and the final query remain short and very simple. You can start to appreciate that the clear labeling of intentions and the simple pieces of SQL make the full query easy to understand for anyone.

The SELECT clause

The SELECT command can appear in two possible forms: as the central clause of a SELECT statement or used as a clause in other statements. In both cases, it defines the set of columns and calculated values returned by the statement.

The simplified syntax of a SELECT clause in Snowflake is a list of column definitions separated by commas, with the optional distinct keyword to omit duplicates:

SELECT [DISTINCT]
  <column_definition_1> [, <column_definition_2> …]

The SELECT clause is used to provide the list of columns that the query will return.

For each column definition that is provided, it is also possible to provide a column alias, which will be used to identify that column definition in the results.

When two column definitions have the same name, an alias must be provided for at least one of the two to avoid a name clash in the results.

A column definition can be one of these four expressions:

<column_definition> =
   [object.*]         -- all columns in the object
  |[object.col_name]  -- the named column in the object
  |[object.$n]        -- the n-th column in the object
  |[<expression>]     -- the value of the expression
 AS <col_alias>

Let’s describe in detail these four expressions:

  1. The star symbol, *, indicates all possible columns; it means “all columns from all tables” if it is applied without an alias or “all the columns from the aliased object” if applied with an alias.
    • Star without an alias will return all columns from both tables:
      SELECT * FROM table_1, table_2
    • Star with an alias (or table name) will return all columns from the object with the alias. In this case, it will return all columns from table_1, as it is applied to its alias:
      SELECT t1.* FROM table_1 as t1, table_2 as t2
    • Snowflake has introduced two powerful extensions to the Star Syntax: EXCLUDE to remove some columns from being returned and RENAME to rename a column while selecting it, as shown in this example:
      SELECT * EXCLUDE (c3, c2) RENAME (c4 as cx, c5 as cy)
      FROM table_1
  2. A column name, optionally with an alias indicating the object the column comes from and/or an alias to use for the column in the results:
    SELECT
        O_ORDERKEY,
        ord.O_CUSTKEY,
        cust.C_NAME as CUSTOMER_NAME
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS as ord
    JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER as cust
        ON cust.C_CUSTKEY = ord.O_CUSTKEY;

Note that here, the designers of the TPCH database have gone a long way to keep column names unique between all tables, by adding the table initials as a prefix to column names. This is a pretty old style as it makes it more difficult to recognize fields that contain the same values as C_CUSTKEY and O_CUSTKEY.

Looking at our example query, we can see the following:

  • Object aliases are optional if the name of the column is unique in the tables referenced in the query, as it is for all columns in the TPCH database.
  • Aliases can in any case be used for clarity, like with cust.C_NAME.
  • Object aliases are mandatory when referencing a column name that appears in more than one referenced table. This would have been the case if both fields had been just CUSTKEY, without the C_ and O_ table prefixes.
  • Lastly, a column alias can always be used, even just to rename a column with an initial name that we do not like or is not clear, as in the case of a name that would not be clear, so we rename it to CUSTOMER_NAME.
  1. A column number after a dollar sign, such as $1 or $3.

This allows us to reference columns that do not have a name, such as when reading out of a CSV file without a header. This is also useful, especially in scripts, to reference columns that we do not know the name of, but we know their position:

SELECT $1 as ORDER_KEY, $2 as CUST_KEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;

In this case, we read from a table that has column names, and we can always use column numbers instead of column names. But this is rarely done when reading from a table or a view because by using the numbers, the result will depend on the order of the columns in the object, which can change over time if the object is recreated.

  1. An expression, like a mathematical expression or a function call, that evaluates to some value for each row:
    SELECT
        P_PARTKEY
        , UPPER(P_NAME) as P_NAME
        , P_RETAILPRICE
        , P_RETAILPRICE * 0.9 as P_DISCOUNTED_PRICE
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART";

The preceding example shows two expressions: a mathematical expression that multiplies a column by a constant to produce a new column and a function call that converts the p_name column to uppercase, keeping the same column name.

Important note

The ability highlighted here, to write arbitrary expressions and assign a name to the result, is at the core of how SQL is used to transform data and is one of the most common things that we will do with dbt.

In this section, we have seen that the SELECT clause allows us four great abilities:

  1. To decide what data from the source tables to keep in the result of a query
  2. To calculate new data from source data using arbitrary complex expressions
  3. To provide a new name for the data that we want in the query result
  4. To keep only one copy of each distinct row in the result, using the DISTINCT keyword

The FROM clause

The FROM clause introduces the table objects used in a SQL statement as the source of data or target of the command.

In the case of a SELECT statement, the FROM clause can list none, one, or more table objects, each with its own alias. The rows in the result of the SELECT statement will potentially have all the columns from all the tables referenced in the FROM clause.

The simplified syntax looks like this:

SELECT …
FROM [tableObject1 [AS alias1]]
   [, tableObject2 [AS alias2] …]

Table objects, in the context of a SELECT statement, are a combination of the following:

  • Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the FROM clause.
  • Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones.
  • A VALUES clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input.
  • A few other objects that can be read from, such as a LATERAL subquery or a staged file.

When no table is listed, then the columns in the SELECT clause must use constants and global objects such as the current_date function, as in the following example:

SELECT 1 + 1 as sum, current_date as today;

When we use the VALUES clause, we can define an inline table, like in this example:

SELECT * FROM ( VALUES
  ('IT', 'ITA', 'Italy')
 ,('US', 'USA', 'United States of America')
 ,('SF', 'FIN', 'Finland (Suomi)')
  as inline_table (code_2, code_3, country_name)
);

When the FROM clause lists more than one table object, the result is the Cartesian product of the elements in these tables. To avoid a Cartesian explosion, it is important to include a WHERE clause that restricts the combinations, keeping only the ones we want.

Important note

The Cartesian product is the result of pairing the rows of all the objects in the product in all possible ways. This means that the number of rows returned by the Cartesian product is the product of the number of rows in each object in the product.

The expression Cartesian explosion is often used to refer to the fact that the number of rows returned by the Cartesian product increases very quickly and can easily create performance issues.

The following query, thanks to the WHERE clause, will generate a more meaningful and compact set of rows, ideally equal to the number of rows that exist in the LINEITEM table:

SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM" as l
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"   as o
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" as c
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART"     as p
WHERE o.O_ORDERKEY = l.L_ORDERKEY
  and c.C_CUSTKEY  = o.O_CUSTKEY
  and p.P_PARTKEY  = l.L_PARTKEY
;

In the previous query, we have used a WHERE clause to keep in the results only the rows that are really related. We achieve this by using equivalence constraints between the fields of the related tables.

The previous query generates a result of exactly 6,001,215 rows, which is the number of rows in the LINEITEM table, instead of the astronomical number resulting from the multiplication of the number of rows: 6,001,215 x 1,500,000 x 150,000 x 200,000, which is pretty much 2,7 x 10^23.

I’ll let you calculate the exact result…

The consequence is that the full query processes 6 million rows and runs in a few seconds, while the query without the WHERE clause will try to process 10^18 times more rows, taking very many hours or days to complete, in the process using up a lot of resources for nothing, as the resulting data provides nothing of interest.

The exact correspondence in the count derives from the fact that for each line item, there is only one order, for each order only one customer, and so on.

Providing the relevant WHERE clause will help with reducing the result by many orders of magnitude even if the keys don’t match so precisely. We will deal with this topic in more detail in Chapter 3, where we talk about data modeling.

This example illustrates well the power of SQL and why it is important to express what we want correctly.

Tip

As a basic rule, never do an unrestricted Cartesian product unless you know what you are doing and you are fine with the exploded number of rows that you will generate.

The JOIN clause

JOIN is a subclause of the FROM clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN cannot exist outside of a FROM clause.

The simplified syntax of JOIN looks like this:

SELECT …
FROM tableObject1 AS to1
[<join type>] JOIN tableObject2 as to2 ON <condition_A>
[[<join type>] JOIN tableObject3 as to3 ON <condition_B>]
…

The condition in each join is a Boolean expression, often using columns from the base to1 table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants.

It is normal to chain more than one JOIN inside a FROM clause to pick data from multiple tables at once, and you generally do not need to think too much about the order of the joins, but it is good to know that the order might matter. In the case of Snowflake, the order is from left to right (think of the statement written on one row).

We will look in more detail into the JOIN clause in the next section, where we will introduce the different types of joins and some examples.

We have seen that we can combine data from multiple tables, listing them in the FROM clause, and that we can use the WHERE clause to restrict the resulting Cartesian product to the combined rows that we actually want to keep. This use of FROM and WHERE is equivalent to the INNER JOIN construct, which we will see in the Combining data in SQL – the JOIN clause section.

The WHERE clause

The WHERE clause specifies the conditions that the rows involved in the command need to match.

The rows for which the expression evaluates to true are kept/processed.

In the context of a SELECT statement, the WHERE clause limits the result of the query to the subset of rows that verify the overall condition set by the WHERE clause.

The simplified syntax is as follows:

SELECT …
WHERE <predicate>

<predicate> can be as simple as the literal true or false or a complex expression, including logical operators and database functions that evaluate to a Boolean value (true, false, or NULL) for every row that the query processes.

Tip

An expression that returns a Boolean value is called a predicate.

The following example returns only the rows where the order total is greater than 500,000:

SELECT O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 500000;

The following example does not return any row, but defines two columns with the same name and type as the columns in the source table and two with the name and type that we provided:

SELECT O_ORDERKEY,O_CUSTKEY, 1 as an_int, null::number as a_num
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE false;

This may be a good trick when you need to define some column names and types. Remember that you do not need a FROM clause if you do not want to use columns from an existing table…

Important note

Be careful when NULL is involved as it might not behave as you expect.

As an example, the NULL = NULL expression evaluates to NULL.

In a WHERE clause, it means that the corresponding row is discarded.

We have seen in a previous example using the FROM clause that the WHERE clause can be used to specify some types of JOIN. We suggest using the JOIN … ON … syntax for joins and the WHERE clause to filter rows. We will look at more examples in the Combining data in SQL - the JOIN clause section.

The GROUP BY clause

A GROUP BY clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression.

The simplified syntax is as follows:

SELECT …
GROUP BY groupExpr1 [, groupExpr2 …]

The group by expression can be one of the following:

  • Column name: The result of the expression is the value of the column with the given name. All rows with the same value in the column are grouped.

The following query calculates the sum of all orders by each customer (customer key):

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY O_CUSTKEY;
  • Number: The value to group by is the value of the column in the given position in the result of the SELECT statement. The value is evaluated at the end of the statement, after applying any function or expression.

The following query is the same as the previous one, using the ordinal 1 instead of the O_CUSTKEY column name to indicate the column to group by:

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY 1;
  • SQL expression: Any expression using any combination of data from the query. The value to group by is the result of the expression.

The following query calculates the total orders by year, using the year() function, which returns the year from a date:

SELECT YEAR(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE);

We have also added the ORDER BY clause with the same expression as the GROUP BY so that we get the output in a nice order. This is a pretty common pattern.

When more than one group by expression is provided, the rows are grouped together according to the values of all the group by expressions.

In the previous example, you could have used GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE) to have the order total split by month (and year), like in the following:

SELECT YEAR(O_ORDERDATE),MONTH(O_ORDERDATE),sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

The HAVING clause

The HAVING clause filters the rows resulting from a GROUP BY clause according to a predicate:

SELECT …
GROUP BY …
HAVING <predicate>

The predicate is an expression that returns a Boolean value and references:

  • Constants
  • Expressions that appear in the GROUP BY clause
  • Aggregated functions that can be calculated according to the GROUP BY clause

As an example, we could extend the example from the GROUP BY topic to look at the months where we have less than 10,000 orders.

The following query does it:

SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
HAVING count(*) < 10000
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

In this query, we have used the count() aggregate function to count how many lines, one for each order, there are in one group.

Tip

The HAVING clause is for GROUP BY what the WHERE clause is for the FROM clause.

We will look at the third filtering clause, QUALIFY, in the next topic.

The QUALIFY clause

The QUALIFY clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by.

We will dedicate the last section of this chapter to window functions.

The QUALIFY clause is an optional SQL clause that proves extremely useful in analytical workloads, so it is implemented in Snowflake.

To recap a bit of what we have seen, these are the three clauses that are used to filter data:

  • WHERE applies to the data being read from the sources specified in the FROM / JOIN clauses
  • HAVING applies to the data that has been grouped by a GROUP BY clause
  • QUALIFY applies to the data that has been calculated by a window function

One of the most common usages of the QUALIFY clause together with the simple row_number() window function is to defend against undesired duplicates or select one specific row with respect to other rows that represent different versions of the same object:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
QUALIFY row_number()
            over(partition by L_ORDERKEY, L_LINENUMBER
                 order by L_COMMITDATE desc ) = 1;

This query selects all fields from a table, just filtering out undesired rows for which the row_number window function returns a value different from 1.

The row_number window function, assigns a progressive number from 1 onward to all the rows in the same window, following the row order, and then restarts from 1 with the next window.

Picking the rows where it is equal to 1 means keeping only the first row for each window.

The windows are defined by the over(…) clause, which comprises a partition by part that defines the windows, in a way similar to group by, and an order by that provides the ordering in the window. We will look in more detail at this in the section devoted to windows functions.

Tip

When using the QUALIFY clause, as shown in the previous query, we avoid the need to create a column with the result of the row_number window function and then use a subquery to filter on it. That is the normal way of doing this kind of filtering in a database that does not support the QUALIFY clause. QUALIFY is much simpler to read and use.

Now, let’s extend the order example from the previous section to select out the “good months,” when we have higher total sales than the average sales for the year they belong to:

WITH
monthly_totals as (
  SELECT
    YEAR(O_ORDERDATE) as year,
    MONTH(O_ORDERDATE) as month,
    sum(O_TOTALPRICE) as month_tot
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
  GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
)
SELECT year, month, month_tot
       ,avg(month_tot) over(partition by YEAR) as year_avg
FROM monthly_totals
QUALIFY month_tot > year_avg
ORDER BY YEAR, MONTH;

Here, we have put together a few of the clauses that we have seen so far, using a WITH clause to define a CTE named monthly_totals with our previous query to calculate monthly totals.

We have then defined a query that reads from the CTE and uses the avg window function to calculate the average monthly sales for each month of the full year that a particular month belongs to. Then, we use the QUALIFY clause to keep only the rows where the monthly total is greater than the yearly average.

Tip

The previous example illustrates why the window functions are also called analytical functions.

Important note

Please note that the window function calculates a value for each row using the defined windows, not changing the number of rows. In contrast, using the avg aggregate function with a GROUP BY clause would have reduced the number of rows to one per group.

The ORDER BY clause

The ORDER BY clause specifies in which order the query or window function lays out the rows.

The simplified syntax is as follows:

SELECT …
ORDER BY orderExpr1 [ASC|DESC] [NULLS FIRST|LAST] [, orderExpr2 …]

Each order by expression can be made out of three parts:

  1. One expression that identifies what to order on. It can be either a column alias, a position ordinal, or an expression, as we have seen for the GROUP BY clause. Please refer to that section for more details.
  2. An optional direction for the sorting: ASC for ascending or DESC for descending sort.
  3. An optional specification of how to sort null values: NULLS FIRST or NULLS LAST, which are self-explanatory.

Please look at previous sections for examples of ORDER BY in action, in both queries and window function definitions with over().

The LIMIT/FETCH clause

The LIMIT clause is used to restrict the number of rows returned by the query.

LIMIT and FETCH are synonyms, with slightly different syntax.

Let’s look at the LIMIT syntax:

SELECT …
[ORDER BY …]
LIMIT <count> [OFFSET <start>]

The count parameter is a number that specifies the maximum number of rows to return.

If the OFFSET part is present, the returned rows are the ones after the start position. This allows tools connecting to a database to retrieve all the results in chunks of the desired size. As an example, a LIMIT 10 OFFSET 10 clause would retrieve rows from 11 to 20.

Important note

If no ORDER BY clause is present, the order of the rows is undefined and could differ in each execution of the same query. In this case, the result of a LIMIT clause is non-deterministic because what rows are returned depends on the order in which the rows happen in the result set.

Query clause order of evaluation

In the previous sections, we have seen all the clauses that can appear in a SELECT statement.

Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:

  1. FROM and its JOIN subclause, which are used to identify the source data for the query.
  2. The WHERE clause, which is used to filter out the source data that we do not want.

This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE whenever possible to just bring in the data you need.

  1. The GROUP BY clause, which groups the source data left after applying the WHERE clause and calculates the aggregate functions on the grouped data.
  2. The HAVING clause, which filters on the results of GROUP BY.
  3. Partitioning of the windows and calculation of the window functions.
  4. The QUALIFY clause, which filters on the results of the window functions.
  5. The DISTINCT keyword, if applied to the SELECT clause, which removes duplicated rows.
  6. The ORDER BY clause, which puts the resulting rows in the desired order.
  7. The LIMIT clause, which caps the rows returned by the query to the desired amount.

SQL operators

When writing queries, we can perform operations on the data handled by the query.

We do so by building expressions that return the desired value, using functions and operators.

We can perform an operation pretty much everywhere a value is expected: in the SELECT clause to provide the desired outputs by transforming the inputs, in the WHERE clause or ON part of a JOIN clause, HAVING clause, or QUALIFY clause to identify what should or should not be returned by the query, in GROUP BY to decide how to aggregate, and so on.

Let’s go through the categories of operators and how they are used:

  • Arithmetic operators: These are the traditional +, -, *, /, and % (modulo).

They expect one or more numeric (or convertible to numeric) inputs to provide a numeric result, with the usual arithmetic precedence and meaning, like in the following example:

SELECT 1 + '2' as three, (3+2) * 4 as twenty
WHERE twenty % 2 = 0;

Note that '2' is a string but can be automatically converted to the number 2. Also note that implicit conversions happen, but explicit conversions are better.

The modulo operator returns the remainder of the division of the first operator by the second, and the val % 2 = 0 pattern is often used to identify even numbers.

  • Comparison operators: These are used to test two values for equality or other comparisons, and are = (equal), != (not equal), <> (also not equal), < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal).

They are typically used in the WHERE clause, but can be used anywhere a Boolean result is desired, such as in the following example:

SELECT 2 < 1 as nope, '3' != 'three' as yep
WHERE 1 != 2;

Note that the operators can compare all types where the operation is defined.

I have seen WHERE 1 = 2 used in many places to avoid returning any row, because 1=2 always returns FALSE; it would be clearer to write WHERE false directly and if you do not need to filter any row out, you can just leave the WHERE clause off, or use the WHERE true expression.

  • Logical operators: These are the traditional AND, OR, and NOT Boolean operators.

They operate only on Boolean predicates and values and return Booleans.

They are generally used in the WHERE clause, but can appear anywhere an expression is allowed, such as in the following example:

SELECT *,
  (C_ACCTBAL > 7500) AND (C_NATIONKEY = 24) as IS_TOP_US_CUST
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE (C_NAME IS NOT null) AND IS_TOP_US_CUST;

In the previous example, we have defined an IS_TOP_US_CUST column using a logical operator to combine two predicates, as we have done in the WHERE clause, also reusing the new Boolean column we created.

  • Set operators: These operators are used to put together the results from multiple queries, and are INTERSECT, MINUS or EXCEPT, and UNION [ALL].

The simplified syntax is as follows:

query_1 <set_operator> query_2

The queries must be compatible, having the same number of columns and of the same type.

It is also important that the semantics of the columns in the same position are correct, as combining people’s names with state codes is possible, as they are both strings, but in general, it does not make much sense.

Let’s describe the set operators:

  • INTERSECT returns the rows that appear in both queries, checking all columns to have the same value
  • MINUS or EXCEPT returns the rows from the first query that do not appear in the second
  • UNION [ALL] returns the rows from both queries, with ALL keeping duplicates

As an example, the following query returns data for the customers from India (8) and the US (24) and the customers in the AUTOMOBILE segment without duplicates:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (8, 24)
UNION
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_MKTSEGMENT = 'AUTOMOBILE';

In this special case, as both the queries are on the same table, we could have just used a single query with a slightly more complex WHERE clause using an OR operator to compose the two individual clauses. In real cases, you might want to combine similar data from different tables and the set operators are here for you.

  • Subquery operators: These operators allow us to use subqueries in WHERE clauses.

A subquery is a query defined inside another query.

A subquery can be used without any operator as a table-like object, such as selecting a subset of a table we want to use some data, or an expression if it returns a single value, as in the following example:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY = (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME = 'JAPAN'
);

In this example, the subquery returns only one value, so we can use the equal operator.

The subquery operators extend their use to other cases, as per their definitions:

  • ALL/ANY: Allows you to apply a comparison to all/any rows of the subquery
  • [NOT] EXISTS: Returns true if the subquery returns at least one row, false with NOT in front
  • [NOT] IN: Returns true if the expression is not included in the results of the subquery

The following example extends the previous example to the case when we want customers from more than one country, picking the country by name:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME IN ('JAPAN', 'CANADA')
);

We could easily rewrite the same query using = ANY instead of IN, and probably with some changes in the subquery also using EXISTS.

Now that we have been acquainted with the basics of querying data in SQL, let’s dive deeper into the JOIN clause, which allows us to put together data from multiple sources, which is crucial to turning raw data into useful information.

You have been reading a chapter from
Data Engineering with dbt
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781803246284
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image