You're reading from Getting Started with DuckDB A practical guide for accelerating your data science, data analytics, and data engineering workflows

Product type Paperback

Published in Jun 2024

Publisher Packt

ISBN-13 9781803241005

Length 382 pages

Edition 1st Edition

Languages

Python

Concepts

Data Analysis

Authors (2):

Ned Letcher

Simon Aubury

View More author details

Table of Contents (15) Chapters

Preface

1. Chapter 1: An Introduction to DuckDB

2. Chapter 2: Loading Data into DuckDB FREE CHAPTER

3. Chapter 3: Data Manipulation with DuckDB

4. Chapter 4: DuckDB Operations and Performance

5. Chapter 5: DuckDB Extensions

6. Chapter 6: Semi-Structured Data Manipulation

7. Chapter 7: Setting up the DuckDB Python Client

8. Chapter 8: Exploring DuckDB’s Python API

9. Chapter 9: Exploring DuckDB’s R API

10. Chapter 10: Using DuckDB Effectively

11. Chapter 11: Hands-On Exploratory Data Analysis with DuckDB

12. Chapter 12: DuckDB – The Wider Pond

13. Index

Why subscribe?

14. Other Books You May Enjoy

DuckDB deployment options and installation

In this section, we’ll look at ways you can use and integrate DuckDB into your analytical workflows, operational infrastructure, and data products. We’ll start by outlining the different clients available for DuckDB, and then we’ll go through how to get started working with the DuckDB command-line interface (CLI) on your own machine.

DuckDB deployment options

The data technology landscape is a big place, full of a diverse range of data practitioners with different skill sets and a wide range of tools built on a heterogeneous set of technologies. To cater to this diversity, DuckDB is made available via a number of different official client APIs, covering users of Python, R, JavaScript, Rust, Swift, Julia, Java, C, and C++. In the case of JavaScript, there are two clients: one for Node.js, oriented around backend applications, and one for Wasm, which allows DuckDB to run entirely within any modern web browser. Additionally, DuckDB is also made available as a cross-platform CLI, which is distributed as an executable that can be run virtually anywhere you have access to a command line. At the time of writing, there are also community-supported clients for Rust, Go, C#, Ruby, Common Lisp, Zig, and Crystal. As this list is ever-growing, we encourage you to consult the DuckDB documentation for an up-to-date list of official DuckDB clients and third-party contributed clients: https://duckdb.org/docs/api/overview.

This large selection of clients and integrations provides good coverage over languages and runtimes that meet the needs of a wide range of data practitioners. Here are the clients we’ll cover in this book:

The DuckDB CLI is a cross-platform executable that gives you a CLI for interacting with DuckDB databases via SQL. It’s compiled without any external dependencies, meaning that you can run it virtually anywhere you have access to a terminal. Its ease of installation and portability make it a great way to get started with DuckDB, which is why we will be making use of it for many of the hands-on examples in this book. We’ll walk through how to install it and get started using it shortly.
The DuckDB Python client enables Python processes to readily communicate with DuckDB databases. It offers a number of distinct APIs for querying and interacting with DuckDB, making it suitable for a range of applications, spanning the spectrum of interactive data analysis to building data infrastructure and analytical data products. In Chapter 7, we introduce the DuckDB Python client, focusing in particular on how to connect to DuckDB databases. In Chapter 8, we go on a deeper dive into the DuckDB Python client, focusing on two of the APIs that it exposes: the DuckDB-specific Relational API and the Python Database API (DB-API). In Chapter 11, we’ll use the Python client again, this time to perform EDA of a dataset in a Jupyter Notebook. Finally, in Chapter 12’s Alternative DuckDB query interfaces section, we touch on DuckDB’s Spark API—another component of the DuckDB Python client—which enables interacting with DuckDB databases using PySpark queries.
The DuckDB R client enables R sessions to connect to and work with DuckDB databases. The client provides support for connecting to DuckDB via the R database interface, as well as dbplyr integration, meaning that R users can query DuckDB databases using the powerful and popular dplyr interface, as an alternative to SQL. Along with DuckDB’s core features, this makes DuckDB a powerful component of an R analytical toolchain. We cover all this in Chapter 9, where we go on a tour of the DuckDB R client.
The DuckDB Wasm client is a full DuckDB client that has been compiled to run on Wasm, which is a virtual machine (VM) that runs on all modern browsers. With DuckDB Wasm, you can build web apps in JavaScript that can interact with DuckDB databases, running completely on client devices. This opens up a range of compelling possibilities for building lean analytical data apps with low-latency response times. In Chapter 12’s DuckDB-powered data apps section, we’ll unpack these possibilities further, identifying contexts where you might want to consider adopting DuckDB Wasm for building data apps. We’ll also cover using the DuckDB Web shell, a DuckDB CLI that runs completely within the browser, which you can try out online: https://shell.duckdb.org.
DuckDB provides clients for both the Open Database Connectivity (ODBC) API and the Java Database Connectivity (JDBC) API. These are both important workhorses in the data ecosystem, being frequently used to connect analytical data applications, such as BI tools, to databases for querying. We discuss both these DuckDB integration targets in the DuckDB integration section in Chapter 12.
The DuckDB Arrow Database Connectivity (ADBC) client provides an implementation of the ADBC API. This plays an analogous role to the JDBC and ODBC clients, enabling applications to connect to DuckDB databases as a data source, with the notable difference that the ADBC client makes use of Apache Arrow, an in-memory columnar data format. This is a much better fit for analytical applications, addressing the inefficiencies associated with the use of the OLTP-oriented JDBC and ODBC APIs. We’ll discuss the DuckDB ADBC client in Chapter 12’s DuckDB integration section.

Next, we’ll get the DuckDB CLI installed so that we’re ready to dive into some hands-on DuckDB examples.

Installing the DuckDB CLI

The DuckDB CLI is made available for Windows, macOS, and Linux operating systems. For up-to-date installation options and instructions, go to the DuckDB installation page and ensure that you have the appropriate operating system for your machine selected: https://duckdb.org/docs/installation. You can choose between using a package manager to install DuckDB onto your system, or directly downloading a ZIP file that contains the DuckDB CLI executable. If you haven’t used the package manager indicated in the instructions for your operating system, we suggest you take the direct download approach. Please proceed now to either install DuckDB using the package manager or download the DuckDB CLI ZIP file. If downloading directly, once the file has downloaded, unzip it and extract the DuckDB executable file into the directory you plan to work in.

Which version of DuckDB?

In this book, the output you’ll see comes from the most recent version of DuckDB available at the time of writing (April 2024), which is 0.10.2. DuckDB Labs has indicated that the intention is for no new features to be added between this release and the 1.0.0 release, with the focus being on stability and robustness. We therefore recommend that you use the most recent version of DuckDB to work through these exercises. You may see some minor differences in the output of commands and error messages with later versions; however, this is unlikely to interfere with the exercises. If you do come across any unexpected behavior, as always, the DuckDB documentation should be considered the source of truth: https://duckdb.org/docs. We will endeavor to keep the code in the GitHub repository for this book up to date in the event that there are breaking changes: https://github.com/PacktPublishing/Getting-Started-with-DuckDB.

Starting the DuckDB CLI

Since we’re working with a CLI, you’ll need to open a terminal application. You may have a preferred terminal application you already use, or you can simply use the one that comes with your operating system. Note that for Windows, you can use either PowerShell or the cmd.exe application. We’ll refer to the terminal application you’ve chosen as your terminal from here on.

Start by loading up your terminal and changing directory to the one you plan to work in. If you downloaded the DuckDB executable, this should be in the same directory as the one you placed the unzipped DuckDB executable in. Then, run one of the following commands appropriate for your context to start the DuckDB CLI.

Run the following command if you installed DuckDB using a package manager or if you are using Windows’ cmd.exe application for your terminal application and downloaded the executable:

$ duckdb

If you are using PowerShell on Windows and downloaded the executable, run the following command:

$ .\duckdb

For macOS and Linux terminal applications and where you downloaded the executable directly, run this command:

$ ./duckdb

Note which alternative you used, as you may need to adapt subsequent duckdb executable invocations accordingly. Your terminal should now look something like this:

v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

This indicates that you’re now inside the DuckDB shell, which gives you an interactive read-eval-print loop (REPL), ready for you to start entering commands and interacting with DuckDB.

Working with the DuckDB CLI

Within the DuckDB shell, the D symbol indicates a waiting prompt, meaning that it’s ready to accept input. In the next section, we’ll go through some SQL basics and run through several SQL statements for inserting and querying data. For now, let’s run a very simple query that retrieves a made-up record of values that we have specified within the query:

SELECT 'hello' AS greeting, 'world' AS name;

This query produces the following output:

┌──────────┬─────────┐
│ greeting │  name   │
│ varchar  │ varchar │
├──────────┼─────────┤
│ hello    │ world   │
└──────────┴─────────┘

After running the query, you will immediately see the resulting record, rendered as a table with a single row and corresponding column headers that indicate both the name and data type of each column.

You may have noticed that when we opened the shell, DuckDB informed us that it was connected to a transient in-memory database. This means that when you close the shell, the database being used by DuckDB, along with any data it contains, won’t be persisted anywhere. For many applications, this is perfectly fine. Many ad hoc data analysis or transformation activities can be performed within a single session, with the final results being written to disk in an appropriate file format. As you continue exploring DuckDB use cases, you will discover contexts where it will be useful to persist cleaned and prepared tables to disk so that they can be reused across subsequent sessions. For these contexts, DuckDB supports opening a database as a persistent disk-based file. We can achieve this in two ways using the DuckDB CLI.

One is via the command line, by passing the path to an existing DuckDB database file as an argument to the duckdb executable when you load the DuckDB CLI. To try this out, first, make sure you exit any running DuckDB CLI shell by executing the .quit command. Then enter the following command in your terminal:

$ ./duckdb my_db.duckdb

The other way to open a persistent file-based database is via an already running DuckDB CLI shell, using the .open CLI command followed by the path to your desired database file:

.open my_db.duckdb

In both cases, this will result in an open connection to a disk-based DuckDB database stored in the my_db.duckdb file. If this file already exists on disk, DuckDB will load and start using it. If the file does not already exist, DuckDB will first create it before connecting to it. When you close the shell, any outstanding changes to the database will be written safely to the file.

The .open command is just one of a number of special dot commands available in DuckDB’s shell. For example, the .mode command changes the formatting used to display tables that are returned after running a query. This is useful for quickly copying and pasting query results in other formats, such as CSV and JSON. For example, let’s render the output of the query we ran previously in an HTML table. To do this, first change the output mode to html and then rerun the query:

.mode html
SELECT 'hello' AS greeting, 'world' AS name;

We now see the same result but formatted as an HTML table:

<TR><TH>greeting</TH>
<TH>name</TH>
</TR>
<TR><TD>hello</TD>
<TD>world</TD>
</TR>

We can also change DuckDB’s output mode at the command line. By combining this feature with DuckDB’s ability to pass SQL commands to the duckdb executable, we can start to see a glimpse of some of the versatility DuckDB has to offer:

$ ./duckdb -markdown -s "SELECT 'hello' AS greeting, 'world' AS name;"

Running this command on your terminal results in the same output, this time formatted in markdown:

| greeting | name  |
|----------|-------|
| hello    | world |

To see the available command-line parameters the duckdb executable supports, we can pass it the --help flag:

$ ./duckdb --help

For a complete list of output modes DuckDB supports, as well as a reference for other available dot commands, see the DuckDB CLI documentation: https://duckdb.org/docs/api/cli.

Now that we’ve got the DuckDB CLI set up and have seen how to work with it, we’re ready to start our adventures getting started with DuckDB. In the next section, we provide a very brief introduction to working with SQL, via the DuckDB CLI. If you’re already well versed in writing SQL, you’re welcome to skim through or skip this section. For those of you newer to working with SQL, or if you haven’t used it for a while, this will make sure that you’re across some fundamentals we’ll be assuming as we dive into exploring DuckDB’s features.

The rest of the chapter is locked