Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Getting Started with DuckDB

You're reading from   Getting Started with DuckDB A practical guide for accelerating your data science, data analytics, and data engineering workflows

Arrow left icon
Product type Paperback
Published in Jun 2024
Publisher Packt
ISBN-13 9781803241005
Length 382 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (2):
Arrow left icon
Ned Letcher Ned Letcher
Author Profile Icon Ned Letcher
Ned Letcher
Simon Aubury Simon Aubury
Author Profile Icon Simon Aubury
Simon Aubury
Arrow right icon
View More author details
Toc

Table of Contents (15) Chapters Close

Preface 1. Chapter 1: An Introduction to DuckDB 2. Chapter 2: Loading Data into DuckDB FREE CHAPTER 3. Chapter 3: Data Manipulation with DuckDB 4. Chapter 4: DuckDB Operations and Performance 5. Chapter 5: DuckDB Extensions 6. Chapter 6: Semi-Structured Data Manipulation 7. Chapter 7: Setting up the DuckDB Python Client 8. Chapter 8: Exploring DuckDB’s Python API 9. Chapter 9: Exploring DuckDB’s R API 10. Chapter 10: Using DuckDB Effectively 11. Chapter 11: Hands-On Exploratory Data Analysis with DuckDB 12. Chapter 12: DuckDB – The Wider Pond 13. Index 14. Other Books You May Enjoy

Why use DuckDB?

So, why might you want to use DuckDB? Let’s start by zooming all the way out. As a data practitioner, there are two broad contexts where you might find yourself getting excited about leveraging DuckDB:

  1. Using DuckDB to scale and supercharge data science, data analytics, and ad hoc data-wrangling workflows.
  2. Using DuckDB as a building block to build operational data engineering infrastructure and interactive data products.

The first of these is likely to be of interest to data practitioners with analytical workflows, such as data scientists, data analysts, and machine learning engineers, whereas the second is more likely to be relevant to data engineers and machine learning engineers building data infrastructure, as well as software engineers building user-facing data products. In this book, we’ll be focusing more on using DuckDB to supercharge analytical workflows; however, if you’re looking to use DuckDB for building operational data infrastructure and data products, this book will still be a great starting point to get you up to speed with DuckDB’s capabilities that make it well suited to these kinds of applications.

In this section, we’ll first go through some use cases that land in DuckDB’s sweet spot, before looking at DuckDB’s features that make it especially well-suited to these applications. We’ll finish up by discussing contexts where other tools may be more appropriate.

DuckDB use cases

DuckDB is an incredibly versatile tool for analytical data processing and management, so any attempt to describe its full range of potential applications will almost certainly be incomplete. To give you a sense of the flavor of possible applications, we’ll go through a range of use cases for DuckDB across the two broad categories mentioned previously: analytical workflows and building operational data infrastructure and products.

Supporting analytical workflows

A major component of the workflows of data scientists and data analysts is activities that involve processing often quite large datasets, from cleaning data, transforming data into the right shape, structured data modeling, running statistical algorithms, and training machine learning models. If you talk to a practitioner who has been in the trenches for a while, they will likely tell you that sometimes they feel like they’re spending more time fighting the tools they use for these tasks than they are being productive. Often the size of the data is a limiting factor, with many popular data processing tools, such as pandas dataframes in Python and dataframes in R, simply not being able to handle the size of target datasets within the memory of your workstation or, if they can, taking a frustrating amount of time to process.

Once you’ve hit the limits of your local machine, conventional wisdom is that you need to take your workload to a distributed data compute framework such as Apache Spark, Dask, or Ray, or perhaps ingest your data into a cloud data warehouse or a data lake, where a distributed SQL query engine such as Google BigQuery, Trino, or Amazon Athena can be used to run queries at scale. These solutions significantly increase the complexity of your workflows, requiring complex supporting infrastructure that must be managed and maintained, with a hefty price tag often associated with such managed services. If you’re lucky enough to have access to these tools, they still come with additional challenges, such as working with unfamiliar or constrained interfaces, and when things go wrong, you may often find yourself having to debug arcane and confusing stack traces from the underlying compute engine.

This is where DuckDB can come to the rescue, offering the simplicity of an in-process tool, with a familiar SQL interface (as well as non-SQL interfaces if you prefer) that is optimized for running complex OLAP queries over large datasets. Not only is DuckDB blazingly fast, but it is also able to handle out-of-core workloads (datasets that don’t fit into memory), enabling you to scale your workflows on a single machine much further before you need to consider more complex distributed data processing solutions.

In recent times, there have been developments in dataframe libraries that help address the performance limitations of tools such as pandas dataframes and R dataframes, such as Dask and Modin, which allow you to perform simple parallelization of dataframe operations across your CPU cores, as well as providing on-ramps to run the same queries across a distributed cluster. We also have dataframe libraries such as Polars and Vaex, which are built on top of Apache Arrow, providing more efficient memory utilization, parallelization, and the ability to handle some out-of-core workloads. These innovations in the data ecosystem are pleasing to see; however, these tools are still ultimately dataframe tools, focusing primarily on querying and data transformation—they do not give you the data management features of a DBMS.

By virtue of being a fully-fledged DBMS, DuckDB provides a range of affordances that data practitioners may not realize they’re missing from their current analytical processing workflows:

  • DuckDB provides transactional guarantees through ACID properties (atomicity, consistency, isolation, and durability), meaning that you don’t have to worry about corrupted data if your Python or R process crashes midway through a job.
  • Data integrity guarantees can be enabled through the specification of constraints that enforce properties over data inserted into tables. DuckDB allows you to specify PRIMARY KEY constraints, which enforce uniqueness across rows within a table, FOREIGN KEY constraints, which enforce referential integrity for relationships across tables, and NOT NULL constraints over column values. DuckDB also provides the ability to apply arbitrary CHECK constraints to column values in the form of Boolean expressions, such as ensuring that string identifiers only contain alphanumeric characters.
  • While you can use DuckDB as an entirely in-memory database, its database can also be persisted to disk and used across processes, even allowing multiple processes to read concurrently. This enables workflows and consuming patterns that dataframe libraries cannot readily support on their own.
  • DuckDB also includes a rich suite of data integrations, with an eye toward performance. Notable examples include optimized CSV, Parquet, and JSON loaders, which can read files in parallel, the ability to read Hive-partitioned files, and the PostgreSQL, MySQL, and SQLite extensions, which allow DuckDB to query directly from source tables in external databases, rather than having to rely on bulk imports that must be periodically refreshed.

When a data team starts to hit the limits of their existing tooling, whether due to missing data management features or insufficient performance, it’s common for the team to start building out their own bespoke tools and packages. Since this kind of custom tooling is typically not core to the value the team is providing, these resources can suffer from defects due to insufficient resources being able to be dedicated to their development. Using a well-maintained and tested DBMS that is optimized for analytical workloads removes the busywork that is associated with maintaining tooling that doesn’t represent your core value proposition.

DuckDB’s powerful feature set makes it a versatile tool for a range of analytical workflows, whether you’re performing exploratory data analysis (EDA), quickly transforming between common data formats, or building complex data science pipelines. DuckDB enables you to slurp up large datasets from across heterogeneous data sources, with a rich set of features for cleaning dirty data and normalizing inconsistent schema, through a simple interface and with blazing performance. DuckDB also has great integrations with familiar analytical tools commonly used in the data ecosystem, allowing you to mix and match DuckDB with complementary tools to assemble your own effective workflows. For example, DuckDB can query directly from and write to pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. It also offers the ability to use alternative query interfaces to SQL that may be more familiar to data scientists and data analysts, such as dplyr in R and Ibis in Python. In addition to being a powerful workhorse for complex analytical queries, all this versatility makes DuckDB a valuable Swiss Army knife that is worth having in your analytical data toolkit.

Finally, data scientists and data analysts often find themselves building custom interactive data apps or dashboards for use as proof of concepts (POCs), bespoke tools that support common workflows, or for publishing internal decision-support tools within their organization. Powerful open source dashboarding tools such as R Shiny, Streamlit, and Plotly Dash streamline the development of such data apps; however, they typically leave the integration of a data source up to the developer. DuckDB is a perfect complement to these tools, offering simple in-process integration with no external dependencies and enabling fast analytical querying performance, which is important for low-latency response times that improve the user experience of your data apps. We unpack this particular application of DuckDB further in the DuckDB-powered data apps section in Chapter 12.

Building data infrastructure

While much of the explosive growth and excitement that DuckDB has seen has been driven by folks adopting it for the types of analytical workflows we have just discussed, there is another area of application that is starting to show increased amounts of activity and demand for, which draws upon similar themes of doing more with less and simplifying and streamlining workloads. This sees DuckDB being used as a building block in modern data infrastructure for use cases that involve small-to-medium data rather than truly big data, as well as use cases that require low-latency responsiveness for consumer-facing interactive data apps. Common to these applications is a shift away from the paradigm of moving your compute to the data, which is seen as the conventional wisdom for effectively working with big data, and a move toward bringing your data to the compute. For smaller data workloads, this can be faster, more efficient, and cheaper to build and maintain.

Much of the development in modern data processing technologies has been dominated by the needs of hyperscale organizations, with scale-out tools such as MapReduce, Hadoop, and Apache Spark, as well as cloud data warehouses such as Snowflake and BigQuery, dominating the landscape. Most organizations, however, do not operate at hyperscale, and oftentimes data processing needs are quite moderate in comparison. The cleaned and enriched datasets that drive the modern data-informed business—providing business intelligence (BI) across sales, marketing, growth, and product innovation—tend not to reach the petabyte scale. There is an opportunity for data teams in many organizations to adopt leaner data architectures that are optimized for more moderate data workloads and that come with the benefits of reduced complexity and much lower total cost of ownership.

DuckDB’s performance characteristics make it well placed to be a core building block in such architectures. Some examples include the following:

  • Using DuckDB to perform transformations in extract, transform, and load (ETL) pipelines as an alternative to tools such as Apache Spark. Compute instances can be spun up on demand, invoking DuckDB to pull down and transform data.
  • For data lake contexts, where structured and semi-structured data has been landed in object storage, DuckDB can be used as a lightweight alternative to distributed SQL query engines, which data teams might otherwise reach for, such as Google BigQuery, Amazon Athena, Trino, and Dremio.
  • For some scenarios, DuckDB also offers the potential to replace the use of cloud data warehouses such as Snowflake or OLAP engines such as ClickHouse, where utilization of these powerful resources would be low. If your organization is only consuming a handful of data sources to produce conformed tables that drive a small number of reporting use cases, then using DuckDB to build small, targeted data cubes may well be sufficient for your needs.

Some folks have already started to roll their own solutions for adopting these architectures. See, for example, the Modern Data Stack in a Box with DuckDB post by Jacob Matson (https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html), which explores the use of open source tools to create an end-to-end lightweight data stack, with DuckDB at its core. Another post, Build a poor man’s data lake from scratch with DuckDB, by Pete Hunt and Sandy Ryza (https://dagster.io/blog/duckdb-data-lake), explores using DuckDB as a SQL query engine on top of a data lake. Meanwhile, there are also companies emerging that are oriented around offering hosted platforms that provide serverless analytics platforms driven by DuckDB, the most notable example being MotherDuck (https://motherduck.com).

Another area where traditional scale-out approaches to data processing have shown to be not always fit for purpose is around interactive data applications, such as BI dashboards and bespoke data apps. In such applications, low-latency query results in response to user interaction are crucial for supporting dynamic and ad hoc workloads with a positive user experience. However, most cloud data warehouses and distributed data processing engines are simply not able to provide the low-latency response times required for these types of workloads and must be augmented with different types of pre-aggregation and caching strategies, often in the form of separate service, which further increases complexity and architectural surface area. DuckDB’s blazing fast speeds over analytical workloads make it a compelling choice for being the backing query engine for interactive data applications. For example, the hosted BI service Mode recently switched to using DuckDB as their in-memory query engine in order to improve the speed of queries (https://mode.com/blog/how-we-switched-in-memory-data-engine-to-duck-db-to-boost-visual-data-exploration-speed). Hex and Observable are two hosted data analytics notebook services offering rich visualizations and interactivity that both recently added DuckDB integration to supercharge users’ workflows. Another notable example is Rill Data Developer, an open source tool for building dashboards, which is built on DuckDB to provide rapid response times for queries.

The use of DuckDB as a building block for data infrastructure and interactive data applications is a notable emerging trend and one we think is worth paying attention to. In the next section, we’ll further unpack the features of DuckDB that serve to make it appealing for both analytical workflows and building operational data infrastructure and data products.

DuckDB features

You may find yourself asking, what makes DuckDB so well suited to scaling analytical workflows and being used as a building block in data infrastructure? Here are some key features of DuckDB that have led to it increasingly being adopted by data practitioners.

Performance

DuckDB is optimized for OLAP workloads, making it blazingly fast and efficient for the kinds of queries frequently seen in analytical workflows. It achieves this through a range of design choices and optimizations:

  • As with most modern OLAP engines, DuckDB employs a column-based execution model to enable better performance over operations that are characteristic of analytical workloads. DuckDB uses a highly tuned vectorized query engine that works by processing chunks of columns at a time. Operating on column chunks rather than entire columns means that queries involving multiple operations, which require intermediate aggregations, are less likely to result in out-of-memory errors. The chunks are also tuned to be small enough so that they remain inside the CPU’s ultra-low latency L1 cache—the CPU’s fasted dedicated memory, which is drastically faster than main memory.
  • DuckDB leverages a range of compression algorithms, which exploit similarities in values within columns, to reduce its storage size on disk, which in turn improves read times.
  • DuckDB employs an end-to-end query optimizer. This means that rather than executing queries as they are written, DuckDB can automatically rewrite queries to be much more efficient.
  • Almost all of DuckDB’s operations come with automatic parallelism, allowing it to distribute operations over multiple CPU threads, resulting in reduced processing time.

DuckDB is also able to support out-of-core workloads, where the data to be processed does not fit within available memory. It does this by spilling over into temporary disk storage when memory is exhausted. This does increase processing times due to the slower read times of persistent storage compared to memory; however, this is typically preferable to the query failing outright. These costs can also be mitigated by the selective use of low-latency SSD drives for applications where this is a concern.

Ease of use

The design choice of operating in-process means that users of DuckDB don’t need to concern themselves with installing, maintaining, and authenticating with a standalone database server. Another key design decision of DuckDB was for it not to make use of any third-party dependencies. This makes DuckDB extremely portable across platforms and has also enabled DuckDB to be made available for a wide range of languages and runtimes. This feature of DuckDB has increased its accessibility to a diverse range of consumers, allowing it to be readily incorporated into a wide variety of workflows and tech stacks.

DuckDB also has a strong focus on improving the ergonomics of working with SQL. It has a PostgreSQL-like SQL dialect, making it familiar to many data practitioners, and also includes a wide range of alternative function aliases, matching names used in other popular databases that many practitioners will be familiar with. Notably, DuckDB’s SQL dialect has a range of enhancements designed to improve productivity when writing analytical SQL queries. Some of these include the following:

  • Automatic casting of data types where possible, which serves to simplify SQL queries.
  • Simple creation of LIST and STRUCT data types using literal values.
  • Accessing attributes of STRUCT data types using dot notation.
  • Simple string and list slicing syntax similar to Python.
  • The ability to define anonymous lambda functions within SQL queries that can be used for transforming and filtering lists.
  • List comprehension syntax similar to Python’s, enabling more convenient list transformation and filtering.
  • Improved ergonomics for column selection using SELECT *, with the EXCLUDE and REPLACE clauses, which allow you to leverage wildcard selection for more concise queries, while still being able to exclude or transform specific columns. This pattern-matching power can be further enhanced with the COLUMNS() expression, which allows you to select columns that match a regular expression or the output of a lambda function.
  • Column aliases that can be reused by subsequent columns in the same SELECT statement, as well as in WHERE and ORDER BY clauses.
  • The ability to start SELECT statements with its FROM clause, allowing you to improve the readability of queries by frontloading the data source. Additionally, omitting the SELECT clause entirely is interpreted as having specified SELECT *, making this common query pattern more concise.
  • Function-call chaining within SQL queries, similar to familiar data processing APIs such as that of pandas, which is designed around method-call chaining.
  • Trailing commas in SQL queries.

We’ll cover some of these features in Chapter 10. For a more comprehensive treatment of the friendly SQL enhancements that DuckDB’s SQL dialect provides, see the DuckDB documentation: https://duckdb.org/docs/guides/sql_features/friendly_sql.

Versatility

DuckDB comes enabled with a range of affordances that make it an incredibly versatile data processing and management tool. In terms of data sources, DuckDB can read and write data formats that are the mainstays of the data world: Parquet, CSV, and JSON (including newline-delimited JSON). In addition to reading from disk, these files can be read from remotely hosted files, and DuckDB can both read and write cloud object storage services using the Simple Storage Service (S3) API. DuckDB is also able to directly connect to and query from multiple databases at a time, including PostgreSQL, MySQL, and SQLite databases.

DuckDB also has tight integrations with in-memory data formats that are commonly used in the analytical data ecosystem, including pandas and Polars dataframes and R dataframes, as well as Apache Arrow tables. DuckDB provides the ability to query directly from these in-memory data structures, as well as export to them.

DuckDB’s rich set of official clients also means that it can be used in a range of contexts beyond Python- and R-based workflows. Of particular note here is the DuckDB WebAssembly (Wasm) client, which enables developers to readily integrate DuckDB into web apps that can be published to the internet for anyone to access with a browser.

Together, all these capabilities make DuckDB a veritable data Swiss Army knife for working with analytical data, opening up many interesting applications that we have only just scratched the surface of.

Powerful analytics capabilities

DuckDB offers extensive support for composing complex queries through SQL, with a particular emphasis on features commonly used in analytical workloads. DuckDB has support for optimized aggregation and join operations, column indexes, window functions, and full-text search. DuckDB supports a wide range of functions for working with different types of data, including numeric operations, date and time operations, date formatting, text operations, bit strings, and string pattern matching, including regular expressions. Analytical workloads are further enabled by SQL commands such as SAMPLE, which provides convenient probabilistic sampling of datasets, the PIVOT command for creating pivot tables, ASOF joins for interpolating missing values when joining time series data, and the QUALIFY clause, for convenient filtering of window-functions results.

DuckDB also has a rich array of data types, which, in addition to those you’d expect, such as numeric, date, and text types, also includes handy types such as INTERVAL for date ranges, ENUM for enumerations, and powerful nested data types for holding multiple values, which include ARRAY, LIST, STRUCT, MAP, and UNION. DuckDB also offers support for analytical applications that involve working with geospatial data, using its spatial extension.

As this book was nearing completion, DuckDB released a vector similarity search extension, which enables using DuckDB’s ARRAY data type for machine learning and data science applications involving vector embeddings and semantic similarity search.

Engaged community

DuckDB is open sourced under the permissive MIT license, making it readily adoptable and extensible for a wide range of commercial and non-commercial applications. The community that has formed around DuckDB has generated many valuable open source extensions, integrations, and alternative DuckDB clients for different languages and runtimes. This rich ecosystem of DuckDB-oriented projects is indicative of and has been a powerful catalyst for the enthusiasm behind DuckDB.

When is DuckDB not the right tool?

As we have already discussed, when it comes to databases, there is no one-size-fits-all solution. This means that DuckDB won’t always be the right tool for the job.

Since it is optimized for analytical workloads running on a single machine, DuckDB has been intentionally designed to perform well under a specific set of access patterns, which you should confirm are acceptable for your use case. If your workloads correspond more to the OLTP paradigm, with many individual record transactions, including frequent writes, then DuckDB, which is optimized for OLAP workloads consisting mostly of read access and column-oriented queries, is likely not the best tool for you. If you’re after an in-process DBMS that’s optimized for OLTP workloads, then SQLite is hard to beat. Another specific consideration is that in order to open a DuckDB database that can be written to, only one process can both read from and write to the database. If you need multiple processes to be able to concurrently read from a DuckDB database, you must load it in read-only mode.

DuckDB’s performance is truly impressive, allowing you to push the limits of what a single machine can do, arguably redefining what actually constitutes big data for analytical workloads. However, it is ultimately a database that operates in-process on a single machine, and so there are limits to how far it can be pushed. If your target workloads involve regularly processing petabyte-scale datasets, then you will likely need to use some form of distributed data processing-enabled platform.

Now that we’ve established what DuckDB is and when it makes sense to use it, we’ll start to get more practical by looking at DuckDB’s deployment options and how you can get started working with it.

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image