Dataclasses
Since Python 3.7, dataclasses let us define ordinary objects with a clean syntax for specifying attributes. They look – superficially – very similar to named tuples. This is a pleasant approach that makes it easy to understand how they work.
Here's a dataclass version of our Stock example:
>>> from dataclasses import dataclass
>>> @dataclass
... class Stock:
...     symbol: str
...     current: float
...     high: float
...     low: float
For this case, the definition is nearly identical to the NamedTuple definition.
The dataclass function is applied as a class decorator, using the @ operator. We encountered decorators in Chapter 6, Abstract Base Classes and Operator Overloading. We'll dig into them deeply in Chapter 11, Common Design Patterns. This class definition syntax isn't much less verbose than an ordinary class with __init__(), but it gives us access to several additional dataclass features.
It's important to recognize that the names are provided at the class level, but are not actually creating class-level attributes. The class-level names are used to build several methods, including the __init__() method; each instance will have the expected attributes. The decorator transforms what we wrote into the more complex definition of a class with the expected attributes and the matching parameters to __init__().
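One way to see this transformation is to inspect the class after the decorator has run. The following is a minimal sketch that repeats the Stock definition so it runs on its own; the variable names field_names, init_parameters, has_class_attribute, and has_instance_attribute are only for illustration.

```python
import inspect
from dataclasses import dataclass, fields


@dataclass
class Stock:
    symbol: str
    current: float
    high: float
    low: float


# The decorator records one Field object per annotated name.
field_names = [f.name for f in fields(Stock)]

# The generated __init__() accepts the same names, in definition order.
init_parameters = list(inspect.signature(Stock).parameters)

# A bare annotation with no default does not create a class attribute...
has_class_attribute = hasattr(Stock, "symbol")

# ...but __init__() sets the attribute on each instance.
s = Stock("AAPL", 123.52, 137.98, 53.15)
has_instance_attribute = hasattr(s, "symbol")
```

Here, has_class_attribute is False and has_instance_attribute is True, which matches the description above: the annotations describe the instances rather than the class itself.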
Because dataclass objects can be stateful, mutable objects, there are a number of extra features available. We'll start with some basics. Here's an example of creating an instance of the Stock dataclass:
>>> s = Stock("AAPL", 123.52, 137.98, 53.15)
Once instantiated, the Stock object can be used like any ordinary class. You can access and update attributes as follows:
>>> s
Stock(symbol='AAPL', current=123.52, high=137.98, low=53.15)
>>> s.current
123.52
>>> s.current = 122.25
>>> s
Stock(symbol='AAPL', current=122.25, high=137.98, low=53.15)
As with other objects, we can add attributes beyond those formally declared as part of the dataclass. This isn't always the best idea, but it's supported because this is an ordinary mutable object:
>>> s.unexpected_attribute = 'allowed'
>>> s.unexpected_attribute
'allowed'
Adding attributes isn't available for frozen dataclasses, which we'll talk about later in this section.
At first glance, it seems like dataclasses don't give many benefits over an ordinary class definition with an appropriate constructor. Here's an ordinary class that's similar to the dataclass:
>>> class StockOrdinary:
...     def __init__(
...         self, name: str, current: float, high: float, low: float
...     ) -> None:
...         self.name = name
...         self.current = current
...         self.high = high
...         self.low = low
>>> s_ord = StockOrdinary("AAPL", 123.52, 137.98, 53.15)
One obvious benefit to a dataclass is we only need to state the attribute names once, saving the repetition in the __init__() parameters and body. But wait, that's not all! The dataclass also provides a much more useful string representation than we get from the implicit superclass, object. By default, dataclasses also include an equality comparison. This can be turned off in the cases where it doesn't make sense. The following example shows how the manually built class behaves without these dataclass features:
>>> s_ord
<__main__.StockOrdinary object at 0x7fb833c63f10>
>>> s_ord_2 = StockOrdinary("AAPL", 123.52, 137.98, 53.15)
>>> s_ord == s_ord_2
False
The class built manually has an awful default representation, and the lack of an equality test can make life difficult. We'd prefer the behavior of the Stock class defined as a dataclass:
>>> stock2 = Stock(symbol='AAPL', current=122.25, high=137.98, low=53.15)
>>> s == stock2
True
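Turning off the generated equality comparison is done with the eq=False parameter to the decorator. The following is a small sketch; the StockIdentity class name is hypothetical and only serves to show the effect.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class StockIdentity:
    symbol: str
    current: float


a = StockIdentity("AAPL", 123.52)
b = StockIdentity("AAPL", 123.52)

# With eq=False, no __eq__() is generated, so the comparison falls back
# to the identity test inherited from object.
same_values_equal = a == b
same_object_equal = a == a
```

Two distinct objects with identical attribute values now compare as unequal, while an object still compares as equal to itself. The helpful string representation is still generated.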
Class definitions decorated with @dataclass also have many other useful features. For example, you can specify a default value for the attributes of a dataclass. Perhaps the market is currently closed and you don't know what the values for the day are:
@dataclass
class StockDefaults:
    name: str
    current: float = 0.0
    high: float = 0.0
    low: float = 0.0
You can construct this class with just the stock name; the rest of the values will take on the defaults. But you can still specify values if you prefer, as follows:
>>> StockDefaults("GOOG")
StockDefaults(name='GOOG', current=0.0, high=0.0, low=0.0)
>>> StockDefaults("GOOG", 1826.77, 1847.20, 1013.54)
StockDefaults(name='GOOG', current=1826.77, high=1847.2, low=1013.54)
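Simple immutable defaults like 0.0 work as shown. A mutable default, such as a list, is rejected by the decorator; the dataclasses module provides field(default_factory=...) for that case. The following sketch assumes a hypothetical StockHistory class with a prices attribute, neither of which is part of the running example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StockHistory:
    name: str
    current: float = 0.0
    # default_factory is called once per instance, so each object
    # gets its own new list instead of sharing one.
    prices: List[float] = field(default_factory=list)


goog = StockHistory("GOOG")
aapl = StockHistory("AAPL")
goog.prices.append(1826.77)

# A plain mutable default is refused when the class is defined.
try:

    @dataclass
    class StockHistoryBroken:
        name: str
        prices: List[float] = []

    mutable_default_rejected = False
except ValueError:
    mutable_default_rejected = True
```

After this runs, goog.prices holds one value and aapl.prices is still empty, because the two instances do not share a list.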
We saw earlier that dataclasses support equality comparison by default. If all the attributes compare as equal, then the dataclass objects as a whole also compare as equal. By default, dataclasses do not support other comparisons, such as less than or greater than, and they can't be sorted. However, you can easily add comparisons if you wish, demonstrated as follows:
@dataclass(order=True)
class StockOrdered:
    name: str
    current: float = 0.0
    high: float = 0.0
    low: float = 0.0
It's okay to ask "Is that all that's needed?" The answer is yes. The order=True parameter to the decorator leads to the creation of the ordering special methods: __lt__(), __le__(), __gt__(), and __ge__(). This change gives us the opportunity to sort and compare the instances of this class. It works like this:
>>> stock_ordered1 = StockOrdered("GOOG", 1826.77, 1847.20, 1013.54)
>>> stock_ordered2 = StockOrdered("GOOG")
>>> stock_ordered3 = StockOrdered("GOOG", 1728.28, high=1733.18, low=1666.33)
>>> stock_ordered1 < stock_ordered2
False
>>> stock_ordered1 > stock_ordered2
True
>>> from pprint import pprint
>>> pprint(sorted([stock_ordered1, stock_ordered2, stock_ordered3]))
[StockOrdered(name='GOOG', current=0.0, high=0.0, low=0.0),
 StockOrdered(name='GOOG', current=1728.28, high=1733.18, low=1666.33),
 StockOrdered(name='GOOG', current=1826.77, high=1847.2, low=1013.54)]
When the dataclass decorator receives the order=True argument, it will, by default, compare the values based on each of the attributes in the order they were defined. So, in this case, it first compares the name attribute values of the two objects. If those are the same, it compares the current attribute values. If those are also the same, it will move on to high and will even include low if all the other attributes are equal. The rules follow the definition of a tuple: the order of definition is the order of comparison.
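The tuple-like rule can be checked directly with dataclasses.astuple(), which converts an instance into a tuple of its field values in definition order. This is a small sketch that repeats the StockOrdered definition so it runs on its own; the variable names are illustrative.

```python
from dataclasses import astuple, dataclass


@dataclass(order=True)
class StockOrdered:
    name: str
    current: float = 0.0
    high: float = 0.0
    low: float = 0.0


a = StockOrdered("GOOG", 1826.77, 1847.20, 1013.54)
b = StockOrdered("GOOG", 1728.28, high=1733.18, low=1666.33)

# The names are equal, so the comparison is decided by current.
object_comparison = a < b

# Comparing the equivalent tuples gives the same answer.
tuple_comparison = astuple(a) < astuple(b)
```

Both comparisons are False here: the names match, and the current value of a is larger than that of b, so the later fields are never consulted.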
Another interesting feature of dataclasses is frozen=True. This creates a class whose instances are immutable, similar to a typing.NamedTuple. There are some differences in what we get as features. A named tuple can be compared and sorted like any other tuple, so we'd need to use @dataclass(frozen=True, order=True) to create comparable structures. This leads to a question of "Which is better?", which – of course – depends on the details of a given use case. We haven't explored all of the optional features of dataclasses, like initialization-only fields and the __post_init__() method. Some applications don't need all of these features, and a simple NamedTuple may be adequate.
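As a rough illustration of a frozen dataclass, consider the following sketch. The StockFrozen class name is hypothetical; FrozenInstanceError and replace() are both provided by the dataclasses module.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True, order=True)
class StockFrozen:
    symbol: str
    current: float = 0.0
    high: float = 0.0
    low: float = 0.0


s = StockFrozen("AAPL", 123.52, 137.98, 53.15)

# Assigning to an attribute of a frozen instance raises an exception.
try:
    s.current = 122.25
    assignment_blocked = False
except FrozenInstanceError:
    assignment_blocked = True

# replace() builds a new instance with some field values changed.
updated = replace(s, current=122.25)

# Frozen dataclasses with the default eq=True are hashable, so they
# can be used as dictionary keys or set members.
labels = {s: "original", updated: "updated"}
```

The original object is untouched, and the changed value lives in a separate object. Because order=True was also supplied, updated < s is True, just as it would be for the equivalent tuples.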
There are a few other approaches. Outside the standard library, packages like attrs, pydantic, and marshmallow provide attribute definition capabilities that are – in some ways – similar to dataclasses. Other packages outside the standard library offer additional features. See https://jackmckew.dev/dataclasses-vs-attrs-vs-pydantic.html for a comparison.
We've looked at two ways to create unique classes with specific attribute values, named tuples and dataclasses. It's often easier to start with dataclasses and add specialized methods. This can save us a bit of programming because some of the basics, like initialization, comparison, and string representations, are handled elegantly for us.
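As a sketch of what adding specialized methods might look like, the following class starts from the generated basics and adds a property and a method. The StockWithRange name, the trading_range property, and the update() method are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StockWithRange:
    symbol: str
    current: float
    high: float
    low: float

    @property
    def trading_range(self) -> float:
        """The spread between the high and low prices."""
        return self.high - self.low

    def update(self, price: float) -> None:
        """Record a new price, widening high or low when needed."""
        self.current = price
        self.high = max(self.high, price)
        self.low = min(self.low, price)


s = StockWithRange("AAPL", 123.52, 137.98, 53.15)
s.update(140.0)
```

The __init__(), __repr__(), and __eq__() methods still come from the decorator; only the behavior specific to this class had to be written by hand.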
It's time to look at Python's built-in generic collections, dict, list, and set. We'll start by exploring dictionaries.