Docstrings and annotations

This section is about documenting the code in Python, from within the code. Good code is self-explanatory but should also be well-documented. It is a good idea to explain what the code is supposed to do (not how it does it).

One important distinction: documenting the code is not the same as adding comments to it. Comments are bad, and they should be avoided. By documentation, we refer to the fact of explaining the data types, providing examples of them, and annotating the variables.

This is particularly relevant in Python because, being dynamically typed, it is easy to lose track of the values and types of variables or objects across functions and methods. For this reason, stating this information explicitly will make it easier for future readers of the code.

There is another reason that specifically relates to annotations. They can also help in running some automatic checks, such as type hinting, through tools such as Mypy. We will find that, in the end, adding annotations pays off.

Docstrings

In simple terms, docstrings are documentation embedded in the source code. A docstring is a literal string, placed somewhere in the code, with the intention of documenting that part of the logic.

Notice the emphasis on the word documentation. This subtlety is important because it's meant to represent explanation, not justification. Docstrings are not comments; they are documentation.

Having comments in the code is a bad practice for multiple reasons. First, comments represent our failure to express our ideas in the code. If we actually have to explain why or how we are doing something, then that code is probably not good enough. For starters, it fails to be self-explanatory. Second, comments can be misleading. Worse than having to spend time reading a complicated section is reading a comment on how it is supposed to work and then discovering that the code actually does something different. People tend to forget to update comments when they change the code, so the comment next to the line that was just changed will be outdated, resulting in a dangerous misdirection.

Sometimes, on rare occasions, we cannot avoid having comments. Maybe there is an error in a third-party library that we have to circumvent. In those cases, placing a small but descriptive comment might be acceptable.

With docstrings, however, the story is different. Again, they do not represent comments, but the documentation of a particular component (a module, class, method, or function) in the code. Their use is not only accepted but also encouraged. It is a good practice to add docstrings whenever possible.

The reason why they are a good thing to have in the code (or maybe even required, depending on the standards of your project) is that Python is dynamically typed. This means that, for example, a function can take anything as the value for any of its parameters. Python will not enforce, nor check, anything like this. So, imagine that you find a function in the code that you know you will have to modify. You are even lucky enough that the function has a descriptive name, and that its parameters do as well. It might still not be clear what types you should pass to it, and even if it were, how are they expected to be used?

Here is where a good docstring might be of help. Documenting the expected input and output of a function is a good practice that will help the readers of that function understand how it is supposed to work.

Consider this good example from the standard library:

In [1]: dict.update??
Docstring:
D.update([E, ]**F) -> None. Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v
In either case, this is followed by: for k in F: D[k] = F[k]
Type: method_descriptor

Here, the docstring for the update method on dictionaries gives us useful information, and it is telling us that we can use it in different ways:

  1. We can pass something with a .keys() method (for example, another dictionary), and it will update the original dictionary with the keys from the object passed as a parameter:

>>> d = {}
>>> d.update({1: "one", 2: "two"})
>>> d
{1: 'one', 2: 'two'}

  2. We can pass an iterable of pairs of keys and values, and we will unpack them to update:

>>> d.update([(3, "three"), (4, "four")])
>>> d
{1: 'one', 2: 'two', 3: 'three', 4: 'four'}

In any case, the dictionary will be updated with the rest of the keyword arguments passed to it.
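
To illustrate this last point, here is a small continuation of the same session (the values are made up for illustration; this snippet is not part of the quoted docstring):

>>> d.update(five=5)
>>> d
{1: 'one', 2: 'two', 3: 'three', 4: 'four', 'five': 5}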

This information is crucial for someone who has to learn and understand how a new function works, and how they can take advantage of it.

Notice that in the first example, we obtained the docstring of the method by using the double question mark on it (dict.update??). This is a feature of the IPython interactive interpreter: when it is called, it prints the docstring of the object you are inspecting. Now, just as we obtained help from this function of the standard library, imagine how much easier you could make the lives of your readers (the users of your code) if you placed docstrings on the functions you write, so that others could understand their workings in the same way.

The docstring is not something separate or isolated from the code. It becomes part of the code, and you can access it. When an object has a docstring defined, it becomes part of the object via its __doc__ attribute:

>>> def my_function():
...     """Run some computation"""
...     return None
...
>>> my_function.__doc__
'Run some computation'

This means that it is even possible to access it at runtime and even generate or compile documentation from the source code. In fact, there are tools for that. If you run Sphinx, it will create the basic scaffold for the documentation of your project. With the autodoc extension (sphinx.ext.autodoc) in particular, the tool will take the docstrings from the code and place them in the pages that document the function.
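
For illustration, enabling that extension is usually a one-line addition to the conf.py file that Sphinx generates for the project (the exact scaffold depends on your Sphinx setup):

# conf.py -- generated by sphinx-quickstart; autodoc pulls docstrings from the code
extensions = ["sphinx.ext.autodoc"]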

Once you have the tools in place to build the documentation, make it public so that it becomes part of the project itself. For open source projects, you can use Read the Docs, which will generate the documentation automatically per branch or version (configurable). For companies or private projects, you can use the same tools or configure these services on-premises, but regardless of this decision, the important part is that the documentation should be ready and available to all members of the team.

There is, unfortunately, one downside to docstrings: as with all documentation, they require manual and constant maintenance. As the code changes, they will have to be updated. Another problem is that for docstrings to be really useful, they have to be detailed, which requires multiple lines.

Maintaining proper documentation is a software engineering challenge that we cannot escape from. It also makes sense for it to be like this. If you think about it, the reason documentation is written manually is that it is intended to be read by other humans. If it were automated, it would probably not be of much use. For the documentation to be of any value, everyone on the team must agree that it is something that requires manual intervention, hence the effort required. The key is to understand that software is not just about code; the documentation that comes with it is also part of the deliverable. Therefore, when someone is making a change to a function, it is equally important to also update the corresponding part of the documentation, regardless of whether it's a wiki, a user manual, a README file, or several docstrings.

Annotations

PEP-3107 introduced the concept of annotations. The basic idea of them is to hint to the readers of the code about what to expect as values of arguments in functions. The use of the word hint is not casual; annotations enable type hinting, which we will discuss later on in this chapter, after the first introduction to annotations.

Annotations let you specify the expected type of some variables that have been defined. It is actually not only about the types, but any kind of metadata that can help you get a better idea of what that variable actually represents.

Consider the following example:

class Point:
    def __init__(self, lat, long):
        self.lat = lat
        self.long = long


def locate(latitude: float, longitude: float) -> Point:
    """Find an object in the map by its coordinates"""

Here, we use float to indicate the expected types of latitude and longitude. This is merely informative for the reader of the function so that they can get an idea of these expected types. Python will not check these types nor enforce them.

We can also specify the expected type of the returned value of the function. In this case, Point is a user-defined class, so it will mean that whatever is returned will be an instance of Point.

However, types or built-ins are not the only kind of thing we can use as annotations. Basically, everything that is valid in the scope of the current Python interpreter could be placed there. For example, a string explaining the intention of the variable, a callable to be used as a callback or validation function, and so on.
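
For instance, a hypothetical function (made up here for illustration) could carry a plain string as the annotation of one of its parameters, purely as metadata for the reader:

def launch_task(delay_in_seconds: "number of seconds to wait before running"):
    ...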

With the introduction of annotations, a new special attribute is also included, and it is __annotations__. This will give us access to a dictionary that maps the name of the annotations (as keys in the dictionary) with their corresponding values, which are those we have defined for them. In our example, this will look like the following:

>>> locate.__annotations__
{'latitude': <class 'float'>, 'longitude': <class 'float'>, 'return': <class '__main__.Point'>}

We could use this to generate documentation, run validations, or enforce checks in our code if we think we have to.
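
As a sketch of the kind of check we could write ourselves (this helper is hypothetical, not part of the standard library), we could compare the annotations against the actual values received:

def assert_types(function, **kwargs):
    """Check the keyword arguments against the function's annotations.

    Only a sketch: a real implementation would also handle positional
    arguments and non-type annotations.
    """
    for name, value in kwargs.items():
        expected = function.__annotations__.get(name)
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                "{0} must be {1}, got {2}".format(name, expected, type(value))
            )

>>> assert_types(locate, latitude=1.0, longitude="two")  # raises TypeError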

Speaking of checking the code through annotations, this is when PEP-484 comes into play. This PEP specifies the basics of type hinting: the idea of checking the types of our functions via annotations. Just to be clear again, and quoting PEP-484 itself:

"Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention."

The idea of type hinting is to have extra tools (independent from the interpreter) to check and assess the correct use of types throughout the code and to hint to the user in case any incompatibilities are detected. The tool that runs these checks, Mypy, is explained in more detail in a later section, where we will talk about using and configuring the tools for the project. For now, you can think of it as a sort of linter that will check the semantics of the types used on the code. This sometimes helps in finding bugs early on, when the tests and checks are run. For this reason, it is a good idea to configure Mypy on the project and use it at the same level as the rest of the tools for static analysis.

However, type hinting means more than just a tool for checking the types on the code. Starting with Python 3.5, the new typing module was introduced, and this significantly improved how we define the types and the annotations in our Python code.

The basic idea behind this is that now the semantics extend to more meaningful concepts, making it even easier for us (humans) to understand what the code means, or what is expected at a given point. For example, you could have a function that worked with lists or tuples in one of its parameters, and you would have put one of these two types as the annotation, or even a string explaining it. But with this module, it is possible to tell Python that it expects an iterable or a sequence. You can even identify the type or the values on it; for example, that it takes a sequence of integers.
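
As a minimal sketch (the function and its name are made up for illustration), the typing module lets us express "any iterable of integers" directly:

from typing import Iterable

def compute_total(amounts: Iterable[int]) -> int:
    """Sum the amounts, no matter which kind of container they come in."""
    return sum(amounts)

A list, a tuple, or a generator of integers would all satisfy this annotation.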

There is one extra improvement made with regard to annotations: starting from Python 3.6, it is possible to annotate variables directly, not just function parameters and return types. This was introduced in PEP-526, and the idea is that you can declare the types of some variables without necessarily assigning a value to them, as shown in the following listing:

class Point:
    lat: float
    long: float

>>> Point.__annotations__
{'lat': <class 'float'>, 'long': <class 'float'>}

Do annotations replace docstrings?

This is a valid question, since on older versions of Python, long before annotations were introduced, the way to document the types of the parameters of functions or attributes was to put that information in docstrings. There are even some conventions on how to structure docstrings to include the basic information for a function: the type and meaning of each parameter, the type and meaning of the result, and the possible exceptions that the function might raise.

Most of this has been addressed already in a more compact way by means of annotations, so one might wonder if it is really worth having docstrings as well. The answer is yes, and this is because they complement each other.

It is true that a part of the information previously contained in the docstring can now be moved to the annotations. But this should only leave more room for better documentation in the docstring. In particular, for dynamic and nested data types, it is always a good idea to provide examples of the expected data so that we can get a better idea of what we are dealing with.

Consider the following example. Let's say we have a function that expects a dictionary to validate some data:

def data_from_response(response: dict) -> dict:
    if response["status"] != 200:
        raise ValueError
    return {"data": response["payload"]}

Here, we can see a function that takes a dictionary and returns another dictionary. Potentially, it can raise an exception if the value under the key "status" is not the expected one. However, we do not have much more information about it. For example, what does a correct instance of a response object look like? What would an instance of the result look like? To answer both of these questions, it would be a good idea to document examples of the data that is expected to be passed in as a parameter and returned by this function.

Let's see if we can explain this better with the help of a docstring:

def data_from_response(response: dict) -> dict:
    """If the response is OK, return its payload.

    - response: A dict like::

        {
            "status": 200,  # <int>
            "timestamp": "....",  # ISO format string of the current date time
            "payload": { ... }  # dict with the returned data
        }

    - Returns a dictionary like::

        {"data": { .. } }

    - Raises:
        - ValueError if the HTTP status is != 200
    """
    if response["status"] != 200:
        raise ValueError
    return {"data": response["payload"]}

Now, we have a better idea of what is expected to be received and returned by this function. The documentation serves as valuable input, not only for understanding and getting an idea of what is being passed around, but also as a valuable source for unit tests. We can derive data like this to use as input, and we know what would be the correct and incorrect values to use in the tests. Actually, the tests also work as actionable documentation for our code, but this will be explained in more detail later on.

The benefit is that now we know what the possible values of the keys are, as well as their types, and we have a more concrete interpretation of what the data looks like. The cost is that, as we mentioned earlier, it takes up a lot of lines, and it needs to be verbose and detailed to be effective.
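
For instance, the documented examples already suggest test cases; a hypothetical unit test (using the standard unittest module, with made-up values) could look like this:

import unittest

class TestDataFromResponse(unittest.TestCase):
    def test_ok_response_returns_payload(self):
        response = {
            "status": 200,
            "timestamp": "2018-08-01T00:00:00",
            "payload": {"name": "clean code"},
        }
        self.assertEqual(data_from_response(response), {"data": {"name": "clean code"}})

    def test_error_status_raises_value_error(self):
        with self.assertRaises(ValueError):
            data_from_response({"status": 500, "payload": {}})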

Configuring the tools for enforcing basic quality gates

In this section, we will explore how to configure some basic tools and automatically run checks on the code, with the goal of automating part of the repetitive verification work.

This is an important point: remember that code is for us, people, to understand, so only we can determine what is good or bad code. We should invest time in code reviews, thinking about what good code is, and how readable and understandable it is. When looking at code written by a peer, you should ask questions such as the following:

  • Is this code easy to understand and follow for a fellow programmer?
  • Does it speak in terms of the domain of the problem?
  • Would a new person joining the team be able to understand it and work with it effectively?

As we saw previously, code formatting, consistent layout, and proper indentation are required but not sufficient traits to have in a code base. Moreover, as engineers with a high sense of quality, we take these for granted and read and write code far beyond the basic concepts of its layout. Therefore, we are not willing to waste time reviewing these kinds of items; instead, we can invest our time more effectively by looking at actual patterns in the code in order to understand its true meaning and provide valuable results.

All of these checks should be automated. They should be part of the tests or checklist, and this, in turn, should be part of the continuous integration build. If these checks do not pass, make the build fail. This is the only way to actually ensure the continuity of the structure of the code at all times. It also serves as an objective parameter for the team to have as a reference. Instead of some engineers or the team leader repeatedly making the same comments about PEP-8 in code reviews, the build will fail automatically, making it an objective matter.

Type hinting with Mypy

Mypy (http://mypy-lang.org/) is the main tool for optional static type checking in Python. The idea is that, once you install it, it will analyze all of the files in your project, checking for inconsistencies in the use of types. This is useful since, most of the time, it will detect actual bugs early, but sometimes it can give false positives.

You can install it with pip, and it is recommended to include it as a dependency for the project on the setup file:

$ pip install mypy

Once it is installed in the virtual environment, you just have to run mypy (passing it the files or directories to analyze) and it will report all of its findings on the type checks. Try to adhere to its report as much as possible, because most of the time, the insights provided by it help to avoid errors that might otherwise slip into production. However, the tool is not perfect, so if you think it is reporting a false positive, you can ignore that line with the following marker as a comment:

type_to_ignore = "something" # type: ignore
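
As a quick illustration of the kind of inconsistency Mypy reports (a made-up snippet, not from any real project), consider annotating a function and then calling it with the wrong type:

def broadcast(message: str) -> None:
    print(message.upper())

broadcast(42)  # Mypy flags this call: the argument is an int, but a str is expected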

Checking the code with Pylint

There are many tools for checking the structure of the code (basically, compliance with PEP-8) in Python, such as pycodestyle (formerly known as pep8), Flake8, and many more. They are all configurable and are as easy to use as running the command they provide. Among all of them, I have found Pylint to be the most complete (and strict). It is also configurable.

Again, you just have to install it in the virtual environment with pip:

$ pip install pylint

Then, just running the pylint command is enough to check the code.

It is possible to configure Pylint via a configuration file named pylintrc.

In this file, you can decide the rules you would like to enable or disable, and parametrize others (for example, to change the maximum line length).
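
As a sketch of what such a file can contain (the specific values here are only examples), a minimal pylintrc could look like this:

[FORMAT]
max-line-length=90

[MESSAGES CONTROL]
disable=missing-docstring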

Setup for automatic checks

On Unix development environments, the most common way of working is through makefiles. Makefiles are powerful tools that let us configure commands to be run in the project, mostly for compiling, running, and so on. Besides this, we can use a makefile in the root of our project, with some commands configured to run checks of the formatting and conventions on the code, automatically.

A good approach is to have one target for each particular check (type hinting, linting, tests), and then another one that runs them all together. For example:

typehint:
	mypy src/ tests/

test:
	pytest tests/

lint:
	pylint src/ tests/

checklist: lint typehint test

.PHONY: typehint test lint checklist

Here, the command we should run (both in our development machines and in the continuous integration environment builds) is the following:

make checklist

This will run everything in the following steps:

  1. It will first check the compliance with the coding guideline (PEP-8, for instance)
  2. Then it will check for the use of types on the code
  3. Finally, it will run the tests

If any of these steps fail, consider the entire process a failure.

Besides configuring these checks automatically in the build, it is also a good idea if the team adopts a convention and an automatic approach for structuring the code. Tools such as Black (https://github.com/ambv/black) automatically format the code. There are many tools that will edit the code automatically, but the interesting thing about Black is that it does so in a unique way. It's opinionated and deterministic, so the code will always end up arranged in the same way.

For example, with Black, strings will always use double quotes, and the order of the parameters will always follow the same structure. This might sound rigid, but it's the only way to ensure the differences in the code are minimal. If the code always respects the same structure, changes in the code will only show up in pull requests with the actual changes that were made, and no extra cosmetic modifications. It's more restrictive than PEP-8, but it's also convenient because, by formatting the code directly through a tool, we don't have to actually worry about that, and we can focus on the crux of the problem at hand.

At the time of writing this book, the only thing that can be configured is the length of the lines. Everything else is corrected according to the criteria of the tool.

The following code is PEP-8 correct, but it doesn't follow the conventions of Black:

def my_function(name):
    """
    >>> my_function('black')
    'received Black'
    """
    return 'received {0}'.format(name.title())

Now, we can run the following command to format the file:

black -l 79 *.py

Now, we can see what the tool has written:

def my_function(name):
    """
    >>> my_function('black')
    'received Black'
    """
    return "received {0}".format(name.title())

On more complex code, a lot more would have changed (trailing commas, and more), but the idea can be seen clearly. Again, it's opinionated, but it's also a good idea to have a tool that takes care of details for us. It's also something that the Golang community learned a long time ago, to the point that there is a standard tool, gofmt, that automatically formats the code according to the conventions of the language. It's good that Python has something like this now.

These tools (Black, Pylint, Mypy, and many more) can be integrated with the editor or IDE of your choice to make things even easier. It's a good investment to configure your editor to make these kinds of modifications either when saving the file or through a shortcut.
