Generating more readable code
In this section, you will learn how to generate code that others can easily read and start working with, without having to ponder its inner workings for hours.
If we are writing code, we want it to be used, and that means it should be understood by other people, especially if we want someone to expand upon it. They’re much more likely to bother if we make our code easy to read.
In the subsection Code to compress data, written in Python 3.10, we’ll ask an LLM for some code to compress data. First, though, here’s an intro to what data compression is, so we know a bit more about what we’re doing. Data compression is important because it saves storage space on our devices and bandwidth on the internet.
Introduction to data compression methods
Here’s some background on what data compression is so that the following section is easier to understand.
Lossless data compression
Lossless data compression means both the signal and the noise are compressed; nothing is lost, and everything can be re-created exactly [Wiki_LZMA2].
An example of lossless compression would be a dataset like this: [1,1,1,1,1,1,1….1,1,1,1,1,1,0,0,0,0,0,0…].
This dataset could have any number of ones and any number of zeros, but it can be represented by just the count and the value of each run. If there were 213 ones and 789 zeros, the lossless compression method Run-Length Encoding (RLE) would encode it as [(213, 1), (789, 0)]. The data is now just 4 numbers, not 213 + 789 = 1,002 numbers, which is a compression ratio of 1002/4 = 250.5 [WikiDataComp]. This achieves a good compression ratio without losing anything. This could be real data, for example, the answer to the question, “Is the windscreen wiper on?”.
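To make this concrete, here’s a minimal RLE sketch in Python; the rle_encode and rle_decode function names are mine, for illustration:

```python
from itertools import groupby

def rle_encode(data):
    """Encode a sequence as (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(data)]

def rle_decode(pairs):
    """Rebuild the original sequence from (count, value) pairs."""
    return [value for count, value in pairs for _ in range(count)]

data = [1] * 213 + [0] * 789
encoded = rle_encode(data)
print(encoded)                         # [(213, 1), (789, 0)]
print(rle_decode(encoded) == data)     # True: nothing was lost
print(len(data) / (2 * len(encoded)))  # Compression ratio: 250.5
```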
Use lossless compression when you have a signal that tends to stay constant for long stretches. If you use it on very precise or constantly varying data, it will instead take up more space on your drive!
LZMA is another type of lossless compression method, which we shall use in a moment.
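As a quick preview, here’s a minimal lossless round trip with the Python standard library’s lzma module:

```python
import lzma

# Highly repetitive data, like the wiper signal above, compresses very well
original = b"1" * 213 + b"0" * 789

compressed = lzma.compress(original)    # lossless compression
restored = lzma.decompress(compressed)  # exact reconstruction

assert restored == original             # nothing was lost
print(len(original), "->", len(compressed), "bytes")
```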
Lossy compression
A second type of compression is called lossy compression, where the algorithm or method tries to keep the signal and eliminate some level of noise. For lossy compression, there can be different levels of compression, offering a trade-off between fidelity and file size: more compression means less fidelity – more of the signal and noise are thrown out or zeroed.
Examples of lossy compression include Discrete Fourier Transforms (DFTs), Discrete Wavelet Transforms (DWTs), or, best yet, Discrete Cosine Transforms (DCTs) [Wiki_Wavelets].
DCTs are used in a lot of standard compression methods for images (JPEG), video (MPEG), audio (Dolby Digital, MP3, AAC), digital television (HDTV, SDTV), and digital radio (AAC+, DAB) [Wiki_Cosine].
DCTs transform data from a signal that varies over time into a sum of many cosine waves of differing frequencies. It sounds ridiculously simple, but you can represent most things this way (I have simplified a little). You generally keep the low-frequency waves and zero out the high-frequency waves, which are assumed to be noise. Then the signal is transformed back to its original time-based form: varying over time.
If you want more fidelity – a better signal – you’ll zero out fewer of the high-frequency waves but will have to accept a file size that’s not reduced much. If you really need a small file size, you’ll zero out more waves, but that’ll leave you with a worse signal, image, or sound. Another way to look at the compression ratio is the bit rate, measured in bits/second: a higher bit rate can give a higher-quality stream of data (e.g., video or sound) and, if it’s stored in a file, a larger file size.
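Here’s a minimal sketch of that idea, assuming SciPy is installed; the test signal and the cutoff of 32 coefficients are arbitrary choices for illustration:

```python
import numpy as np
from scipy.fft import dct, idct

# A smooth, low-frequency signal with some high-frequency noise added
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(256)

coeffs = dct(signal, norm='ortho')          # time domain -> cosine-wave coefficients
k = 32                                      # keep only the 32 lowest-frequency waves
coeffs[k:] = 0                              # zero out the high frequencies (assumed noise)
reconstructed = idct(coeffs, norm='ortho')  # back to the time-based form

# Storing 32 coefficients instead of 256 samples gives a ratio of 256/32 = 8
print("compression ratio:", signal.size / k)
```

Keeping more coefficients (a larger k) gives more fidelity and a larger file; keeping fewer gives the opposite.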
Alternatively, instead of zeroing the values, you can quantize them. Here, rather than keeping highly precise values, you turn them into less precise ones – for example, 3.48724 could fit into the [3.4 to 3.5] bin or, even less precisely, the [3 to 4] bin. The values become a histogram – discrete rather than continuous. This saves space.
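In code, quantization can be as simple as snapping values to the bottom edge of a bin. Here’s a sketch with NumPy, with the bin widths chosen arbitrarily:

```python
import numpy as np

values = np.array([3.48724, 1.91207, 2.75530])

# Bin width of 0.1: 3.48724 falls into the [3.4, 3.5) bin
print(np.floor(values / 0.1) * 0.1)  # [3.4 1.9 2.7]

# Even less precise, a bin width of 1: 3.48724 falls into the [3, 4) bin
print(np.floor(values))              # [3. 1. 2.]
```

The binned values need fewer bits to store than the originals, which is where the space saving comes from.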
To get the ideal compressed data, there needs to be a balance between good compression and a good quality of signal, image, or video.
Learned compression or neural compression
A third type of compression is learned compression, where an ML method tries to learn what the signal is and keep that, at the expense of the noise.
The first two kinds of compression, lossless and lossy, are rule based, but in learned compression, neural networks are used to learn the patterns in the data and extract specific statistical properties. These learned methods are more adaptive [Gemini]. When I compressed tables or files of data with lossless and lossy methods, I had to try different methods and select the best one for each type of data [Taylor_2022]. That manual selection works well but is less adaptive than learned compression.
Learned compression models are more adaptable and the model can continuously learn, thus improving with new data [Gemini].
A method of learned compression is the autoencoder (AE). It is a palindromic or mirror-shaped neural network that takes in data, compresses it into a tiny space – a latent space – and then reproduces the input data. The output layer looks like, and is the same size as, the input layer. These are the biggest layers, with the layers shrinking in size toward the middle, like two triangles, or perhaps a bowtie. This is useful because the latent space is a highly compressed version of the data – there’s no room for excess! The first side or triangle is the encoder and the second side is the decoder: the encoder compresses the data, then the decoder reconstructs the input data.
In learned compression, the balance between good compression and good quality is controlled by the loss function [Ioijilai].
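As a concrete illustration, here’s a minimal bowtie-shaped autoencoder sketch, assuming PyTorch is installed; the layer sizes (64 -> 16 -> 4) and the mean-squared-error loss are arbitrary example choices, not a prescription:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: layers shrink toward the tiny latent space
        self.encoder = nn.Sequential(
            nn.Linear(64, 16),
            nn.ReLU(),
            nn.Linear(16, 4),   # the 4-value latent space: the compressed data
        )
        # Decoder: the mirror image, growing back to the input size
        self.decoder = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 64),  # output layer is the same size as the input layer
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(32, 64)                 # a batch of 32 examples with 64 values each
reconstruction = model(x)
loss = nn.MSELoss()(reconstruction, x)  # reconstruction loss: the compression/quality balance
loss.backward()
```

Training minimizes the reconstruction loss, and the 4-value latent vector is the compressed representation.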
Advantages of learned compression
- Adaptability: Unlike more traditional compression methods, learned compression models can adapt to complex and diverse data types, such as scientific data, because they can learn the relationships between variables.
- Efficiency: Learned compression models can sometimes compress better than older methods:
- Learned models can learn the statistical properties of language and underlying structures in audio data, so can compress data more efficiently.
- Convolutional Neural Networks (CNNs) are especially adept at detecting patterns and redundancies in images and videos, so can compress more effectively than JPEG and MPEG. The effect is more pronounced in higher-resolution data.
- Overall, the higher possible complexity of ML models can enable the learning of more intricate structures, leading to better compression.
- Generating data: Decoders can generate new examples that resemble the input data.
Disadvantages of learned compression
- Complexity: Learned compression can be more effective, but the models are also more difficult to understand.
- Computational costs: These ML models can be very expensive to train and run. This is a bigger disadvantage for applications that require real-time processing, such as autonomous vehicles. You might well have to deploy a simplified and optimized trained model to the edge or use specially designed hardware.
- Lossy: Learned compression can be lossy. This means the trade-off between quality and compression needs to be struck.
My background includes creating data compression software. Working with others at the University of Warwick, I created data compression software methodologies for Jaguar Land Rover and Onsemi, an American semiconductor manufacturer. If you want to know more, I’ve got a paper or two on compression, as well as other ML research here: https://scholar.google.com/citations?user=9lCeU94AAAAJ&hl.
After compression, you can store or transmit the data. But to get the data back, you’ll usually need to decompress it.
Now, let’s get some code to do compression ourselves.
Code to compress data, written in Python 3.10
There are at least three types of data compression: lossless compression, lossy compression, and learned compression.
Let’s try the following code (see “Claude 3 provides the following output”); it compresses data to make it smaller on disk with a commonly used compression algorithm, the Lempel-Ziv-Markov chain algorithm (LZMA), in its LZMA2 form. For this code, we will use lossless data compression, courtesy of LZMA.
LZMA2 is more powerful than LZMA; it’s a container format for uncompressed and LZMA data, as well as the compression of data that’s partially incompressible [Wiki_LZMA2, Taylor_2022].
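If you want to be explicit about LZMA2 in Python, the standard lzma module lets you pick the filter yourself; here’s a small sketch (the preset level is an arbitrary choice):

```python
import lzma

# The .xz container (FORMAT_XZ) uses the LZMA2 filter; FORMAT_ALONE is the
# legacy .lzma container, which only supports the original LZMA1 filter
filters = [{"id": lzma.FILTER_LZMA2, "preset": 9}]
compressor = lzma.LZMACompressor(lzma.FORMAT_XZ, filters=filters)

compressed = compressor.compress(b"Name\tAge\nAlice\t25\n") + compressor.flush()
print(lzma.decompress(compressed))  # b'Name\tAge\nAlice\t25\n'
```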
Anyway, the code here, courtesy of the Claude 3 LLM, should compress data with LZMA2.
Asking Claude 3 to write some bad code
Let’s start with how to get code that works but is hard to read; this is to compare against code that you might already have but that could be more explainable. LLMs will hopefully never give you this kind of code unless you explicitly ask for poorly written code.
We will use the following prompt:
Please give me some Python 3.10 code that compresses tabular data with LZMA2 that is very hard to understand because it's badly written.
Claude 3 provides the following output:
Here's an intentionally obfuscated and poorly written Python 3.10 code that compresses tabular data using the LZMA2 algorithm. This code is designed to be hard to understand and maintain:

```python
import lzma
from typing import Any

def j(b):
    c=[]
    d=lzma.LZMACompressor(lzma.FORMAT_ALONE)
    for a in b:e=''.join([str(f)+'\t'for f in a]);e=e[:-1].encode();c.extend(list(d.compress(e)))
    c.extend(list(d.flush()))
    return bytes(c)

def l(m):
    o=[];p=lzma.LZMADecompressor();q=b''
    for s in [m]: # Change: Iterate over a list containing the compressed data
        try:
            t=p.decompress(s)
            q+=t
        except lzma.LZMAError:
            u=p.flush()
            o.append(q.decode().split('\t'))
            q=u
    if q:
        o.append(q.decode().split('\t'))
    return o

def k(v):
    w=[];x=[]
    for y in v:x.extend([z.split('\t')for z in y])
    for a in zip(*x):w.append(list(a))
    return w

if __name__=='__main__':
    A=k([['Name','Age','City'],['Alice','25','New York'],['Bob','30','Los Angeles'],['Charlie','35','Chicago']])
    B=j(A)
    with open('compressed.lzma','wb')as C:C.write(B)
    with open('compressed.lzma','rb')as D:E=D.read()
    F=l(E)
    G=k(F)
    print(G)
```
You can find this code in the LZMAcompress_bad.py file in our GitHub repo here: https://github.com/PacktPublishing/Coding-with-ChatGPT-and-other-LLMs/tree/main/Chapter4/LZMAcompress_bad.py.
This code took a few rounds of debugging, but what you have here works for the little toy dataset given. Claude 3 was very helpful; it made one mistake by not correcting a bug, but the whole process was very easy. You just need to mention the code again if the chatbot has already seen it, or paste it into the chat field if it hasn’t, then state the problem, quoting the line the error message points to. The chatbot should then give a better version of the code. Repeat this until the code works nicely for you, then test it.
There’s not much commenting. The variable and function names are really unclear. It sometimes does a great deal in one line, and in a `for` statement line at that! Conditional statements shouldn’t have calculations or string operations in them! It prints `G` at the end, whatever that is. Variables are sometimes single lowercase letters and sometimes single uppercase letters – no convention.

I don’t really like that the ages are strings – couldn’t they be `int`s or at least `float`s?

It’s using `try`/`except`, which is good, and it’s using different functions, which are not large, so the code is not necessarily poor from an operational point of view.

I’m not sure if `l()` will always return something.

This code is really unclear. I would have to look at it for a while to figure out what’s going on.

It uses `for` statements, which are unpopular because they’re slow in languages like Python compared with vectorized approaches.

Well, that’s what I caught from a quick look.
Here is what Claude 3 says about its own example of bad code
The intentionally obfuscated and poorly written code that compresses tabular data using the LZMA2 algorithm in Python 3.10 exhibits several characteristics that make it hard to read and understand. Here are some of the main issues with the code:
- Poor variable naming: The code uses single-letter variable names such as `a, b, c, d, e, f, m, n, o, p, q, r, s, t, u, v, w, x, y`, and `z`, which provide no meaningful information about the purpose or content of the variables. This makes it extremely difficult to understand what each variable represents or what it’s being used for.
- Lack of comments: The code is completely devoid of comments or docstrings, which would normally provide explanations of the code’s purpose, functionality, and logic. Without any comments, it becomes a guessing game to understand what the code is trying to accomplish.
- Convoluted logic: The logic for compressing and decompressing the tabular data is spread across multiple functions (`j`, `l`, and `k`) and is unnecessarily complex. The code uses nested list comprehensions, unnecessary type conversions, and obscure operations that make it challenging to follow the flow of execution.
- Violation of coding conventions: The code violates several Python coding conventions and best practices, such as using meaningful variable names, proper indentation and formatting, and separation of concerns. This makes it harder for developers familiar with Python conventions to understand the code.
- Lack of error handling: The code does not include any error handling mechanisms or exception handling, which can lead to unexpected behavior and crashes in the case of errors or edge cases.
- Mixing of functionality: The code mixes functionality related to compression, decompression, and data manipulation, making it harder to follow and maintain. Ideally, these concerns should be separated into distinct functions or modules.
- Obscure type annotations: While the code uses type annotations, the annotations themselves are obscure and unhelpful due to the poor variable naming conventions – for example, the type annotation `Any` is used for all variables, which defeats the purpose of type annotations.
- Unnecessary complexity: The code introduces unnecessary complexity by using obscure techniques and terse expressions, sacrificing readability for brevity. This makes it harder for developers to understand and maintain the code, especially if they are unfamiliar with such techniques.
- Lack of modularity: The code is not designed in a modular way, making it difficult to integrate or reuse in larger applications or libraries.
- Limited extensibility: Due to the obfuscated nature of the code and the lack of proper documentation or comments, it becomes challenging to extend or modify the code without risking unintended consequences or introducing new bugs.
Overall, the code exhibits numerous examples of poor coding practices and intentional obfuscation, making it extremely difficult to read, understand, and maintain. While this exercise is valuable for understanding the challenges of working with poorly written code, it serves as a reminder of the importance of writing clear, readable, and well-documented code in real-world software development scenarios.
That was the end of the Claude 3 output.
Next, we must get some code that both works well and is easy to read, following good practices. Again, I went with the best coding LLM at the time, Claude 3.
Let’s look at some well-written code
We will now ask Claude 3 for good code that does the same job as the preceding: data compression with Python:
Please give me some Python 3.10 code that compresses tabular data with LZMA2 that is easy to understand because it's well-written.
Claude 3 gives us the following output:
Here's a well-written Python 3.10 code that compresses tabular data using the LZMA2 algorithm. This code follows best practices and is designed to be easy to read and understand:

```python
import lzma
from typing import List


def compress_tabular_data(tabular_data: List[List[str]]) -> bytes:
    """
    Compresses tabular data using the LZMA2 algorithm.

    Args:
        tabular_data: A list of lists representing tabular data,
            where each inner list is a row.

    Returns:
        A bytes object containing the compressed tabular data.
    """
    compressor = lzma.LZMACompressor(lzma.FORMAT_ALONE)
    compressed_data = []
    for row in tabular_data:
        row_bytes = '\t'.join(row).encode() + b'\n'
        compressed_data.append(compressor.compress(row_bytes))
    compressed_data.append(compressor.flush())
    return b''.join(compressed_data)


def decompress_tabular_data(compressed_data: bytes) -> List[List[str]]:
    """
    Decompresses tabular data compressed using the LZMA2 algorithm.

    Args:
        compressed_data: A bytes object containing the compressed
            tabular data.

    Returns:
        A list of lists representing the decompressed tabular data,
        where each inner list is a row.
    """
    decompressor = lzma.LZMADecompressor()
    decompressed_data = decompressor.decompress(compressed_data)
    rows = [row.split('\t') for row in decompressed_data.decode().split('\n') if row]
    return rows


if __name__ == '__main__':
    tabular_data = [
        ['Name', 'Age', 'City'],
        ['Alice', '25', 'New York'],
        ['Bob', '30', 'Los Angeles'],
        ['Charlie', '35', 'Chicago']
    ]

    # Compress tabular data
    compressed_data = compress_tabular_data(tabular_data)

    # Save compressed data to a file
    with open('compressed_data.lzma', 'wb') as file:
        file.write(compressed_data)

    # Load compressed data from the file
    with open('compressed_data.lzma', 'rb') as file:
        compressed_data = file.read()

    # Decompress tabular data
    decompressed_data = decompress_tabular_data(compressed_data)
    print(decompressed_data)
```
This code executes successfully. You can find it in the LZMAcompress_good.py file in the GitHub repo here: https://github.com/PacktPublishing/Coding-with-ChatGPT-and-other-LLMs/tree/main/Chapter4/LZMAcompress_good.py.
What Claude 3 says about its own good code
This code is easy to read and understand for the following reasons:
- Clear function names: The function names `compress_tabular_data` and `decompress_tabular_data` clearly describe the purpose of each function
- Type hints: The code uses type hints to specify the expected types of function arguments and return values, making it easier to understand the input and output formats
- Docstrings: Each function has a docstring that provides a brief description of the function’s purpose, its arguments, and its return value
- Meaningful variable names: The variable names used in the code are descriptive and self-explanatory, making it easier to understand what each variable represents
- Separation of concerns: The code separates the compression and decompression logic into two distinct functions, making it easier to maintain and extend
- Proper indentation and formatting: The code is properly indented and formatted, making it readable and visually appealing
- Error handling: While not explicitly shown in this example, the code can be extended to include error-handling mechanisms to handle potential exceptions or errors during compression or decompression
- Modular design: The code is designed in a modular way, allowing for easy integration into larger applications or libraries
- Use of standard library: The code uses the standard `lzma` module from the Python standard library, which is well-documented and widely used
- Example usage: The code includes an example usage in the `if __name__ == '__main__'` block, making it easy to understand how to use the functions
Overall, this code follows best practices for writing clean, readable, and maintainable Python code. It should be easy to understand for developers familiar with Python and the LZMA2 compression algorithm.
That was the end of the Claude 3 output and our discussion of what is good about its more readable version of the LZMA compression code in Python.
Now, let’s get to specific points that make code easy or hard to read and understand.