Modern Python Cookbook: 130+ updated recipes for modern Python 3.12 with new techniques and tools

Product type Paperback
Published in Jul 2024
Publisher Packt
ISBN-13 9781835466384
Length 818 pages
Edition 3rd Edition
Author: Steven F. Lott

1.3 String parsing with regular expressions

How do we decompose a complex string? What if we have complex, tricky punctuation? Or—worse yet—what if we don’t have punctuation, but have to rely on patterns of digits to locate meaningful information?

1.3.1 Getting ready

The easiest way to decompose a complex string is by generalizing the string into a pattern and then writing a regular expression that describes that pattern.

There are limits to the patterns that regular expressions can describe. When we’re confronted with deeply nested documents in a language like HTML, XML, or JSON, we often run into problems and can’t rely on regular expressions.
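
As a rough illustration of that limit, the sketch below contrasts a regular expression with a real parser on a small nested document. The sample document and the names document, naive, and data are invented for this example; json.loads is the standard library's JSON parser.

```python
import json
import re

# Hypothetical nested document, invented for this illustration.
document = '{"recipe": {"name": "Marmalade", "items": [{"name": "Kumquat"}]}}'

# A pattern can find every "name" value, but it cannot tell how deeply
# each one is nested, so the recipe name and the item name look alike.
naive = re.findall(r'"name":\s*"([^"]*)"', document)
print(naive)

# A parser keeps the structure, so each value can be reached by its path.
data = json.loads(document)
print(data["recipe"]["name"])
print(data["recipe"]["items"][0]["name"])
```

The pattern returns both names in one flat list, while the parsed structure keeps them apart. For flat, line-oriented text like the ingredient example that follows, a regular expression remains a good fit.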

The re module contains all of the various classes and functions we need to create and use regular expressions.

Let’s say that we want to decompose text from a recipe website. Each line looks like this:

>>> ingredient = "Kumquat: 2 cups"

We want to separate the ingredient from the measurements.

1.3.2 How to do it...

To write and use regular expressions, we often do this:

  1. Generalize the example. In our case, we have something that we can generalize as:

    (ingredient words): (amount digits) (unit words)

    We’ve replaced literal text with a two-part summary: what it means and how it’s represented. For example, ingredient is represented as words, while amount is represented as digits.
  2. Import the re module:

    >>> import re
  3. Rewrite the pattern into regular expression (RE) notation:

    >>> pattern_text = r'([\w\s]+):\s+(\d+)\s+(\w+)'

    We’ve replaced representation hints such as ingredient words, a mixture of letters and spaces, with [\w\s]+. We’ve replaced amount digits with \d+. And we’ve replaced single spaces with \s+ to allow one or more spaces to be used as punctuation. We’ve left the colon in place because, in regular expression notation, a colon matches itself.

    For each of the fields of data, we’ve used () to capture the data matching the pattern. We didn’t capture the colon or the spaces because we don’t need the punctuation characters.

    REs typically use a lot of \ characters. To make this work out nicely in Python, we almost always use raw strings. The r' prefix tells Python to leave the \ characters alone instead of treating them as escape sequences that stand for special characters that aren’t on our keyboards.

  4. Compile the pattern:

    >>> pattern = re.compile(pattern_text)
  5. Match the pattern against the input text. If the input matches the pattern, we’ll get a match object that shows details of the substring that matched:

    >>> match = pattern.match(ingredient)
    >>> match is None
    False
    >>> match.groups()
    ('Kumquat', '2', 'cups')
  6. Extract the individual groups of characters from the match object by position:

    >>> match.group(1)
    'Kumquat'
    >>> match.group(2)
    '2'
    >>> match.group(3)
    'cups'

Each group is identified by the order of the capture () portions of the regular expression. The groups() method gives us a tuple of the different fields captured from the string. We’ll return to the use of the tuple data structure in the Using tuples of items recipe. Counting positions can be confusing in more complex regular expressions; there is a way to provide a name, instead of the numeric position, to identify a capture group.

1.3.3 How it works...

There are a lot of different kinds of string patterns that we can describe with regular expressions.

We’ve shown a number of character classes:

  • \w matches any alphanumeric character (a to z, A to Z, 0 to 9) as well as the underscore (_).

  • \d matches any decimal digit.

  • \s matches any whitespace character, including space, tab, and newline.

These classes also have inverses:

  • \W matches any character that’s not a letter, a digit, or an underscore.

  • \D matches any character that’s not a digit.

  • \S matches any character that’s not some kind of space or tab.
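
To see these classes side by side, the sketch below applies each one, with a + suffix, to the sample ingredient line. The names sample and found are made up for this illustration.

```python
import re

sample = "Kumquat: 2 cups"

# Apply each class (one or more repetitions) and collect every match.
found = {
    cls: re.findall(cls + '+', sample)
    for cls in (r'\w', r'\d', r'\s', r'\W', r'\D', r'\S')
}

for cls, matches in found.items():
    print(cls, matches)
```

Here \w+ finds ['Kumquat', '2', 'cups'], while its inverse \W+ finds the punctuation and spaces in between, [': ', ' '].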

Many characters match themselves. Some characters, however, have a special meaning, and we have to use \ to escape from that special meaning:

  • We saw that + as a suffix means to match one or more of the preceding patterns. \d+ matches one or more digits. To match an ordinary +, we need to use \+.

  • We also have * as a suffix, which matches zero or more of the preceding patterns. \w* matches zero or more alphanumeric characters. To match a *, we need to use \*.

  • We have ? as a suffix, which matches zero or one of the preceding expressions. This character is used in other places, and has a different meaning in the other context. We’ll see it used in (?P<name>...), where it is inside the () to define special properties for the grouping.

  • The . character matches any single character. To match a . specifically, we need to use \..
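
The special characters listed above can be tried out directly. The following is a small sketch with invented sample strings; the final part uses \b (a word boundary in regular expressions), which this recipe doesn't otherwise cover, to show why the r prefix matters once backslashes are involved.

```python
import re

# + means "one or more"; \+ is a literal plus sign.
digits = re.findall(r'\d+', "12+345")
pluses = re.findall(r'\+', "12+345")
print(digits, pluses)

# * allows zero repetitions, so the spaces before the unit are optional.
no_space = re.fullmatch(r'\d+\s*cups', "2cups")
print(no_space is not None)

# ? allows zero or one repetition, so the final "s" is optional.
singular = re.fullmatch(r'cups?', "cup")
print(singular is not None)

# An unescaped . matches any character; \. matches only a period.
any_char = re.fullmatch(r'1.5', "1x5")
only_dot = re.fullmatch(r'1\.5', "1x5")
print(any_char is not None, only_dot is None)

# Without the r prefix, Python itself turns \b into a backspace character,
# so the re module never sees the word-boundary escape.
print(len('\b'), len(r'\b'))
plain = re.search('\bcups\b', "2 cups")
raw = re.search(r'\bcups\b', "2 cups")
print(plain is None, raw is not None)
```

Every comparison prints True, and the two len() calls print 1 and 2: the plain string holds a single backspace character, while the raw string holds a backslash followed by b.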

We can create our own unique sets of characters using [] to enclose the elements of the set. We might have something like this:

(?P<name>\w+)\s*[=:]\s*(?P<value>.*)

This has a \w+ to match one or more alphanumeric characters. This will be collected into a group called name. It uses \s* to match an optional sequence of spaces. It matches any character in the set [=:]. Exactly one of the two characters in this set must be present. It uses \s* again to match an optional sequence of spaces. Finally, it uses .* to match everything else in the string. This is collected into a group named value.

We can use this to parse strings, like this:

size = 12
weight: 14

By being flexible with the punctuation, we can make a program easier to use. We’ll tolerate any number of spaces, and either an = or a : as a separator.
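
A short sketch of this pattern in use follows. The names setting_pattern, lines, and parsed are hypothetical, and the third input line is a made-up variation added to show the tolerance for uneven spacing.

```python
import re

setting_pattern = re.compile(r'(?P<name>\w+)\s*[=:]\s*(?P<value>.*)')

# The first two lines come from the text; the third is a made-up variation.
lines = ["size = 12", "weight: 14", "color=  dark red"]

parsed = {}
for line in lines:
    match = setting_pattern.match(line)
    if match is not None:
        # Named groups are retrieved by name instead of position.
        parsed[match.group("name")] = match.group("value")

print(parsed)
```

All three lines parse, giving {'size': '12', 'weight': '14', 'color': 'dark red'}. The values are still strings, and because .* takes everything to the end of the line, any trailing spaces would be kept; a call to strip() may be needed for real input.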

1.3.4 There’s more...

A long regular expression can be awkward to read. We have a clever Pythonic trick for presenting an expression in a way that’s much easier to read:

>>> ingredient_pattern = re.compile(
... r'(?P<ingredient>[\w\s]+):\s+' # name of the ingredient up to the ":"
... r'(?P<amount>\d+)\s+' # amount, all digits up to a space
... r'(?P<unit>\w+)' # units, alphanumeric characters
... )

This leverages three syntax rules:

  • A statement isn’t finished until the () characters match.

  • Adjacent string literals are silently concatenated into a single long string.

  • Anything between # and the end of the line is a comment, and is ignored.

We’ve put Python comments after the important clauses in our regular expression. This can help us understand what we did, and perhaps help us diagnose problems later.
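
One way to confirm that the comments and line breaks never reach the re module is to inspect the compiled object's pattern attribute. This is a small sketch; the name one_line is made up for the comparison.

```python
import re

ingredient_pattern = re.compile(
    r'(?P<ingredient>[\w\s]+):\s+'  # name of the ingredient up to the ":"
    r'(?P<amount>\d+)\s+'           # amount, all digits up to a space
    r'(?P<unit>\w+)'                # units, alphanumeric characters
)

# Python joins the three adjacent literals before re.compile() is called,
# so the compiled object holds one ordinary pattern string.
one_line = r'(?P<ingredient>[\w\s]+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)'
print(ingredient_pattern.pattern)
print(ingredient_pattern.pattern == one_line)
```

The comparison prints True: the split form is only a source-code layout choice and does not change the pattern.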

We can also use the regular expression’s "verbose" mode to add whitespace and comments inside a regular expression string. To do this, we must pass re.X as an option when compiling the regular expression. This revised syntax looks like this:

>>> ingredient_pattern_x = re.compile(r'''
... (?P<ingredient>[\w\s]+):\s+ # name of the ingredient up to the ":"
... (?P<amount>\d+)\s+ # amount, all digits up to a space
... (?P<unit>\w+) # units, alphanumeric characters
... ''', re.X)

We can either break the pattern up into separate string components, or make use of extended syntax to make the regular expression more readable. The benefit of providing names shows up when we use the groupdict() method of the match object to extract parsed values by the name associated with the pattern being captured.

1.3.5 See also
