
How To Write More Elegant Regex Patterns In Python?

Regex is one of the most useful text-processing tools available. It lets us find patterns rather than exact words or phrases in a text, and regex engines are generally fast.

The difficult part is defining a pattern. Experienced programmers may write one on the go, but most developers will spend time googling and reading documentation.

Regardless of experience, reading a pattern someone else defined is difficult.

This is the problem PRegEx solves.

PRegEx is a Python library that makes regex patterns more elegant and readable. It’s now one of my favorite libraries for cleaner Python code. 

You can install it from the PyPI repository.

pip install pregex
# Poetry users can install it with:
# poetry add pregex

Start writing more readable Regex.

Here’s an example that shows how useful PRegEx is.

It’s very common to need to extract (US) zip codes from addresses. It’s not difficult if the addresses are standardized. Otherwise, we need to use some clever techniques to extract them. 

United States zip codes are usually five-digit numbers. Some zip codes also have a four-digit extension separated by a hyphen.

For instance, 88310 is a zip code in New Mexico. Some people also include the geographic segment as an extension, as in 88310-7241.

Here’s the common approach (using the re module) to find patterns of this kind. 

import re

pattern = r"\d{5}(-\d{4})?"

address = "730 S White Sands Blvd, Alamogordo, NM 88310, United States"

zip_code = re.search(pattern, address).group()

print(zip_code)

# 88310

The steps may seem straightforward. However, explaining how you defined the pattern to a novice programmer could take an hour-long lecture.

I’m not going to explain it either, because we have PRegEx. Here’s the PRegEx version of it.

from pregex.classes import AnyDigit
from pregex.quantifiers import Exactly, Optional

pattern = Exactly(AnyDigit(), 5) + Optional("-" + Exactly(AnyDigit(), 4))

address1 = "730 S White Sands Blvd, Alamogordo, NM 88310, United States"
address2 = "730 S White Sands Blvd, Alamogordo, NM 88310-7421, United States"

pattern.get_matches(address1)
# ['88310']

pattern.get_matches(address2)
# ['88310-7421']

As you can see, this code is simple both to define and to understand.

The pattern has two segments, and the second one is optional. The first segment must have exactly five digits. The second segment, if present, must have a hyphen followed by four digits.
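For comparison, the same two segments can be spelled out with the standard re module in verbose mode. This is only a sketch of an equivalent pattern; it is an assumption that PRegEx builds something similar internally, and the name ZIP_PATTERN is made up for this illustration.

```python
import re

# Hand-written equivalent of the two segments described above.
# Assumption: this mirrors the PRegEx pattern in behavior; it is not
# taken from the library's generated output.
ZIP_PATTERN = re.compile(
    r"""
    \d{5}          # first segment: exactly five digits
    (?:-\d{4})?    # optional second segment: a hyphen and four digits
    """,
    re.VERBOSE,
)

address1 = "730 S White Sands Blvd, Alamogordo, NM 88310, United States"
address2 = "730 S White Sands Blvd, Alamogordo, NM 88310-7421, United States"

# Collect the full text of every match in each address.
matches1 = [m.group() for m in ZIP_PATTERN.finditer(address1)]
matches2 = [m.group() for m in ZIP_PATTERN.finditer(address2)]

print(matches1)  # ['88310']
print(matches2)  # ['88310-7421']
```

Verbose mode lets each segment carry its own comment, which narrows the readability gap a little, but the symbols still have to be learned.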

Understand the submodules to create more interesting regex patterns.

Here we used two submodules of the PRegEx library: classes and quantifiers. The classes submodule determines what to match, and the quantifiers submodule specifies how many repetitions to match.

You could use other classes, such as AnyButDigit to match non-digit characters or AnyLowercaseLetter to match lowercase letters. You could also use other quantifiers, such as OneOrMore, AtLeast, AtMost, or Indefinite, to create more complex regex patterns.
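To make those names concrete, here is a rough sketch of what they correspond to in raw regex syntax, using only the re module. The mapping in the comments is my reading of what the names mean, not output taken from PRegEx, and the sample string is made up.

```python
import re

# Assumed raw-regex counterparts of the quantifier names:
#   OneOrMore  -> +       Indefinite -> *
#   AtLeast(n) -> {n,}    AtMost(n)  -> {0,n}
sample = "abc 12345 XY z9"  # hypothetical input

# One or more lowercase letters.
lowercase_runs = re.findall(r"[a-z]+", sample)

# At least two digits in a row.
long_numbers = re.findall(r"\d{2,}", sample)

# Zero or more non-digit characters from the start of the string.
non_digit_prefix = re.match(r"\D*", sample).group()

# At most three digits: "12" fits, "1234" does not.
fits = re.fullmatch(r"\d{0,3}", "12") is not None
too_long = re.fullmatch(r"\d{0,3}", "1234") is not None

print(lowercase_runs)          # ['abc', 'z']
print(long_numbers)            # ['12345']
print(repr(non_digit_prefix))  # 'abc '
print(fits, too_long)          # True False
```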

Here’s another example with more interesting matches. We need to find email addresses in a text. That’s simple. But in addition to matching the pattern, we also want to capture the domain of each email address.

from pregex.classes import AnyButWhitespace
from pregex.groups import Capture
from pregex.quantifiers import OneOrMore, AtLeastAtMost


pattern = (
    OneOrMore(AnyButWhitespace())
    + "@"
    + Capture(
        OneOrMore(AnyButWhitespace()) + "." + AtLeastAtMost(AnyButWhitespace(), 2, 3)
    )
)

text = """My name is Alice. I live in Wonderland. You can mail me: [email protected]. 
In case I can't reply, please mail my friend the White Rabbit: [email protected].
But for more serious issues, you should mail Tony Stark at [email protected].
"""

# Get everything you captured. 
pattern.get_captures(text)
# [('wonderland.com',), ('wonderland.com',), ('stark.org',)]

# Get all your matches. 
pattern.get_matches(text)
# ['[email protected]', '[email protected]', '[email protected]']

In the above example, we’ve used the Capture class from the groups submodule. It lets us collect segments within a match, so we don’t have to do any post-processing to extract them.

Another submodule you’ll often need is operators. It helps you concatenate patterns or choose one of a set of alternatives.
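For readers who know the re module, capturing works like a parenthesized group. The sketch below is a hand-written approximation of the email pattern, not what PRegEx generates; the sample text and its addresses are hypothetical, and the name EMAIL is made up for this illustration.

```python
import re

# Hypothetical sample text; the addresses are invented for illustration.
text = "Write to first.user@example.com or second.user@example.org for details."

# \S+ matches one or more non-whitespace characters.
# The parentheses capture the domain: some characters, a dot,
# then two or three more non-whitespace characters.
EMAIL = re.compile(r"\S+@(\S+\.\S{2,3})")

# Full matches, then only the captured domain of each match.
matches = [m.group() for m in EMAIL.finditer(text)]
domains = [m.group(1) for m in EMAIL.finditer(text)]

print(matches)  # ['first.user@example.com', 'second.user@example.org']
print(domains)  # ['example.com', 'example.org']
```

The captured group is available on each match object, so no string splitting on "@" is needed afterwards.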

Here’s a slightly modified version of the same example above. 

from pregex.classes import AnyButWhitespace
from pregex.groups import Capture
from pregex.operators import Either
from pregex.quantifiers import OneOrMore


pattern = (
    OneOrMore(AnyButWhitespace())
    + "@"
    + Capture(OneOrMore(AnyButWhitespace()) + Either(".com", ".org"))
)

text = """My name is Alice. I live in Wonderland. You can mail me: [email protected]. 
In case I can't reply, please mail my friend the White Rabbit: [email protected].
But for more serious issues, you should mail Tony Stark at [email protected].
Please don't message [email protected]
"""

pattern.get_captures(text)
# [('wonderland.com',), ('wonderland.com',), ('stark.org',)]

In the above example, we’ve restricted the top-level domain to either ‘.com’ or ‘.org’. We used the Either class from the operators submodule to build this pattern. As you can see, it didn’t match [email protected], because its top-level domain is ‘.err’, not ‘.com’ or ‘.org’.
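In raw regex terms, choosing between alternatives is written with the | operator. The following sketch approximates the restricted pattern with the re module only. It is an assumed equivalent rather than PRegEx output, and the addresses and the name EMAIL_COM_ORG are hypothetical.

```python
import re

# Hypothetical sample text; the third address uses an unsupported ending.
text = (
    "Mail first.user@example.com or second.user@example.org, "
    "but not third.user@example.err today."
)

# (?:\.com|\.org) accepts exactly one of the two endings.
# The outer parentheses capture the whole domain.
EMAIL_COM_ORG = re.compile(r"\S+@(\S+(?:\.com|\.org))")

domains = [m.group(1) for m in EMAIL_COM_ORG.finditer(text)]

print(domains)  # ['example.com', 'example.org']
```

The address ending in ‘.err’ produces no match at all, so it never appears in the captured domains.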

Final thoughts

Defining regex may not be a huge task for experienced developers. But even for them, reading and understanding a pattern created by someone else is difficult. For beginners, both can be daunting. 

Besides, regex is an excellent tool for text mining. Any developer or data scientist will almost certainly come across regex usage. 

If you’re a Python programmer, PRegEx has the difficult parts covered. 
