Python Regular Expressions Guide

Regular expressions (regex) are a powerful tool for matching patterns in text. Python's re module provides support for working with regular expressions, letting you search, match, and manipulate strings. In this blog post, we will explore common regex patterns and how to use them in Python.

Basic Patterns

.: Matches any single character except newline \n

 import re
 text = "hello world"
 match = re.findall(".", text)
 print(match)  # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']

^: Matches the start of a string

 text = "hello world"
 match = re.match("^hello", text)
 print(match)  # <re.Match object; span=(0, 5), match='hello'>

$: Matches the end of a string

 match = re.search("world$", text)
 print(match)  # <re.Match object; span=(6, 11), match='world'>

\b: Matches a word boundary

 match = re.search(r"\bworld\b", text)
 print(match)  # <re.Match object; span=(6, 11), match='world'>

\B: Matches a non-word boundary

 match = re.search(r"\Bworld", "hello world!")
 print(match)  # None
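To see \b and \B side by side, here is a small sketch. The sample string and variable names are made up for illustration; the match positions are listed so the two results can be told apart.

```python
import re

sample = "cat concat cats"

# \bcat\b: "cat" only when it stands alone as a whole word
whole_word_starts = [m.start() for m in re.finditer(r"\bcat\b", sample)]
print(whole_word_starts)  # [0]

# \Bcat: "cat" only when it begins inside a longer word
inside_word_starts = [m.start() for m in re.finditer(r"\Bcat", sample)]
print(inside_word_starts)  # [7]  (the "cat" in "concat")
```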

Character Classes

\d: Matches any digit

 match = re.findall(r"\d", "123abc456")
 print(match)  # ['1', '2', '3', '4', '5', '6']

\D: Matches any non-digit character

 match = re.findall(r"\D", "123abc456")
 print(match)  # ['a', 'b', 'c']

\w: Matches any word character (letters, digits, and the underscore)

 match = re.findall(r"\w", "hello_world123")
 print(match)  # ['h', 'e', 'l', 'l', 'o', '_', 'w', 'o', 'r', 'l', 'd', '1', '2', '3']

\W: Matches any non-word character (anything other than a letter, digit, or underscore)

 match = re.findall(r"\W", "hello world!")
 print(match)  # [' ', '!']

\s: Matches any whitespace character

match = re.findall(r"\s", "hello world")
print(match)  # [' ']

\S: Matches any non-whitespace character

match = re.findall(r"\S", "hello world")
print(match)  # ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
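The classes above match one character at a time. In practice they are usually paired with the + quantifier (covered in the Quantifiers section) to grab whole runs. A minimal sketch, using an invented sample line:

```python
import re

line = "order 66 shipped\tto dock 7"

# \d+ : runs of digits
digit_runs = re.findall(r"\d+", line)
print(digit_runs)  # ['66', '7']

# \w+ : runs of word characters
words = re.findall(r"\w+", line)
print(words)  # ['order', '66', 'shipped', 'to', 'dock', '7']

# \s+ : split on any run of whitespace (spaces and the tab alike)
fields = re.split(r"\s+", line)
print(fields)  # ['order', '66', 'shipped', 'to', 'dock', '7']
```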

Brackets and Groups

[]: Matches any single character within the brackets

match = re.findall(r"[aeiou]", "hello world")
print(match)  # ['e', 'o', 'o']

[^]: Matches any single character not within the brackets

match = re.findall(r"[^aeiou]", "hello world")
print(match)  # ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']
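Brackets also accept ranges such as a-z or 0-9, and several ranges can share one set of brackets. The following sketch uses a made-up string to show a range, a combined range, and its negation:

```python
import re

code = "3fA9z"

# Digits and lowercase a-f only
lower_hex = re.findall(r"[0-9a-f]", code)
print(lower_hex)  # ['3', 'f', '9']

# Add the uppercase range to the same brackets
any_hex = re.findall(r"[0-9a-fA-F]", code)
print(any_hex)  # ['3', 'f', 'A', '9']

# A leading ^ negates the whole set
not_hex = re.findall(r"[^0-9a-fA-F]", code)
print(not_hex)  # ['z']
```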

(): Groups multiple tokens together and captures the matched text

match = re.search(r"(hello) (world)", "hello world")
print(match.groups())  # ('hello', 'world')
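Captured groups can also be read one at a time, by number or by name. This sketch reuses the same text; the group names greeting and target are chosen here purely for illustration.

```python
import re

# (?P<name>...) gives a group a name in addition to its number
named = re.search(r"(?P<greeting>hello) (?P<target>world)", "hello world")

print(named.group(0))         # hello world  (the whole match)
print(named.group(1))         # hello        (first group, by number)
print(named.group("target"))  # world        (second group, by name)
print(named.groupdict())      # {'greeting': 'hello', 'target': 'world'}
```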

|: Alternation; matches either the pattern before or after the pipe symbol

match = re.search(r"hello|world", "hello world")
print(match.group())  # 'hello'
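Alternation applies to everything on each side of the pipe, so it is usually wrapped in a group to limit its reach. A small sketch with an invented sample string; note that re.findall returns the group contents when the pattern has a capturing group, which is why the non-capturing form (?:...) is used first.

```python
import re

text = "gray grey groy"

# Non-capturing group limits the alternation to the single letter
spellings = re.findall(r"gr(?:a|e)y", text)
print(spellings)  # ['gray', 'grey']

# Without a group, the pattern means "gra" OR "ey"
ungrouped = re.findall(r"gra|ey", text)
print(ungrouped)  # ['gra', 'ey']

# With a capturing group, findall returns only what the group captured
captured = re.findall(r"gr(a|e)y", text)
print(captured)  # ['a', 'e']
```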

Quantifiers

Quantifiers in regular expressions define how many times the preceding element (character or group) must occur for a match to be found. They provide flexibility and power in pattern matching. Let's delve deeper into the most common quantifiers with examples.

`*`: Matches Zero or More Occurrences

The * quantifier matches zero or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, including none.

Example:

import re

pattern = r"he*"
text = "heeello helo hi"

match = re.findall(pattern, text)
print(match)  # Output: ['heee', 'he', 'h']

Explanation:

heee: Matches because h is followed by three es.
he: Matches because h is followed by one e.
h: Matches because h is followed by zero es.

`+`: Matches One or More Occurrences

The + quantifier matches one or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, but at least one must be present.

Example:

pattern = r"he+"
text = "heeello helo"

match = re.findall(pattern, text)
print(match)  # Output: ['heee', 'he']

Explanation:

heee: Matches because h is followed by three es.
he: Matches because h is followed by one e.
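A quantifier applies only to the element immediately before it. To repeat a longer sequence, wrap it in a group first. A short sketch on a made-up string:

```python
import re

text = "ababab abb"

# + applies to the single character b
single_char = re.findall(r"ab+", text)
print(single_char)  # ['ab', 'ab', 'ab', 'abb']

# + applies to the whole group "ab"
whole_group = re.findall(r"(?:ab)+", text)
print(whole_group)  # ['ababab', 'ab']
```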

`?`: Matches Zero or One Occurrence

The ? quantifier matches zero or one occurrence of the preceding character or group. This means it will match at most one occurrence.

Example:

pattern = r"he?"
text = "heeello helo hi"

match = re.findall(pattern, text)
print(match)  # Output: ['he', 'he', 'h']

Explanation:

he: Matches at the start of heeello; three es follow the h, but ? consumes at most one.
he: Matches because h is followed by one e.
h: Matches because h is followed by zero es.
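A typical use of ? is making one character optional so that a single pattern covers spelling variants. A minimal sketch with invented sample strings:

```python
import re

# The u is optional, so both spellings match; "colr" does not
spellings = re.findall(r"colou?r", "color colour colr")
print(spellings)  # ['color', 'colour']

# The s is optional
schemes = re.findall(r"https?", "http https")
print(schemes)  # ['http', 'https']
```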

`{}`: Exact Number or Range of Occurrences

The {} quantifier specifies the exact number or range of occurrences of the preceding character or group.

Example:

pattern = r"he{2}"
text = "heeello helo"

match = re.findall(pattern, text)
print(match)  # Output: ['hee']

Explanation:

hee: Matches because h is followed by at least two es; {2} consumes exactly two, leaving the third e of heeello unmatched.
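The range forms {m,n} and {m,} work the same way. This sketch uses the same kind of input as the examples above:

```python
import re

text = "heeello helo"

# {1,2}: between one and two es (greedy, so it takes two when it can)
one_to_two = re.findall(r"he{1,2}", text)
print(one_to_two)  # ['hee', 'he']

# {2,}: two or more es
two_or_more = re.findall(r"he{2,}", text)
print(two_or_more)  # ['heee']

# {0,1}: equivalent to ?
zero_or_one = re.findall(r"he{0,1}", text)
print(zero_or_one)  # ['he', 'he']
```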

`.*`, `.*?`: Greedy and Non-Greedy Quantifiers

.*: Greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as much text as possible.
.*?: Non-greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as little text as possible.

Greedy Example:

pattern = r"<.*>"
text = "<tag>content</tag>"

match = re.search(pattern, text)
print(match.group())  # Output: '<tag>content</tag>'

Explanation:

.*: Matches everything from the first < to the last >, resulting in the entire string <tag>content</tag> being matched.

Non-Greedy Example:

pattern = r"<.*?>"
text = "<tag>content</tag>"

match = re.search(pattern, text)
print(match.group())  # Output: '<tag>'

Explanation:

.*?: Matches as little as possible, stopping at the first >, resulting in <tag> being matched.
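The difference is easier to see with re.findall, which returns every match. The sketch below also shows a negated character class as one common alternative to the non-greedy form, and that appending ? makes other quantifiers non-greedy too.

```python
import re

html = "<tag>content</tag>"

# Greedy: one match spanning the first < to the last >
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<tag>content</tag>']

# Non-greedy: each tag is matched separately
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<tag>', '</tag>']

# Negated class: "anything except >" gives the same result here
negated = re.findall(r"<[^>]*>", html)
print(negated)  # ['<tag>', '</tag>']

# +? is the non-greedy form of +
lazy_plus = re.search(r"he+?", "heee").group()
print(lazy_plus)  # he
```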

Lookahead and Lookbehind Assertions

Lookahead and lookbehind assertions are advanced regular expression techniques that allow you to match a pattern only if it is followed or preceded by another pattern, without including those patterns in the match. They are essential tools for creating complex and precise regex patterns. Let's explore these assertions in detail with examples.

Positive Lookahead Assertion `(?=...)`

A positive lookahead assertion (?=...) ensures that the specified pattern exists after the current position without including it in the match.

Example:
```
import re

pattern = r"\w+(?=\s)"
text = "hello world"

match = re.search(pattern, text)
print(match.group())  # Output: 'hello'
```
Explanation:
- \w+: Matches one or more word characters (letters, digits, or underscores).
- (?=\s): Ensures that the matched characters are followed by a whitespace character.
- The pattern matches hello because hello is followed by a space.
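A practical use of positive lookahead is pulling out values that carry a particular suffix without capturing the suffix itself. A minimal sketch, assuming an invented CSS-like string:

```python
import re

css = "width: 100px; height: 20em; margin: 5px"

# Digits only when immediately followed by "px"; "px" is not part of the match
pixel_values = re.findall(r"\d+(?=px)", css)
print(pixel_values)  # ['100', '5']
```

The 20 in 20em is skipped because no run of its digits is followed by px.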

Negative Lookahead Assertion `(?!...)`

A negative lookahead assertion (?!...) ensures that the specified pattern does not exist after the current position.

Example:

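    pattern = r"\b\w+\b(?!\s)"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'world'

Explanation:

\b\w+\b: Matches one whole word; the word boundaries stop the engine from backtracking to a partial word.
(?!\s): Ensures that the matched word is not followed by a whitespace character.
The pattern matches world because world is not followed by a space. Without the \b anchors, \w+(?!\s) would match hell instead: after hello fails the lookahead, \w+ gives back the o, and hell is followed by o rather than whitespace.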
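Negative lookahead is often used to exclude a specific continuation. The sketch below uses made-up sample text, and also shows how backtracking interacts with an unanchored \w+ placed in front of a negative lookahead.

```python
import re

text = "foobar foobaz"

# "foo" only when it is not followed by "bar"
not_bar = re.search(r"foo(?!bar)", text)
print(not_bar.start())  # 7  (the "foo" in "foobaz")

# Unanchored \w+ can backtrack until the lookahead succeeds
partial = re.search(r"\w+(?!\s)", "hello world").group()
print(partial)  # hell

# Word boundaries force a whole-word match
whole = re.search(r"\b\w+\b(?!\s)", "hello world").group()
print(whole)  # world
```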

Positive Lookbehind Assertion `(?<=...)`

A positive lookbehind assertion (?<=...) ensures that the specified pattern exists before the current position without including it in the match.

Example:

    pattern = r"(?<=\s)\w+"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'world'

Explanation:

(?<=\s): Ensures that the matched characters are preceded by a whitespace character.
\w+: Matches one or more word characters (letters, digits, or underscores).
The pattern matches world because world is preceded by a space.
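Positive lookbehind is handy for extracting values that follow a marker. One thing to keep in mind is that Python's re module requires lookbehind patterns to have a fixed width. A sketch with an invented sample string:

```python
import re

text = "apples $5, pears $12, 7 plums"

# Digits only when immediately preceded by a dollar sign
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)  # ['5', '12']

# A variable-width lookbehind such as (?<=\$+) is rejected at compile time
try:
    re.compile(r"(?<=\$+)\d+")
    variable_width_rejected = False
except re.error:
    variable_width_rejected = True
print(variable_width_rejected)  # True
```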

Negative Lookbehind Assertion `(?<!...)`

A negative lookbehind assertion (?<!...) ensures that the specified pattern does not exist before the current position.

Example:

    pattern = r"(?<!\s)\w+"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'hello'

Explanation:

(?<!\s): Ensures that the matched characters are not preceded by a whitespace character.
\w+: Matches one or more word characters (letters, digits, or underscores).
The pattern matches hello because hello is not preceded by a space.
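Negative lookbehind can exclude values that follow a marker. As a sketch on an invented sample string, the example below picks out numbers that are not prices, and shows why a word boundary is needed alongside the assertion.

```python
import re

text = "apples $5, pears $12, 7 plums"

# Without \b, the "2" in "$12" matches: it is preceded by "1", not "$"
loose = re.findall(r"(?<!\$)\d+", text)
print(loose)  # ['2', '7']

# \b forces the match to begin at the start of a number
non_prices = re.findall(r"(?<!\$)\b\d+", text)
print(non_prices)  # ['7']
```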

Conclusion

Regular expressions are a versatile tool for text processing, enabling you to match, search, and manipulate strings efficiently. By mastering these patterns and understanding how to use them in Python, you can perform complex text operations with ease. Whether you're parsing logs, validating input, or extracting data, regex provides the flexibility and power needed for a wide range of tasks.

Emeron Marcelle
