Understanding Regular Expressions in Python
Regular expressions (regex) are a powerful tool for matching patterns in text. Python's re module provides support for working with regular expressions, allowing you to search, match, and manipulate strings. In this blog post, we will explore common regex patterns and how to use them in Python.
Basic Patterns
.: Matches any single character except a newline (\n).
Example:
import re
text = "hello world"
match = re.findall(".", text)
print(match) # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
^: Matches the start of a string.
Example:
text = "hello world"
# re.search is used here because re.match already anchors at the start on its own
match = re.search("^hello", text)
print(match) # <re.Match object; span=(0, 5), match='hello'>
$: Matches the end of a string.
Example:
match = re.search("world$", text)
print(match) # <re.Match object; span=(6, 11), match='world'>
\b: Matches a word boundary.
Example:
match = re.search(r"\bworld\b", text)
print(match) # <re.Match object; span=(6, 11), match='world'>
\B: Matches a position that is not a word boundary.
Example:
match = re.search(r"\Bworld", "hello world!")
print(match) # None, because "world" begins at a word boundary
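The descriptions of ., ^ and $ above assume the default matching mode. As a small sketch (the two-line sample string is made up for illustration), the standard re.DOTALL and re.MULTILINE flags change how these tokens treat newlines:

```python
import re

sample = "first line\nsecond line"  # hypothetical two-line input

# By default "." stops at the newline, so each line is matched separately.
dot_default = re.findall(r".+", sample)
# With re.DOTALL, "." also matches "\n", so the whole text is one match.
dot_all = re.findall(r".+", sample, flags=re.DOTALL)

# By default "^" only matches at the very start of the string.
start_default = re.findall(r"^\w+", sample)
# With re.MULTILINE, "^" and "$" also match at the start and end of each line.
start_multiline = re.findall(r"^\w+", sample, flags=re.MULTILINE)
end_multiline = re.findall(r"\w+$", sample, flags=re.MULTILINE)

print(dot_default)      # ['first line', 'second line']
print(dot_all)          # ['first line\nsecond line']
print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(end_multiline)    # ['line', 'line']
```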
Character Classes
\d: Matches any digit.
Example:
match = re.findall(r"\d", "123abc456")
print(match) # ['1', '2', '3', '4', '5', '6']
\D: Matches any non-digit character.
Example:
match = re.findall(r"\D", "123abc456")
print(match) # ['a', 'b', 'c']
\w: Matches any word character, that is, any alphanumeric character or an underscore.
Example:
match = re.findall(r"\w", "hello_world123")
print(match) # ['h', 'e', 'l', 'l', 'o', '_', 'w', 'o', 'r', 'l', 'd', '1', '2', '3']
\W: Matches any character that is not a word character (not alphanumeric and not an underscore).
Example:
match = re.findall(r"\W", "hello world!")
print(match) # [' ', '!']
\s: Matches any whitespace character.
Example:
match = re.findall(r"\s", "hello world")
print(match) # [' ']
\S: Matches any non-whitespace character.
Example:
match = re.findall(r"\S", "hello world")
print(match) # ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
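Character classes become more useful when combined with the + quantifier (covered in the Quantifiers section). The following is a minimal sketch on a made-up log line, not a recommended log parser:

```python
import re

# Hypothetical sample line; note the double space after the user name.
log_line = "user_42  logged in at 10:35"

# \d+ collects runs of digits instead of single digits.
numbers = re.findall(r"\d+", log_line)
# \s+ treats any run of whitespace as one separator.
fields = re.split(r"\s+", log_line)
# \w+ keeps the underscore but stops at ":" because ":" is not a word character.
identifiers = re.findall(r"\w+", log_line)

print(numbers)      # ['42', '10', '35']
print(fields)       # ['user_42', 'logged', 'in', 'at', '10:35']
print(identifiers)  # ['user_42', 'logged', 'in', 'at', '10', '35']
```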
Brackets and Groups
[]: Matches any single character within the brackets.
Example:
match = re.findall(r"[aeiou]", "hello world")
print(match) # ['e', 'o', 'o']
[^]: Matches any single character not within the brackets.
Example:
match = re.findall(r"[^aeiou]", "hello world")
print(match) # ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']
(): Groups multiple tokens together and captures the matched text.
Example:
match = re.search(r"(hello) (world)", "hello world")
print(match.groups()) # ('hello', 'world')
|: Alternation; matches either the pattern before or the pattern after the pipe symbol. re.search returns the leftmost match in the text.
Example:
match = re.search(r"hello|world", "hello world")
print(match.group()) # 'hello'
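Groups and alternation interact: parentheses both capture text and limit how far | reaches. The sketch below uses made-up sample strings to show individual group access and the difference between scoped and unscoped alternation:

```python
import re

# Capturing groups can be read together with groups() or one by one with group(n).
date_match = re.search(r"(\d+)-(\d+)-(\d+)", "Released on 2024-05-17.")
year, month, day = date_match.groups()
whole_date = date_match.group(0)  # group(0) is the entire match

# Inside parentheses, | only chooses between "a" and "e".
scoped = re.search(r"gr(a|e)y", "a grey cat")
# Without parentheses, | splits the whole pattern into "gra" or "ey".
unscoped = re.search(r"gra|ey", "a grey cat")

print(year, month, day)                 # 2024 05 17
print(whole_date)                       # 2024-05-17
print(scoped.group(), scoped.group(1))  # grey e
print(unscoped.group())                 # ey
```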
Quantifiers
Quantifiers define how many times the preceding element (character or group) must occur for a match to be found. The sections below cover the most common quantifiers with examples.
*: Matches Zero or More Occurrences
The * quantifier matches zero or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, including none.
Example:
import re
pattern = r"he*"
text = "heeello helo hi"
match = re.findall(pattern, text)
print(match) # Output: ['heee', 'he', 'h']
Explanation:
heee: Matches because h is followed by three e's.
he: Matches because h is followed by one e.
h: Matches because the h in "hi" is followed by zero e's.
+: Matches One or More Occurrences
The + quantifier matches one or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, but at least one must be present.
Example:
pattern = r"he+"
text = "heeello helo"
match = re.findall(pattern, text)
print(match) # Output: ['heee', 'he']
Explanation:
heee: Matches because h is followed by three e's.
he: Matches because h is followed by one e.
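The + quantifier applies to a whole group as well as to a single character. Here is a short sketch with made-up input strings that contrasts the two cases:

```python
import re

# "(ab)+" repeats the whole group "ab"; "ab+" only repeats the "b".
group_match = re.search(r"(ab)+", "xxababab-ab")
char_matches = re.findall(r"ab+", "xxababab-abbb")

print(group_match.group())   # ababab
# A repeated capturing group keeps only its last repetition.
print(group_match.group(1))  # ab
print(char_matches)          # ['ab', 'ab', 'ab', 'abbb']
```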
?: Matches Zero or One Occurrence
The ? quantifier matches zero or one occurrence of the preceding character or group. This means it will match at most one occurrence.
Example:
pattern = r"he?"
text = "heeello helo hi"
match = re.findall(pattern, text)
print(match) # Output: ['he', 'he', 'h']
Explanation:
he: Matches h plus one e at the start of "heeello"; the remaining e's are not consumed.
he: Matches because the h in "helo" is followed by one e.
h: Matches because the h in "hi" is followed by zero e's.
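A common use of ? is to accept spelling variants or an optional suffix. The sample strings in this small sketch are illustrative only:

```python
import re

# "u?" makes the "u" optional, so both spellings match; "colouur" does not.
spellings = re.findall(r"colou?r", "color, colour, colouur")
# "s?" accepts both "http" and "https".
schemes = re.findall(r"https?", "http://a.example and https://b.example")

print(spellings)  # ['color', 'colour']
print(schemes)    # ['http', 'https']
```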
{}: Exact Number or Range of Occurrences
The {} quantifier specifies the exact number or range of occurrences of the preceding character or group: {n} means exactly n occurrences, and {m,n} means between m and n occurrences.
Example:
pattern = r"he{2}"
text = "heeello helo"
match = re.findall(pattern, text)
print(match) # Output: ['hee']
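Explanation:
hee: Matches h plus exactly two e's. The third e in "heeello" is left out of the match, and "helo" does not match because its h is followed by only one e.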
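The range forms of {} can be sketched on the same text as the example above. This is a minimal illustration of {m,n} and the open-ended {m,} form:

```python
import re

text = "heeello helo"

# {1,2}: one or two e's, taking two when available (quantifiers are greedy).
one_to_two = re.findall(r"he{1,2}", text)
# {2,}: two or more e's, with no upper limit.
two_or_more = re.findall(r"he{2,}", text)

print(one_to_two)   # ['hee', 'he']
print(two_or_more)  # ['heee']
```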
.*, .*?: Greedy and Non-Greedy Quantifiers
.*: Greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as much text as possible.
.*?: Non-greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as little text as possible.
Greedy Example:
pattern = r"<.*>"
text = "<tag>content</tag>"
match = re.search(pattern, text)
print(match.group()) # Output: '<tag>content</tag>'
Explanation:
.*: Matches everything from the first < to the last >, resulting in the entire string <tag>content</tag> being matched.
Non-Greedy Example:
pattern = r"<.*?>"
text = "<tag>content</tag>"
match = re.search(pattern, text)
print(match.group()) # Output: '<tag>'
Explanation:
.*?: Matches as little as possible, stopping at the first >, resulting in <tag> being matched.
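With re.findall, the non-greedy form returns every tag rather than just the first. A negated character class is a common alternative that gives the same result here; the sketch below compares both on the same sample string and is not meant as an HTML parser:

```python
import re

html = "<tag>content</tag>"

# Non-greedy: each match stops at the nearest ">".
lazy_tags = re.findall(r"<.*?>", html)
# Negated class: "[^>]*" can never run past a ">", so no backtracking is needed.
negated_tags = re.findall(r"<[^>]*>", html)

print(lazy_tags)     # ['<tag>', '</tag>']
print(negated_tags)  # ['<tag>', '</tag>']
```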
Lookahead and Lookbehind Assertions
Lookahead and lookbehind assertions let you match a pattern only if it is followed or preceded by another pattern, without including that other pattern in the match. Because they check the surrounding text without consuming it, they are useful for writing precise patterns. Let's explore these assertions in detail with examples.
Positive Lookahead Assertion (?=...)
A positive lookahead assertion (?=...) ensures that the specified pattern exists after the current position without including it in the match.
Example:
import re
pattern = r"\w+(?=\s)"
text = "hello world"
match = re.search(pattern, text)
print(match.group()) # Output: 'hello'
Explanation:
\w+: Matches one or more alphanumeric characters.
(?=\s): Ensures that the matched characters are followed by a whitespace character.
The pattern matches hello because hello is followed by a space.
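A typical use of positive lookahead is to extract a value only when a specific unit follows it. This is a small sketch on a made-up CSS-like string:

```python
import re

css = "width: 100px; height: 50em; margin: 8px"  # hypothetical sample

# Digits are matched only when "px" follows; "px" itself is not part of the match.
pixel_values = re.findall(r"\d+(?=px)", css)

print(pixel_values)  # ['100', '8']
```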
Negative Lookahead Assertion (?!...)
A negative lookahead assertion (?!...) ensures that the specified pattern does not exist after the current position.
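Example:
pattern = r"\b\w+\b(?!\s)"
text = "hello world"
match = re.search(pattern, text)
print(match.group()) # Output: 'world'
Explanation:
\b\w+\b: Matches one whole word; the word boundaries stop the engine from settling for part of a word.
(?!\s): Ensures that the matched word is not followed by a whitespace character.
The pattern matches world because world is not followed by a space. Without the \b anchors, \w+(?!\s) would return 'hell': the engine backtracks by one character and finds that hell is followed by o, which is not whitespace.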
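Negative lookahead is handy for excluding items by what comes next. The sketch below uses made-up file names and keeps every name whose extension is not tmp:

```python
import re

files = "a.txt b.tmp c.py"  # hypothetical file list

# After the dot, (?!tmp\b) rejects the match if the extension is exactly "tmp".
kept = re.findall(r"\b\w+\.(?!tmp\b)\w+", files)

print(kept)  # ['a.txt', 'c.py']
```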
Positive Lookbehind Assertion (?<=...)
A positive lookbehind assertion (?<=...) ensures that the specified pattern exists before the current position without including it in the match.
Example:
pattern = r"(?<=\s)\w+"
text = "hello world"
match = re.search(pattern, text)
print(match.group()) # Output: 'world'
Explanation:
(?<=\s): Ensures that the matched characters are preceded by a whitespace character.
\w+: Matches one or more alphanumeric characters.
The pattern matches world because world is preceded by a space.
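Positive lookbehind works well for pulling out values that follow a fixed prefix. Note that Python's re module requires the lookbehind pattern to have a fixed width. A minimal sketch with a made-up sentence:

```python
import re

basket = "apples $3, pears $12, 7 plums"  # hypothetical sample

# Digits are matched only when a literal "$" sits directly before them.
prices = re.findall(r"(?<=\$)\d+", basket)

print(prices)  # ['3', '12']
```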
Negative Lookbehind Assertion (?<!...)
A negative lookbehind assertion (?<!...) ensures that the specified pattern does not exist before the current position.
Example:
pattern = r"(?<!\s)\w+"
text = "hello world"
match = re.search(pattern, text)
print(match.group()) # Output: 'hello'
Explanation:
(?<!\s): Ensures that the matched characters are not preceded by a whitespace character.
\w+: Matches one or more alphanumeric characters.
The pattern matches hello because hello is not preceded by a space.
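When negative lookbehind is used with re.findall, the match can start in the middle of a token unless the pattern is anchored. The following sketch, on a made-up sentence, shows the pitfall and one way to avoid it with \b:

```python
import re

order = "order 66 costs $20 plus $5 for 2 items"  # hypothetical sample

# Unanchored: the "0" in "$20" is preceded by "2", not "$", so it slips through.
unanchored = re.findall(r"(?<!\$)\d+", order)
# With \b the number must start at a word boundary, so "$20" is skipped entirely.
anchored = re.findall(r"(?<!\$)\b\d+", order)

print(unanchored)  # ['66', '0', '2']
print(anchored)    # ['66', '2']
```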
Conclusion
Regular expressions are a versatile tool for text processing, enabling you to match, search, and manipulate strings efficiently. By mastering these patterns and understanding how to use them in Python, you can perform complex text operations with ease. Whether you're parsing logs, validating input, or extracting data, regex provides the flexibility and power needed for a wide range of tasks.
Written by
Emeron Marcelle
As a doctoral scholar in Information Technology, I am deeply immersed in the world of artificial intelligence, with a specific focus on advancing the field. Fueled by a strong passion for Machine Learning and Artificial Intelligence, I am dedicated to acquiring the skills necessary to drive growth and innovation in this dynamic field. With a commitment to continuous learning and a desire to contribute innovative ideas, I am on a path to make meaningful contributions to the ever-evolving landscape of Machine Learning.