Understanding Regular Expressions in Python

Emeron Marcelle

Regular expressions (regex) are a powerful tool for matching patterns in text. Python's re module provides support for working with regular expressions, allowing you to search, match, and manipulate strings. In this post, we explore common regex patterns and how to use each of them in Python.

Basic Patterns

  1. .: Matches any single character except newline \n

     import re
     text = "hello world"
     match = re.findall(".", text)
     print(match)  # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
    
  2. ^: Matches the start of a string

     text = "hello world"
     match = re.search(r"^hello", text)  # re.match would anchor at the start even without ^
     print(match)  # <re.Match object; span=(0, 5), match='hello'>
    
  3. $: Matches the end of a string

     match = re.search("world$", text)
     print(match)  # <re.Match object; span=(6, 11), match='world'>
    
  4. \b: Matches a word boundary

     match = re.search(r"\bworld\b", text)
     print(match)  # <re.Match object; span=(6, 11), match='world'>
    
  5. \B: Matches a non-word boundary

     match = re.search(r"\Bworld", "hello world!")
     print(match)  # None, because "world" starts at a word boundary
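
The anchors and the dot above change behavior under two flags from the re module, re.MULTILINE and re.DOTALL. The following is a minimal sketch on made-up sample strings (the variable names are illustrative only); it also shows a case where \B does match.

```python
import re

text = "first line\nsecond line"

# By default, ^ anchors only at the very start of the string.
default_starts = re.findall(r"^\w+", text)
print(default_starts)  # ['first']

# With re.MULTILINE, ^ also anchors right after each newline.
multiline_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(multiline_starts)  # ['first', 'second']

# By default, . skips the newline; re.DOTALL lets it match that too.
default_dots = re.findall(r".", "a\nb")
dotall_dots = re.findall(r".", "a\nb", flags=re.DOTALL)
print(default_dots)  # ['a', 'b']
print(dotall_dots)   # ['a', '\n', 'b']

# \B matches between two word characters, so "ell" is found inside "hello".
inner = re.search(r"\Bell\B", "hello")
print(inner.group())  # 'ell'
```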
    

Character Classes

  1. \d: Matches any digit

     match = re.findall(r"\d", "123abc456")
     print(match)  # ['1', '2', '3', '4', '5', '6']
    
  2. \D: Matches any non-digit character

     match = re.findall(r"\D", "123abc456")
     print(match)  # ['a', 'b', 'c']
    
  3. \w: Matches any word character (letters, digits, and the underscore)

     match = re.findall(r"\w", "hello_world123")
     print(match)  # ['h', 'e', 'l', 'l', 'o', '_', 'w', 'o', 'r', 'l', 'd', '1', '2', '3']
    
  4. \W: Matches any non-word character (anything other than letters, digits, and the underscore)

     match = re.findall(r"\W", "hello world!")
     print(match)  # [' ', '!']
    
  5. \s: Matches any whitespace character

    match = re.findall(r"\s", "hello world")
    print(match)  # [' ']
    
  6. \S: Matches any non-whitespace character

    match = re.findall(r"\S", "hello world")
    print(match)  # ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
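
In practice, character classes are usually combined with a quantifier such as +, so that a run of characters comes back as one match. The sketch below uses an invented log line, and the variable names are assumptions for illustration. It also shows that \w is Unicode-aware for str patterns unless re.ASCII is passed.

```python
import re

log_line = "user_42 logged in at 09:15"

# \d+ groups consecutive digits into one match instead of one match per digit.
numbers = re.findall(r"\d+", log_line)
print(numbers)  # ['42', '09', '15']

# \w+ treats the underscore as a word character, so user_42 stays in one piece.
words = re.findall(r"\w+", log_line)
print(words)  # ['user_42', 'logged', 'in', 'at', '09', '15']

# For str patterns, \w also matches non-ASCII letters such as é (U+00E9).
unicode_words = re.findall(r"\w+", "caf\u00e9")
print(unicode_words)  # ['café']

# re.ASCII restricts \w to [a-zA-Z0-9_], so the é is left out.
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)
print(ascii_words)  # ['caf']
```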
    

Brackets and Groups

  1. []: Matches any single character within the brackets

    match = re.findall(r"[aeiou]", "hello world")
    print(match)  # ['e', 'o', 'o']
    
  2. [^]: Matches any single character not within the brackets

    match = re.findall(r"[^aeiou]", "hello world")
    print(match)  # ['h', 'l', 'l', ' ', 'w', 'r', 'l', 'd']
    
  3. (): Groups multiple tokens together and captures the matched text

    match = re.search(r"(hello) (world)", "hello world")
    print(match.groups())  # ('hello', 'world')
    
  4. |: Alternation; matches either the pattern before or after the pipe symbol

    match = re.search(r"hello|world", "hello world")
    print(match.group())  # 'hello'
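
Brackets and groups have a few common variations that the list above does not spell out: ranges inside brackets, named groups, and non-capturing groups. Here is a short sketch on made-up strings; the group names year and month are chosen for illustration only.

```python
import re

# A range inside brackets: [a-f0-9] matches one lowercase hex digit.
hex_digits = re.findall(r"[a-f0-9]", "#3fz9")
print(hex_digits)  # ['3', 'f', '9']

# Named groups (?P<name>...) let you read captures by name instead of position.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-06")
print(date.group("year"))  # '2024'
print(date.groupdict())    # {'year': '2024', 'month': '06'}

# (?:...) groups without capturing, which is handy for scoping alternation.
# Because nothing is captured, findall returns the whole match.
pets = re.findall(r"(?:cat|dog)s?", "cats and dog")
print(pets)  # ['cats', 'dog']
```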
    

Quantifiers

Quantifiers in regular expressions define how many times the preceding element (character or group) must occur for a match to be found. They provide flexibility and power in pattern matching. Let's delve deeper into the most common quantifiers with examples.

*: Matches Zero or More Occurrences

The * quantifier matches zero or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, including none.

Example:

import re

pattern = r"he*"
text = "heeello helo hi"

match = re.findall(pattern, text)
print(match)  # Output: ['heee', 'he', 'h']

Explanation:

  • heee: Matches because h is followed by three es.

  • he: Matches because h is followed by one e.

  • h: Matches because h is followed by zero es.

+: Matches One or More Occurrences

The + quantifier matches one or more occurrences of the preceding character or group. This means it will match as many occurrences as possible, but at least one must be present.

Example:

pattern = r"he+"
text = "heeello helo"

match = re.findall(pattern, text)
print(match)  # Output: ['heee', 'he']

Explanation:

  • heee: Matches because h is followed by three es.

  • he: Matches because h is followed by one e.

?: Matches Zero or One Occurrence

The ? quantifier matches zero or one occurrence of the preceding character or group. This means it will match at most one occurrence.

Example:

pattern = r"he?"
text = "heeello helo hi"

match = re.findall(pattern, text)
print(match)  # Output: ['he', 'he', 'h']

Explanation:

  • he: In heeello, h is followed by three es, but ? consumes at most one.

  • he: In helo, h is followed by one e.

  • h: In hi, h is followed by zero es.
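
To compare *, + and ? side by side, it can help to run all three against the same input. This is a minimal sketch on an invented string; the results dictionary is just a convenience for printing.

```python
import re

text = "h he hee heee"

# Run each quantifier against the same text and collect the matches.
results = {p: re.findall(p, text) for p in (r"he*", r"he+", r"he?")}

for pattern, matches in results.items():
    print(pattern, matches)

# he* ['h', 'he', 'hee', 'heee']   zero or more es
# he+ ['he', 'hee', 'heee']        at least one e, so the bare h is skipped
# he? ['h', 'he', 'he', 'he']      at most one e is consumed
```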

{}: Exact Number or Range of Occurrences

The {} quantifier specifies an exact number or a range of occurrences of the preceding character or group: {n} means exactly n, {m,n} means between m and n (inclusive), and {m,} means at least m.

Example:

pattern = r"he{2}"
text = "heeello helo"

match = re.findall(pattern, text)
print(match)  # Output: ['hee']

Explanation:

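  • hee: Matches because h is followed by at least two es in heeello; {2} consumes exactly two and leaves the third e out of the match. In helo, h is followed by only one e, so there is no match.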
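
The range forms of {} can be sketched the same way. The sample string and variable names below are made up for illustration. Note that {2} on its own still matches inside longer runs of e, so a word boundary \b is added where a whole-word match is wanted.

```python
import re

text = "he hee heee heeee"

# {2}: exactly two es are consumed, even when more follow.
exactly_two = re.findall(r"he{2}", text)
print(exactly_two)  # ['hee', 'hee', 'hee']

# Adding \b on both sides keeps only the word that is exactly "hee".
whole_word = re.findall(r"\bhe{2}\b", text)
print(whole_word)  # ['hee']

# {2,3}: between two and three es, as many as possible.
two_to_three = re.findall(r"he{2,3}", text)
print(two_to_three)  # ['hee', 'heee', 'heee']

# {2,}: at least two es, with no upper limit.
at_least_two = re.findall(r"he{2,}", text)
print(at_least_two)  # ['hee', 'heee', 'heeee']
```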

.*, .*?: Greedy and Non-Greedy Quantifiers

  • .*: Greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as much text as possible.

  • .*?: Non-greedy quantifier, matches zero or more occurrences of any character (except a newline), trying to match as little text as possible.

Greedy Example:

pattern = r"<.*>"
text = "<tag>content</tag>"

match = re.search(pattern, text)
print(match.group())  # Output: '<tag>content</tag>'

Explanation:

  • .*: Matches everything from the first < to the last >, resulting in the entire string <tag>content</tag> being matched.

Non-Greedy Example:

pattern = r"<.*?>"
text = "<tag>content</tag>"

match = re.search(pattern, text)
print(match.group())  # Output: '<tag>'

Explanation:

  • .*?: Matches as little as possible, stopping at the first >, resulting in <tag> being matched.
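
The difference becomes more visible with re.findall on a string that holds several tags. The sketch below uses a made-up snippet of markup; it also shows a negated character class as an alternative to the non-greedy form, and that ? can make other quantifiers such as + non-greedy too.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: one match that runs from the first < to the last >.
greedy = re.findall(r"<.*>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: each tag is matched on its own.
lazy = re.findall(r"<.*?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# A negated class gives the same result here without relying on laziness.
negated = re.findall(r"<[^>]*>", html)
print(negated)  # ['<b>', '</b>', '<i>', '</i>']

# +? is the non-greedy form of +: it stops after a single digit.
first_digit = re.search(r"\d+?", "12345").group()
print(first_digit)  # '1'
```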

Lookahead and Lookbehind Assertions

Lookahead and lookbehind assertions are advanced regular expression techniques that let you match a pattern only if it is followed or preceded by another pattern, without including that other pattern in the match. They are essential tools for creating complex and precise regex patterns. Let's explore these assertions in detail with examples.

Positive Lookahead Assertion (?=...)

A positive lookahead assertion (?=...) ensures that the specified pattern exists after the current position without including it in the match.

Example:

    import re

    pattern = r"\w+(?=\s)"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'hello'

Explanation:

  • \w+: Matches one or more word characters.

  • (?=\s): Ensures that the matched characters are followed by a whitespace character.

  • The pattern matches hello because hello is followed by a space.

Negative Lookahead Assertion (?!...)

A negative lookahead assertion (?!...) ensures that the specified pattern does not exist after the current position.

Example:

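    pattern = r"\w+\b(?!\s)"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'world'

Explanation:

  • \w+: Matches one or more word characters.

  • \b: Requires the match to end at a word boundary. Without it, \w+ would backtrack to hell, which is followed by o rather than whitespace, and the search would return 'hell'.

  • (?!\s): Ensures that the matched characters are not followed by a whitespace character.

  • The pattern matches world because world ends at a word boundary and is not followed by a space.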

Positive Lookbehind Assertion (?<=...)

A positive lookbehind assertion (?<=...) ensures that the specified pattern exists before the current position without including it in the match.

Example:

    pattern = r"(?<=\s)\w+"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'world'

Explanation:

  • (?<=\s): Ensures that the matched characters are preceded by a whitespace character.

  • \w+: Matches one or more word characters.

  • The pattern matches world because world is preceded by a space.

Negative Lookbehind Assertion (?<!...)

A negative lookbehind assertion (?<!...) ensures that the specified pattern does not exist before the current position.

Example:

    pattern = r"(?<!\s)\w+"
    text = "hello world"

    match = re.search(pattern, text)
    print(match.group())  # Output: 'hello'

Explanation:

  • (?<!\s): Ensures that the matched characters are not preceded by a whitespace character.

  • \w+: Matches one or more word characters.

  • The pattern matches hello because hello is not preceded by a space.
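
As a slightly more practical illustration, lookarounds are often used to pull out a value based on what surrounds it. The sketch below uses invented sample strings and variable names; it is one way to apply the assertions above, not the only one.

```python
import re

prices = "apple $3, pear $12, plum 7"

# Positive lookbehind: digits only when a dollar sign comes right before them.
dollar_amounts = re.findall(r"(?<=\$)\d+", prices)
print(dollar_amounts)  # ['3', '12']

# Negative lookbehind: digits not preceded by a dollar sign or another digit.
# Excluding a preceding digit stops the match from starting mid-number.
other_numbers = re.findall(r"(?<![$\d])\d+", prices)
print(other_numbers)  # ['7']

files = "report.pdf notes.txt data.pdf"

# Positive lookahead: file names followed by .pdf, without the extension.
pdf_names = re.findall(r"\w+(?=\.pdf)", files)
print(pdf_names)  # ['report', 'data']
```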

Conclusion

Regular expressions are a versatile tool for text processing, enabling you to match, search, and manipulate strings efficiently. By mastering these patterns and understanding how to use them in Python, you can perform complex text operations with ease. Whether you're parsing logs, validating input, or extracting data, regex provides the flexibility and power needed for a wide range of tasks.
