Dear Excel: It's Not Me, It's You

Shani RiversShani Rivers
11 min read

The moment I found myself staring at a formula that was so long that stretched across my screen that I needed to have a scroll bar for, I knew I had reached my breaking point with Excel.

It was peppered with nested IF statements, ISBLANK and CONCAT functions that looked like they were having an argument with each other.

=UPPER(CONCAT(F2," ",G2," ",H2,IF(ISBLANK(I2),"",CONCAT(" ",I2)),IF(ISBLANK(J2),"",CONCAT(" ",J2)),IF(ISBLANK(K2),"",CONCAT(" ",K2)),IF(ISBLANK(L2)," ",CONCAT(,L2)), TRIM(C2), ", ", TRIM(D2), " ", TRIM(LEFT(E2,5))))
=UPPER(CONCAT("('",F2," ",G2,"%",H2,IF(ISBLANK(I2),"", CONCAT(" ", I2)), IF(ISBLANK(J2),"", CONCAT("%", J2)), IF(ISBLANK(K2),"", CONCAT("%", K2)), IF(ISBLANK(L2),"%'),", CONCAT("%", L2))))

Excel and I had been together for years, but this address-cleaning project was the final straw.

I’ll admit it: I used to be that person who defended Excel to the death. ‘You can do anything in Excel!’ I’d proclaim, all while crafting increasingly elaborate formulas that resembled some sort of an ancient incantation by a spreadsheet sorceress — it’s time to admit that maybe, just maybe, I’m using the wrong tool for the job.

Excel is really good at reporting and sometimes even doing light data cleanup, but if you are cleaning addresses in the 100s or 1000s that came in through a form with absolutely the bare minimum of data restrictions and validations, there’s really only so much it can do.

But before I could use any of those massive functions above, I had to manually click a button to do a ‘Text to Columns’ on the street address and click around to make sure that it was the correct delimiter and manually update the datatypes for each of the parts.

And then and only then, could I use them.

The second one would do some formatting of the address parts so that it could add some wildcards after cleaning it up a bit so I could paste it into my SQL script.

After about the 20th time going through this process, I knew I could not or would not want to do this anymore, especially if I had to clean anything beyond 30 addresses.

It’ll make you want to chuck your laptop out a 10 story window.

And we all know that Excel has the occasional hissy fit if I try to copy and paste values to anything outside of it — or sometimes, even inside of it — for that matter.

This is where Python is pretty much better at everything.

Here’s the breakdown of my Python code, but feel free to skip to the end to see it in all it’s beautiful entirety.

Let’s dive in, shall we.

How to Escape the 6th Circle of Excel Hell

Even though those nightmare Excel formulas that I created earlier looked like they were written by someone having a mental breakdown — I’ll share with you the code that replaced that madness and manual labor with something that’s actually readable, modular, and maintainable.

What The Code Actually Does

Before we dive into each function, let’s set the backstory — you’ve got messy address data from a database query (why would anyone upload dirty data into a database is a story for another day), and you need to:

1. Clean it up — remove nulls, trim whitespaces, standardize street types and remove delimiting characters like newlines and tabs.

2. Format it nicely for any humans that might want to read them.

3. Create address SQL wildcards to add to a temp table to use in other query scripts.

Instead of wrestling with Excel’s “Text to Columns” form nonsense over and over again, or dragging formulas down thousands of rows every time — you just paste your data as a string (into a variable or not, if you are lazy) and call one function. Done.

Function #1: create_set() or The Foundation

What it does: Takes your copy-pasted data mess and turns it into a clean, sorted list with no duplicates. This is to mimic the columns and rows in the spreadsheet, as I pasted in the addresses directly from the database and into a docstring.

def create_set(some_dstr, substr_index=0):
    clean_str =  some_dstr.split('\n')    #1
    sort_set = {str(item).rstrip().upper() for item in clean_str if item != 'NULL'} #2
    if substr_index > 0:    #3
        sort_set = {item[:substr_index] for item in sort_set}
    for i in sorted(sort_set): #4
        print(i)
    return sorted(sort_set)

Step by step:

  1. some_dstr.split('\n') - Splits your pasted data at each line break

  2. The set comprehension {str(item).rstrip().upper() for item in clean_str if item != 'NULL'} does 5 things at once:

    • Converts everything to strings using the built-in str() function

    • Strips trailing whitespace with rstrip()

    • Makes everything uppercase, upper()

    • Filters out any "NULL" values with an inline if statement

    • Finally, it makes sure there are no duplicate addresses, because that’s what sets do.

  3. If you want just the first few characters of each item (like for pattern matching), substr_index handles that

  4. Prints everything sorted in a for loop, so you can see what you're working with

Why this beats clicking around in Excel: Because I refuse to manually select rows or columns and stuff.

Function #2: clean_address() or Tab-Separated Chaos, Solved

What it does: Takes addresses that are probably tab-separated (thanks, database query exports or copy-paste) and breaks them into clean components.

def clean_address(addr_list):
    clean_list = []
    for item in addr_list:
        wip_addr = str(item).split('\t') #1
        strip_addr = [str(item).rstrip() for item in wip_addr]  #2
        clean_addr = [x for x in strip_addr if x != "NULL"]    #3
        clean_list.append(clean_addr)    #4
    return clean_list

Step by step:

  1. str(item).split('\t') - Splits each address on tab characters (because that's how database results usually come out if you copy and paste them)

  2. [str(item).rstrip() for item in wip_addr] - Strips trailing spaces from each piece of the address and makes them strings, while you also do the list comprehension - 2 birds one stone.

  3. [x for x in strip_addr if x != "NULL"] - Throws out any "NULL" values

  4. Builds a list where each address is now a clean list of its components

Why this beats Excel: Remember the “Text to Columns” dance? Making sure you have enough columns to accommodate the splitting value results and then having to specify delimiters and data types for each and every piece — manually — by scrolling and clicking all the pieces to set them? Yeah, this does all that in 4 lines of code.

Function #3: pretty_print_addr_list() Or Making Things Human-Readable

What it does: Takes your cleaned address components and makes them look like actual addresses that humans can read.

def pretty_print_addr_list(addr_list):
    for addr in addr_list: # 1
        addr_str = ""
        for item in addr:
            addr_str = addr_str + " " + item. # 2
        print(addr_str) # 3

Step by step:

  1. Goes through each address in your list

  2. Joins all the components back together with spaces

  3. Prints each complete address on its own line

Why this beats Excel: No more concatenation formulas that look like =CONCAT(A1," ",B1," ",C1,IF(ISBLANK(D1),"",CONCAT(" ",D1))) and make you want to question your life choices.

Function #4: get_address_wildcards() - The SQL Code Magic Maker

What it does: Creates SQL wildcard patterns from street addresses for my queries.

def get_address_wildcards(addr_list):
    wildcard_list = []
    for item in addr_list:
        clean_street = str(item[0]).split(' ')
        strip_street = [str(part).strip() for part in clean_street]
        combine_street = ""

        for i in range(len(strip_street)):
            if i > 0:
                if strip_street[i] == 'RD':
                    combine_street += 'R%'
                elif strip_street[i] == 'BLVD':
                    combine_street += 'B%'
                elif strip_street[i] == 'STREET':
                    combine_street += 'ST%'
                else:
                    combine_street = combine_street + strip_street[i] + '%'
            else:
                combine_street = strip_street[i] + ' '

        wildcard_list.append(combine_street)

    addr_set = {addr for addr in wildcard_list}
    return sorted(addr_set)

Step by step:

  • Takes the first part of each address (the street part)

  • Splits it into individual words

  • For the first word, just adds a space

  • The common words for the pavement that cars drive on and replace them with a wildcard that will also catch their abbreviations:

    • "RD" becomes "R%" (to match "RD", "ROAD", etc.)

    • "BLVD" becomes "B%" (to match "BLVD", "BOULEVARD", etc.)

    • "STREET" becomes "ST%" (to match "ST", "STREET", etc.)

    • Everything else gets a "%" wildcard after it

  • Returns unique patterns, sorted

Why this beats Excel: Try building dynamic SQL wildcards with a bunch nested IF statements, actually don’t, it’s not pretty.

Function #5: address_cleaner() The One-and-Done Solution

What it does: Runs the entire pipeline with one function call.

def address_cleaner(dirty_addr_doctring):
    addr_set = create_set(dirty_addr_doctring, substr_index=0) #1
    clean_set = clean_address(addr_set)  # 2
    print('\n--- Pretty printed --- \n')
    pretty_print_addr_list(clean_set)    #3
    wild_set = get_address_wildcards(clean_set)    #4
    print('\n --- Wildcards --- \n')    #5
    for addr in wild_set:
        print("('" + addr + "'), ")    #6

The workflow:

  1. Takes your messy copy-pasted data

  2. Creates a clean, unique set

  3. Cleans the address components

  4. Prints the readable version

  5. Creates SQL wildcards

  6. Prints the wildcards in a format ready to paste into your SQL script

How to use it:

messy_address_data = """
123      MAIN ST     STE    123        ANYTOWN    CA    98766
123 MAIN STREET STE 123        ANYTOWN    CA    98766
456 OAK ROD  NULL                       SOMEWHERE  TX    10010
123               MAIN            STREET  NULL              ANYTOWN    CA          12345
789 MAPLE             BLVD           NULL                      NOWHERE    





FL                    09999
"""

address_cleaner(messy_address_data)
--- Pretty printed --- 
123 MAIN ST STE 123 ANYTOWN CA 98766
456 OAK RD SOMEWHERE TX 10010
123 MAIN ST ANYTOWN CA 12345
789 MAPLE BLVD NOWHERE FL 09999

--- Wildcards --- 
('123 MAIN%ST%STE%123%'),
('456 OAK R%'),
('123 MAIN ST%'),
('789 MAPLE B%')

And just like that… You get clean addresses and SQL-ready wildcards without touching a single Excel ribbon, button, form or function!

The Conclusion

Here’s one of the things that a data analyst does, is to figure out how to get data as clean as it can possibly be for reporting, and if you are dealing with hundreds of thousands of rows of data, Excel is a beast to work with.

This code replaces:

  1. Clicking through Excel’s “Text to Columns” form nonsense for every single batch of addresses I need to clean

  2. Dragging formulas down endless rows like it’s 1999

  3. Worrying about Excel’s row limits when you’re dealing with large datasets

  4. Maintaining complex concatenation formulas with their conditions

  5. Replacing the general urge to defenestrate your laptop with a zen-like sense of accomplishment.

Paste. Call function. Done.

If all this code looked confusing, I am working on a book that breaks down a lot of the code that you see here.

You can read the first few chapters of “Python for Doers” on Leanpub for free or feel free to dig around in Python’s documentation.

Happy cleaning, friends!

The whole script in all it’s glory

import csv

"""-------------------      GENERAL FUNCTIONS       ------------------"""
def create_set(some_dstr, substr_index=0, pretty_print=False):
    """Create a set of things. This could be anything really that you don't 
    have to worry about duplicates and can do some basic clean up on it, as 
    long as it has newline delimiters

    Args:
        some_dstr: a docstring of whatever it is
        substr_index: if you want to return just a substring of the items,
            this is useful if you need to build out wildcard values
    returns:
        sorted list of strings in the correct format to copy and paste 
        into script or use in another function
    """
    clean_str = some_dstr.split('\n')
    sort_set = {str(item).rstrip().upper() for item in clean_str if item != 'NULL'}
    if substr_index > 0:
        sort_set = {item[:substr_index] for item in sort_set}
    if pretty_print: 
        for i in sorted(sort_set):
            print(i)
    return sorted(sort_set)

def output_csv(list_of_list, file_name='output_writer.csv'):
    """Creates and outputs a csv file of the clean up, use this 
        to deal with list larger than a few hundred and you don't want
        your IDE to throw a hissy fit over

        args:
            list_of_list: This has to be a list of list of things
            file_name: This is optional, but it will continue to overwrite 
                       the same file if you don't specify a name
    """
    data_rows = [item for item in list_of_list]

    csv_file_path_writer = file_name
    with open(csv_file_path_writer, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data_rows)  # Write all rows at once

""" -----------------  ADDRESS FUNCTIONS ----------------- """
def clean_address(addr_list):
    """Cleans address list, removing whitespace from each item and any
    null string values.

    Args:
        addr_list: Address list
    returns:
        a clean list of addresses for use in another function
    """
    clean_list = []
    for item in addr_list:
        wip_addr = str(item).split('\t')
        strip_addr = [str(item).rstrip() for item in wip_addr]
        clean_addr = [x for x in strip_addr if x != "NULL"]
        clean_list.append(clean_addr)

    return clean_list

def pretty_print_addr_list(addr_list):
    """Formats an  an address list in a format for copying into script

    Args:
        addr_list: Takes clean address list
    returns:
        prints address in pretty formatting with no weird spacing to 
        copy and paste into script notes
    """
    addr_str = ""
    for addr in addr_list:
        component_str = ""
        for item in addr:
            component_str = component_str + " " + item
        addr_str = addr_str + component_str + '\n'

    return addr_str

def get_address_wildcards (addr_list):
    """
    Insert street part of address into a set. Does a split just in case there's
    weird spaces. Then puts the cleaned address pieces back together again.

    returns:
        street address with wildcard at end, some common street type 
        edits needed for constructing wildcards and formatting for script
    """
    wildcard_list = []
    for item in addr_list:
        clean_street = str(item[0]).split(' ')
        strip_street = [str(part).strip() for part in clean_street]
        combine_street = ""

        for i in range(len(strip_street)):
            if i > 0:
                if strip_street[i] == 'RD':
                    combine_street += 'R%'
                elif strip_street[i] == 'BLVD':
                    combine_street += 'B%'
                else:
                    combine_street = combine_street + strip_street[i] + '%'
            else:
                combine_street = strip_street[i] + ' '

        wildcard_list.append( combine_street)

    addr_set = {addr for addr in wildcard_list}
    return sorted(addr_set)

def run_address_cleaner(dirty_addr_doctring, pretty_print=True):
    """ Outputs and cleans addresses for script notes and for constructing 
        temp table forlarge script in one go. It takes the above 
        address functions and performs them all.

        args:
            some_docstring: some docstring of addresses directly copied 
                            from the database.
            pretty_print: will print out address to IDLE, otherwise if 
                          False, it will create a txt file of the pretty 
                          printed addresses
        returns:
            Default prints to IDLE, otherwise in 2 files, wildcards in a 
            csv and pretty in a txt, all formatting so all you need to do 
            is copy and paste into your script
    """

    addr_set = create_set(dirty_addr_doctring)
    clean_set = clean_address(addr_set)
    wild_set = get_address_wildcards(clean_set)
    pretty_print_text = pretty_print_addr_list(clean_set)
    if pretty_print:
        print('\n--- Pretty printed --- \n')
        print(pretty_print_text)

        print('\n --- Wildcards --- \n')
        for addr in wild_set:
            print( "('" + addr + "'), ")
    else:
        data_list = [["('" + addr + "'), "] for addr in wild_set]
        output_csv(data_list, "address.csv")

        with open("pretty_address.txt", 'w') as file:
            file.write(pretty_print_text)

Note: I did add some code to allow the option to output results as either a csv file for the wildcards and a txt file for the pretty printed address for any list of addresses that exceed a certain count, but it’s manual based on whether I feel like I want it to output as a file or not.

0
Subscribe to my newsletter

Read articles from Shani Rivers directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Shani Rivers
Shani Rivers

I'm a data enthusiast with web development and graphic design experience. I know how to build a website from scratch, but I sometimes work with entrepreneurs and small businesses with their Wix or Squarespace sites, so basically I help them to establish their online presence. When I'm not doing that, I'm studying and blogging about data engineering, data science or web development. Oh, and I'm wife and a mom to a wee tot.