Dear Excel: It's Not Me, It's You

Table of contents
- How to Escape the 6th Circle of Excel Hell
- Function #1: create_set() or The Foundation
- Function #2: clean_address() or Tab-Separated Chaos, Solved
- Function #3: pretty_print_addr_list() Or Making Things Human-Readable
- Function #4: get_address_wildcards() - The SQL Code Magic Maker
- Function #5: address_cleaner() The One-and-Done Solution
- The Conclusion

The moment I found myself staring at a formula that was so long that stretched across my screen that I needed to have a scroll bar for, I knew I had reached my breaking point with Excel.
It was peppered with nested IF statements, ISBLANK
and CONCAT
functions that looked like they were having an argument with each other.
=UPPER(CONCAT(F2," ",G2," ",H2,IF(ISBLANK(I2),"",CONCAT(" ",I2)),IF(ISBLANK(J2),"",CONCAT(" ",J2)),IF(ISBLANK(K2),"",CONCAT(" ",K2)),IF(ISBLANK(L2)," ",CONCAT(,L2)), TRIM(C2), ", ", TRIM(D2), " ", TRIM(LEFT(E2,5))))
=UPPER(CONCAT("('",F2," ",G2,"%",H2,IF(ISBLANK(I2),"", CONCAT(" ", I2)), IF(ISBLANK(J2),"", CONCAT("%", J2)), IF(ISBLANK(K2),"", CONCAT("%", K2)), IF(ISBLANK(L2),"%'),", CONCAT("%", L2))))
Excel and I had been together for years, but this address-cleaning project was the final straw.
I’ll admit it: I used to be that person who defended Excel to the death. ‘You can do anything in Excel!’ I’d proclaim, all while crafting increasingly elaborate formulas that resembled some sort of an ancient incantation by a spreadsheet sorceress — it’s time to admit that maybe, just maybe, I’m using the wrong tool for the job.
Excel is really good at reporting and sometimes even doing light data cleanup, but if you are cleaning addresses in the 100s or 1000s that came in through a form with absolutely the bare minimum of data restrictions and validations, there’s really only so much it can do.
But before I could use any of those massive functions above, I had to manually click a button to do a ‘Text to Columns’ on the street address and click around to make sure that it was the correct delimiter and manually update the datatypes for each of the parts.
And then and only then, could I use them.
The second one would do some formatting of the address parts so that it could add some wildcards after cleaning it up a bit so I could paste it into my SQL script.
After about the 20th time going through this process, I knew I could not or would not want to do this anymore, especially if I had to clean anything beyond 30 addresses.
It’ll make you want to chuck your laptop out a 10 story window.
And we all know that Excel has the occasional hissy fit if I try to copy and paste values to anything outside of it — or sometimes, even inside of it — for that matter.
This is where Python is pretty much better at everything.
Here’s the breakdown of my Python code, but feel free to skip to the end to see it in all it’s beautiful entirety.
Let’s dive in, shall we.
How to Escape the 6th Circle of Excel Hell
Even though those nightmare Excel formulas that I created earlier looked like they were written by someone having a mental breakdown — I’ll share with you the code that replaced that madness and manual labor with something that’s actually readable, modular, and maintainable.
What The Code Actually Does
Before we dive into each function, let’s set the backstory — you’ve got messy address data from a database query (why would anyone upload dirty data into a database is a story for another day), and you need to:
1. Clean it up — remove nulls, trim whitespaces, standardize street types and remove delimiting characters like newlines and tabs.
2. Format it nicely for any humans that might want to read them.
3. Create address SQL wildcards to add to a temp table to use in other query scripts.
Instead of wrestling with Excel’s “Text to Columns” form nonsense over and over again, or dragging formulas down thousands of rows every time — you just paste your data as a string (into a variable or not, if you are lazy) and call one function. Done.
Function #1: create_set()
or The Foundation
What it does: Takes your copy-pasted data mess and turns it into a clean, sorted list with no duplicates. This is to mimic the columns and rows in the spreadsheet, as I pasted in the addresses directly from the database and into a docstring.
def create_set(some_dstr, substr_index=0):
clean_str = some_dstr.split('\n') #1
sort_set = {str(item).rstrip().upper() for item in clean_str if item != 'NULL'} #2
if substr_index > 0: #3
sort_set = {item[:substr_index] for item in sort_set}
for i in sorted(sort_set): #4
print(i)
return sorted(sort_set)
Step by step:
some_dstr.split('\n')
- Splits your pasted data at each line breakThe set comprehension
{str(item).rstrip().upper() for item in clean_str if item != 'NULL'}
does 5 things at once:Converts everything to strings using the built-in
str()
functionStrips trailing whitespace with
rstrip()
Makes everything uppercase,
upper()
Filters out any "NULL" values with an inline
if
statementFinally, it makes sure there are no duplicate addresses, because that’s what sets do.
If you want just the first few characters of each item (like for pattern matching),
substr_index
handles thatPrints everything sorted in a
for
loop, so you can see what you're working with
Why this beats clicking around in Excel: Because I refuse to manually select rows or columns and stuff.
Function #2: clean_address()
or Tab-Separated Chaos, Solved
What it does: Takes addresses that are probably tab-separated (thanks, database query exports or copy-paste) and breaks them into clean components.
def clean_address(addr_list):
clean_list = []
for item in addr_list:
wip_addr = str(item).split('\t') #1
strip_addr = [str(item).rstrip() for item in wip_addr] #2
clean_addr = [x for x in strip_addr if x != "NULL"] #3
clean_list.append(clean_addr) #4
return clean_list
Step by step:
str(item).split('\t')
- Splits each address on tab characters (because that's how database results usually come out if you copy and paste them)[str(item).rstrip() for item in wip_addr]
- Strips trailing spaces from each piece of the address and makes them strings, while you also do the list comprehension - 2 birds one stone.[x for x in strip_addr if x != "NULL"]
- Throws out any "NULL" valuesBuilds a list where each address is now a clean list of its components
Why this beats Excel: Remember the “Text to Columns” dance? Making sure you have enough columns to accommodate the splitting value results and then having to specify delimiters and data types for each and every piece — manually — by scrolling and clicking all the pieces to set them? Yeah, this does all that in 4 lines of code.
Function #3: pretty_print_addr_list()
Or Making Things Human-Readable
What it does: Takes your cleaned address components and makes them look like actual addresses that humans can read.
def pretty_print_addr_list(addr_list):
for addr in addr_list: # 1
addr_str = ""
for item in addr:
addr_str = addr_str + " " + item. # 2
print(addr_str) # 3
Step by step:
Goes through each address in your list
Joins all the components back together with spaces
Prints each complete address on its own line
Why this beats Excel: No more concatenation formulas that look like =CONCAT(A1," ",B1," ",C1,IF(ISBLANK(D1),"",CONCAT(" ",D1)))
and make you want to question your life choices.
Function #4: get_address_wildcards()
- The SQL Code Magic Maker
What it does: Creates SQL wildcard patterns from street addresses for my queries.
def get_address_wildcards(addr_list):
wildcard_list = []
for item in addr_list:
clean_street = str(item[0]).split(' ')
strip_street = [str(part).strip() for part in clean_street]
combine_street = ""
for i in range(len(strip_street)):
if i > 0:
if strip_street[i] == 'RD':
combine_street += 'R%'
elif strip_street[i] == 'BLVD':
combine_street += 'B%'
elif strip_street[i] == 'STREET':
combine_street += 'ST%'
else:
combine_street = combine_street + strip_street[i] + '%'
else:
combine_street = strip_street[i] + ' '
wildcard_list.append(combine_street)
addr_set = {addr for addr in wildcard_list}
return sorted(addr_set)
Step by step:
Takes the first part of each address (the street part)
Splits it into individual words
For the first word, just adds a space
The common words for the pavement that cars drive on and replace them with a wildcard that will also catch their abbreviations:
"RD" becomes "R%" (to match "RD", "ROAD", etc.)
"BLVD" becomes "B%" (to match "BLVD", "BOULEVARD", etc.)
"STREET" becomes "ST%" (to match "ST", "STREET", etc.)
Everything else gets a "%" wildcard after it
Returns unique patterns, sorted
Why this beats Excel: Try building dynamic SQL wildcards with a bunch nested IF statements, actually don’t, it’s not pretty.
Function #5: address_cleaner()
The One-and-Done Solution
What it does: Runs the entire pipeline with one function call.
def address_cleaner(dirty_addr_doctring):
addr_set = create_set(dirty_addr_doctring, substr_index=0) #1
clean_set = clean_address(addr_set) # 2
print('\n--- Pretty printed --- \n')
pretty_print_addr_list(clean_set) #3
wild_set = get_address_wildcards(clean_set) #4
print('\n --- Wildcards --- \n') #5
for addr in wild_set:
print("('" + addr + "'), ") #6
The workflow:
Takes your messy copy-pasted data
Creates a clean, unique set
Cleans the address components
Prints the readable version
Creates SQL wildcards
Prints the wildcards in a format ready to paste into your SQL script
How to use it:
messy_address_data = """
123 MAIN ST STE 123 ANYTOWN CA 98766
123 MAIN STREET STE 123 ANYTOWN CA 98766
456 OAK ROD NULL SOMEWHERE TX 10010
123 MAIN STREET NULL ANYTOWN CA 12345
789 MAPLE BLVD NULL NOWHERE
FL 09999
"""
address_cleaner(messy_address_data)
--- Pretty printed ---
123 MAIN ST STE 123 ANYTOWN CA 98766
456 OAK RD SOMEWHERE TX 10010
123 MAIN ST ANYTOWN CA 12345
789 MAPLE BLVD NOWHERE FL 09999
--- Wildcards ---
('123 MAIN%ST%STE%123%'),
('456 OAK R%'),
('123 MAIN ST%'),
('789 MAPLE B%')
And just like that… You get clean addresses and SQL-ready wildcards without touching a single Excel ribbon, button, form or function!
The Conclusion
Here’s one of the things that a data analyst does, is to figure out how to get data as clean as it can possibly be for reporting, and if you are dealing with hundreds of thousands of rows of data, Excel is a beast to work with.
This code replaces:
Clicking through Excel’s “Text to Columns” form nonsense for every single batch of addresses I need to clean
Dragging formulas down endless rows like it’s 1999
Worrying about Excel’s row limits when you’re dealing with large datasets
Maintaining complex concatenation formulas with their conditions
Replacing the general urge to defenestrate your laptop with a zen-like sense of accomplishment.
Paste. Call function. Done.
If all this code looked confusing, I am working on a book that breaks down a lot of the code that you see here.
You can read the first few chapters of “Python for Doers” on Leanpub for free or feel free to dig around in Python’s documentation.
Happy cleaning, friends!
The whole script in all it’s glory
import csv
"""------------------- GENERAL FUNCTIONS ------------------"""
def create_set(some_dstr, substr_index=0, pretty_print=False):
"""Create a set of things. This could be anything really that you don't
have to worry about duplicates and can do some basic clean up on it, as
long as it has newline delimiters
Args:
some_dstr: a docstring of whatever it is
substr_index: if you want to return just a substring of the items,
this is useful if you need to build out wildcard values
returns:
sorted list of strings in the correct format to copy and paste
into script or use in another function
"""
clean_str = some_dstr.split('\n')
sort_set = {str(item).rstrip().upper() for item in clean_str if item != 'NULL'}
if substr_index > 0:
sort_set = {item[:substr_index] for item in sort_set}
if pretty_print:
for i in sorted(sort_set):
print(i)
return sorted(sort_set)
def output_csv(list_of_list, file_name='output_writer.csv'):
"""Creates and outputs a csv file of the clean up, use this
to deal with list larger than a few hundred and you don't want
your IDE to throw a hissy fit over
args:
list_of_list: This has to be a list of list of things
file_name: This is optional, but it will continue to overwrite
the same file if you don't specify a name
"""
data_rows = [item for item in list_of_list]
csv_file_path_writer = file_name
with open(csv_file_path_writer, mode='w', newline='') as file:
writer = csv.writer(file)
writer.writerows(data_rows) # Write all rows at once
""" ----------------- ADDRESS FUNCTIONS ----------------- """
def clean_address(addr_list):
"""Cleans address list, removing whitespace from each item and any
null string values.
Args:
addr_list: Address list
returns:
a clean list of addresses for use in another function
"""
clean_list = []
for item in addr_list:
wip_addr = str(item).split('\t')
strip_addr = [str(item).rstrip() for item in wip_addr]
clean_addr = [x for x in strip_addr if x != "NULL"]
clean_list.append(clean_addr)
return clean_list
def pretty_print_addr_list(addr_list):
"""Formats an an address list in a format for copying into script
Args:
addr_list: Takes clean address list
returns:
prints address in pretty formatting with no weird spacing to
copy and paste into script notes
"""
addr_str = ""
for addr in addr_list:
component_str = ""
for item in addr:
component_str = component_str + " " + item
addr_str = addr_str + component_str + '\n'
return addr_str
def get_address_wildcards (addr_list):
"""
Insert street part of address into a set. Does a split just in case there's
weird spaces. Then puts the cleaned address pieces back together again.
returns:
street address with wildcard at end, some common street type
edits needed for constructing wildcards and formatting for script
"""
wildcard_list = []
for item in addr_list:
clean_street = str(item[0]).split(' ')
strip_street = [str(part).strip() for part in clean_street]
combine_street = ""
for i in range(len(strip_street)):
if i > 0:
if strip_street[i] == 'RD':
combine_street += 'R%'
elif strip_street[i] == 'BLVD':
combine_street += 'B%'
else:
combine_street = combine_street + strip_street[i] + '%'
else:
combine_street = strip_street[i] + ' '
wildcard_list.append( combine_street)
addr_set = {addr for addr in wildcard_list}
return sorted(addr_set)
def run_address_cleaner(dirty_addr_doctring, pretty_print=True):
""" Outputs and cleans addresses for script notes and for constructing
temp table forlarge script in one go. It takes the above
address functions and performs them all.
args:
some_docstring: some docstring of addresses directly copied
from the database.
pretty_print: will print out address to IDLE, otherwise if
False, it will create a txt file of the pretty
printed addresses
returns:
Default prints to IDLE, otherwise in 2 files, wildcards in a
csv and pretty in a txt, all formatting so all you need to do
is copy and paste into your script
"""
addr_set = create_set(dirty_addr_doctring)
clean_set = clean_address(addr_set)
wild_set = get_address_wildcards(clean_set)
pretty_print_text = pretty_print_addr_list(clean_set)
if pretty_print:
print('\n--- Pretty printed --- \n')
print(pretty_print_text)
print('\n --- Wildcards --- \n')
for addr in wild_set:
print( "('" + addr + "'), ")
else:
data_list = [["('" + addr + "'), "] for addr in wild_set]
output_csv(data_list, "address.csv")
with open("pretty_address.txt", 'w') as file:
file.write(pretty_print_text)
Note: I did add some code to allow the option to output results as either a csv file for the wildcards and a txt file for the pretty printed address for any list of addresses that exceed a certain count, but it’s manual based on whether I feel like I want it to output as a file or not.
Subscribe to my newsletter
Read articles from Shani Rivers directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by
Shani Rivers
Shani Rivers
I'm a data enthusiast with web development and graphic design experience. I know how to build a website from scratch, but I sometimes work with entrepreneurs and small businesses with their Wix or Squarespace sites, so basically I help them to establish their online presence. When I'm not doing that, I'm studying and blogging about data engineering, data science or web development. Oh, and I'm wife and a mom to a wee tot.