Part 6: Build a word counter with Python


Up to now, our programs have interacted with the world through user input and printing to the screen. But what if you want your data to live on after you shut down your computer?
Today, we'll learn to read data from files and write to them. We'll also learn about command line arguments.
As always, I've structured the post around a standalone project. We'll first recreate a classic computer program to count the words and lines in a text file. If you use a Linux or Mac OS computer, you'll find this program by typing wc
in your terminal.
Next, we'll go one step further and find the number of occurrences of the 20 most used words in the file. We'll even visualize that with a simple text-based bar chart!
Finally, we'll save our results as a report in a file.
What's a file?
The question seems obvious. Everyone who has looked at a computer has heard about files. However, we need a more precise mental model since we will be working with them.
Files are ways to store data. You can close all your programs and shut down your computer. Your files will still be there when you reboot.
This is in contrast to data structures internal to your programs. Let's say you have a bunch of variables inside your program. Their "state" (i.e., their data) will disappear after your program exits.
Files are how you make data permanent.
In Python, files are "streams" of data. A stream is like a conveyor belt: you get the data sequentially. This contrasts with most other data structures, where you load everything at once in the RAM. For example, when you work with a string or a list, the whole thing is accessible. You don't have to go through list items 0 to 10 to access the eleventh item. Files are different. By default, you read or write to a file little by little (byte by byte, character by character, line by line, etc).
You might ask yourself why this matters. Sometimes, files are too big to load at once. For example, if you want to stream a 4K movie, you don't need the whole film, only the next images and sounds. You can watch the movie as it downloads instead of waiting. That's the gist of why we use streams.
First steps with files
First, let's create a small text file to try some simple file manipulation.
-----BEGIN input_file.txt-------
Hello, world!
This is a test input!
-----END input_file.txt---------
We interact with files through Python's built-in open
function. It takes many arguments, but we'll use two in our code:
a filename: a string, e.g.,
"my_filename.txt"
an optional mode: a code indicating how you're opening the file. For example,
'r'
is for 'read mode', 'w'
is for 'write mode', etc.
Let's see how we can open our little test file. Launch Python in the directory (folder) where you put your file.
>>> f = open('input_file.txt')
>>> f
<_io.TextIOWrapper name='input_file.txt' mode='r' encoding='UTF-8'>
We opened our file and stored the result in the f
variable. When we examine f
, we see a somewhat opaque TextIOWrapper
type. f
is a file object. By default, Python created it in "read" mode. Several reading methods exist. For example, f.readline()
reads a file line by line.
>>> f.readline()
'Hello, world!\n'
>>> f.readline()
'This is a test input!\n'
When we reach the end of the file, f.readline()
returns an empty string.
>>> f.readline()
''
As you can see, calling readline
repeatedly gets you successive lines. This illustrates the streaming nature of files.
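If you want to read a whole file this way, you can keep calling readline until it hands back the empty string. Here's a small sketch using our test file:
f = open('input_file.txt')
line = f.readline()
while line:              # the empty string at end-of-file stops the loop
    print(line, end='')  # each line already ends with '\n'
    line = f.readline()
f.close()                # more on closing files just below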
We close a file object using the close
method.
>>> f.close()
So now we have a workflow to deal with files:
opening a file
doing something to it (e.g., reading lines)
closing the file
This approach has flaws. The most glaring one is that humans are messy. They always forget to clean up after themselves and won't close their files after using them.
While this may seem benign, it will waste your system resources or cause errors. Speaking of errors, what if your program fails before you close the file? The close
method might never run.
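For the record, here's roughly what you'd have to write by hand to guarantee the file closes even if an error happens in the middle (just a sketch to show the problem):
f = open('input_file.txt')
try:
    first_line = f.readline()
finally:
    f.close()  # runs whether or not an error occurred above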
The standard Python approach to dealing with files is the "with open as" idiom. Look at how I read the two first lines in our input_file.txt
.
with open('input_file.txt') as f:
    f.readline()
    f.readline()
The first line means we create an f
variable holding a file object. The file closes automatically as soon as the indented block of code ends.
This will be our bread-and-butter approach to file handling from now on.
A simple word counter
I'll first create a testbed toy program, and we'll refine it later. It'll
go over each line of the file
split it into a list of words
count the words in the list
sum up each line's word count
"""
my_wc -- A word counter program
"""
wc = 0

with open("input_file.txt", "r") as f:
    for line in f:
        word_list = line.split(' ')
        word_list_len = len(word_list)
        wc += word_list_len

print(wc)
Let's go through this line by line.
First, we create a word count variable, wc
. Since we haven't started counting, we initialize it with 0.
The next line is in the "with open as" idiom we saw earlier. I hard-coded the filename for now. The "r"
argument means we're opening the file in "read-only" mode. This parameter is not 100% necessary, as "r"
is the implied default for open
. Keeping code explicit is a good idea, so I included it here.
The body of the open block is a for loop. When looping over a text file, the code iterates line by line.
The next three lines each do one simple thing:
Split the line into words using the space character as a delimiter. This creates the word_list list.
Measure the number of words. This is the length of word_list, which I call word_list_len.
Add the number of words to the counter.
Now that we've seen how it works, we can make the loop body more concise:
wc += len(line.split(' '))
This example shows you can read a file without using the readline
method. It's enough to loop through it and call the split
method for each line.
Command-line arguments
I coded the name of the file in the source code. This is fine when exploring a simple concept, but it makes the program inflexible. You want to be able to use it with whatever text file you have.
Enter command-line arguments.
Let's say you have named the program my_wc.py
. You would invoke it from the command line like this:
% python3 my_wc.py
Command line arguments are extra arguments you pass to a program when you invoke it. Like this:
% python3 my_wc.py input_file.txt
Here, input_file.txt
is the argument.
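If you're curious about what the program itself receives, the arguments arrive as a plain list of strings. Here's a tiny illustrative script (we won't use this approach in our program; it's just to show the mechanism) using the built-in sys module:
# show_args.py -- print whatever arguments the program was given
import sys

print(sys.argv)  # e.g. ['show_args.py', 'input_file.txt']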
There are several different possible solutions for command-line arguments. I'll show you the recommended standard.
The argparse module
The argparse
module is part of Python's built-in modules.
It has a 3-part workflow:
Create a "parser."
Add arguments to your parser
Use the parser
I'll take time to define parsers, as you'll probably come across them later. Sometimes, a program has to make sense of some text. For example, the Python interpreter has to understand the programs' structure to run them. A parser might not seem useful if we have only one command line argument (the filename) in our program. We'll add some more arguments later, though.
Importing argparse and creating the parser
import argparse
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
When creating a parser, we only make some scaffolding. ArgumentParser's arguments are metadata for the program, e.g., the program's name, a description, etc. You can check out all the possible arguments in the argparse documentation.
Adding an argument to the parser
Now that we've got scaffolding in place in the form of the parser variable, we'll add a new argument.
parser.add_argument("filename",
type=str,
help="filename")
This means we now run our program from the command line like this.
% python3 my_wc.py myfilename.txt
Python will understand that myfilename.txt
is our parser's "filename"
argument. type=str
means that we'll treat the argument as a string. You can find the full list of parameters for the add_argument
method in the argparse documentation.
Using the parser
Now that our parser is in place, we will use it.
args = parser.parse_args()

wc = 0
with open(args.filename, "r") as f:
    for line in f:
        word_list = line.split(' ')
        word_list_len = len(word_list)
        wc += word_list_len

print(wc)
We call the parse_args
method on our parser. The results are the arguments the user gave when launching the program. We store them in the args variable. For example, if you launch the program on our test file, args.filename
is now "input_file.txt"
.
Let's test it.
% python3 my_wc.py input_file.txt
7
A nice side effect of using argparse is that it generates help pages.
% python3 my_wc.py -h
usage: my_wc.py [-h] filename
word count and other file manipulation
positional arguments:
filename filename
options:
-h, --help show this help message and exit
Using real-life data
We have used a small two-line text file. It's now time to pitch our code against real-world data. We'll see if our code is slow or if there's any strange case we haven't considered.
I'll use the text of Moby Dick, which you can download from Project Gutenberg.
If you're on Linux or Mac OS, you can download the ebook from the command line like this:
% curl https://www.gutenberg.org/cache/epub/2701/pg2701.txt -o moby_dick.txt
Our program is trying to mimic the Unix wc
utility. If you're using a Linux or Mac OS computer, you can double-check our program's results. For example, the two commands below should give a similar number of words.
% wc moby_dick.txt
22314 215838 1276288 moby_dick.txt
wc
outputs several numbers. Here, you must look at the second number for the word count (215838). Now to our program:
% python3 my_wc.py moby_dick.txt
220156
Huh? We end up with a higher word count than our reference implementation.
Can you find the bug?
While writing the program, I had a nagging feeling that something was wrong and would come back to bite me. You see, I split each line with word_list = line.split(' ')
. I used spaces as a separator. Consider the example below. First, I create a string containing whitespace (spaces and tabs) and a few lone letters.
>>> whitespace_string = "\t \t \t a a a"
The number of 'words' (I'm using a loose definition of a word here) is three: the three a's. But if we split the string using spaces as separators, we get a different picture. Tabs are on the same level as letters!
>>> split_1 = whitespace_string.split(' ')
>>> split_1
['\t', '\t', '\t', 'a', 'a', 'a']
We need to use all and every whitespace as a separator:
>>> split_2 = whitespace_string.split()
>>> split_2
['a', 'a', 'a']
The split
method without arguments does a good job. Let's include it in the program: word_list = line.split()
.
Try it. Now it works!
This bug shows the importance of good-quality testing material. We didn't catch the bug when we only had a homebrew two-line test file. Things changed when we used the whole Moby Dick text.
Adding a line counter
We'll add line counting since we now have a stable word counter.
We don't have to change much. When reading the file, we'll only add a count variable and increment it at each loop iteration.
lines = 0
wc = 0

with open(args.filename, "r") as f:
    for line in f:
        lines += 1
        word_list = line.split()
        word_list_len = len(word_list)
        wc += word_list_len

print(f"lines: {lines}\nwords: {wc}")
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
Counting the most frequent words
Generating a token list
Our program is already useful, but I want to go beyond that. I want to plot the most frequent words in a text. For that, we must first turn our input text into discrete words. The jargon term for this is tokenization. A token is a unit of text resulting from splitting a sequence of text. Tokens can be words, subwords, characters, etc. In our case, tokens are words. For example, the sentence "The green cat eats the purple rat." gives the token list ["The", "green", "cat", "eats", "the", "purple", "rat."]
.
Here are the relevant parts of our code.
tokens = []

with open(args.filename, "r") as f:
    for line in f:
        lines += 1
        word_list = line.split()
        word_list_len = len(word_list)
        wc += word_list_len
        tokens += word_list
And here's what the tokens list looks like if we call our program on its own source code. (the program output below is cropped)
['"""', 'wc', 'word', 'frequency', '"""', 'import', 'argparse',
///...///
Cleaning the tokens
As you can see, some tokens don't look like words at all ('"""', for example). This poses two problems:
we get non-words into the list ('"""')
some words could have several forms (capitalized vs non-capitalized, for example).
We want to purge the tokens
list from unwanted characters.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r')
    if cleaned_token:
        cleaned.append(cleaned_token)
This block of code
goes through each token in the list
strips it of unwanted characters
checks if there are still some characters left after stripping (if cleaned_token:)
if so, adds the clean token to the cleaned list
The strip
method returns a copy of the string with any leading and trailing characters listed in its argument string removed. Here, we remove commas, periods, colons, semicolons, backslashes, parentheses, single and double quotes, and whitespace characters (tab, newline, carriage return).
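Note that strip only touches the beginning and the end of a string; characters in the middle stay where they are. A quick check in the interpreter (with made-up tokens):
>>> '"whale,'.strip(',.;:\\()\'\"\t\n\r')
'whale'
>>> "isn't".strip(',.;:\\()\'\"\t\n\r')
"isn't"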
Below is the list of cleaned tokens from calling our program using its source code.
['wc', 'word', 'frequency', 'import', 'argparse', 'import', 'pprint', 'parser',
///...///
Counting occurrences for each token
Now, our list of tokens is somewhat clean. Next, we'll sort them. For that, we need a sorting value. This value is the number of occurrences of each token.
Let's create a dictionary, linking each token to its number of occurrences. To do this, we will:
go through each token in
cleaned
check whether there's already a corresponding entry in the dictionary
either increment an existing value or create a new entry and set its value to 1.
occurrences = {}
for clean_token in cleaned:
    if occurrences.get(clean_token):
        occurrences[clean_token] += 1
    else:
        occurrences[clean_token] = 1
The get
method of a dictionary (like in occurrences.get(clean_token)
) returns the value associated with a key, or None if the key isn't in the dictionary.
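Here's the behavior in the interpreter, with a made-up dictionary:
>>> occurrences = {'whale': 3}
>>> occurrences.get('whale')
3
>>> print(occurrences.get('ahab'))  # missing key: get returns None
None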
Here's a sample of our dictionary when we feed Moby Dick to the program:
///...///
'rates': 1,
'replenish': 1,
'reservoir': 3,
'good?': 1,
'hazards': 1,
'victory': 1,
'on—one': 1,
'serving': 2,
'pulsations': 1,
'truthfully': 1,
'obliterated': 1,
///...///
We chose a dictionary since token-occurrence pairs map well to a key-value structure. Tokens are the keys, and occurrences are the values. In the next section, we'll sort tokens by frequency of appearance. Because dictionaries aren't designed for sorting, we'll have to use lists again.
Sorting tokens by number of occurrences
We'll sort tokens in two distinct steps:
Create a list of sorted tokens without the token count.
Create a sorted list of token-count tuples.
Here's the first step:
sorted_tokens = sorted(occurrences, key=occurrences.get, reverse=True)
We start from the dictionary of occurrences
and use the values as a sorting key with occurrences.get
. This means that the tokens will appear in order of frequency, from most to least frequent.
Now, we'll use the sorted token list to associate each token with their count. The general structure would be [(token_string, occurrences), ...]
.
Here's the code:
sorted_occurrence_pairs = [(token, occurrences[token]) for token in sorted_tokens]
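To make the two steps concrete, here's what they do to a small made-up dictionary:
>>> occurrences = {'whale': 3, 'the': 10, 'ahab': 5}
>>> sorted_tokens = sorted(occurrences, key=occurrences.get, reverse=True)
>>> sorted_tokens
['the', 'ahab', 'whale']
>>> [(token, occurrences[token]) for token in sorted_tokens]
[('the', 10), ('ahab', 5), ('whale', 3)]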
Let's see how our code fares against real-life data with formatted output. To fit the output in a small terminal window, I want to see only the first twenty most frequent tokens.
print("The twenty most frequent words are...")
for word, occ in sorted_occurrence_pairs[0:20]:
print(f"{word:<30} appeared {occ} times")
Remember that you can use a special-purpose mini-language to format strings. {word:<30}
means that the word
variable will be left-aligned in a 30-character-wide field. Calling the program on the Moby Dick text file gives some reasonable-looking output.
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
The twenty most frequent words are...
the appeared 13872 times
of appeared 6671 times
and appeared 6070 times
to appeared 4582 times
a appeared 4550 times
in appeared 3963 times
that appeared 2829 times
his appeared 2456 times
it appeared 2023 times
I appeared 1840 times
with appeared 1707 times
is appeared 1700 times
was appeared 1627 times
as appeared 1608 times
he appeared 1522 times
for appeared 1419 times
all appeared 1412 times
this appeared 1278 times
at appeared 1238 times
by appeared 1160 times
Correcting a small bug during token normalization
Oops! I realized there's an "I" token in the last output.
I had forgotten to turn all the text to lowercase. This is important because we want a word <-> token equivalence. Let me explain with an example. Now, our code views "Cat", "cAt", "cat", "CAT", etc. as different tokens, each with its own occurrence count.
I was too lax during the normalization process. Normalization is when you turn tokens into a standard form.
It's a simple error to fix: when we create the list of cleaned
tokens, call the lower
method on tokens.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r').lower()
    if cleaned_token:
        cleaned.append(cleaned_token)
Re-run the program to see if it yields different results. For example, how often does the word ‘the’ appear now?
Creating the word frequency histogram
We want to display the most frequent words in graphical form. A good first step is to draft a mockup. We'll have the most frequent words on top of each other, with a number and a bar to their right.
word_1 : count_1 | ****************
word_2 : count_2 | ***********
word_3 : count_3 | *******
... ... ...
word_20 : count_20 | *
Generating word frequencies
We won't be able to use raw word counts for the bar lengths.
For example, in our copy of Moby Dick, the word "the" appears 14,521 times. A bar that long won't fit on a screen, so we will reduce the counts to frequencies.
twenty_freqs = [(token, occ/wc) for token, occ in sorted_occurrence_pairs[0:20]]
twenty_freqs
takes the twenty most frequent tokens and their occurrences. We divide each occurrence count by the total word count and bundle the result with the token. For our Moby Dick example, every frequency ends up being a small fraction below one.
[('the', 0.06727730983422753),
('of', 0.031092764017457537),
('and', 0.02968430026223371),
('a', 0.02160879919198658),
('to', 0.02154393572957496),
('in', 0.019380275947701517),
('that', 0.013375772570168367),
('his', 0.011642991502886424),
('it', 0.01057737747755261),
('i', 0.008524912202670522),
('with', 0.008163530054948619),
('but', 0.008066234861331184),
('as', 0.007992105190003615),
('is', 0.007955040354339828),
('he', 0.007922608623134018),
('was', 0.00758439199770198),
('for', 0.007473197490710626),
('all', 0.006773598717556686),
('this', 0.006412216569834784),
('at', 0.006111064780066532)]
You can't draw a chart bar from an unwieldy number like 0.007955040354339828. We'll turn the numbers into reasonable integers.
A low-effort transformation on our data is multiplying it by 1000 and rounding the result.
twenty_freqs = [(token, round(1000*occ/wc)) for token, occ in sorted_occurrence_pairs[0:20]]
Drawing a chart
The twenty_freqs
list of pairs doesn't give us all the info we need for display. Below is a rework that will be more useful.
twenty_freqs = [(token, occ, round(1000*occ/wc)) for token, occ in sorted_occurrence_pairs[0:20]]
As you see, we now have access to token
strings, their number of occurrences, and a bar length (round(1000*occ/wc)
).
def bar(bar_length):
    bar_string = ''
    for i in range(0, bar_length):
        bar_string = bar_string + "*"
    return bar_string

print("The twenty most frequent words are...")
for word, occ, bar_length in twenty_freqs:
    print(f"{word:<5}: {occ}|", end='')
    print(bar(bar_length))
Here's what the output looks like.
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
The twenty most frequent words are...
the : 14521|*******************************************************************
of : 6711|*******************************
and : 6407|******************************
a : 4664|**********************
to : 4650|**********************
in : 4183|*******************
that : 2887|*************
his : 2513|************
it : 2283|***********
i : 1840|*********
with : 1762|********
but : 1741|********
as : 1725|********
is : 1717|********
he : 1710|********
was : 1637|********
for : 1613|*******
all : 1462|*******
this : 1384|******
at : 1319|******
Refactoring the program
Our program works, but it's an ugly mess. It's time to refactor the code. Before we move on, it is a good idea to make sure we all have the same version of the program.
Command line arguments code
No real change for now.
import argparse
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
parser.add_argument("filename",
                    type=str,
                    help="filename")
args = parser.parse_args()
Reading the file
We did a lot of work inside the open
block. I like to do as little as possible while the file is in use. Our old code wasn't that bad, and our next version might even be less efficient, but I like keeping concerns separate. On one side, we do file I/O and only file I/O (reading inside an open block). On the other side, we process the data outside the open block.
with open(args.filename, "r") as f:
    text = f.read()
The read
method reads the whole file at once, unlike readline
, which reads only the next line.
Processing the data
Now, we have the text as a string in the text
variable. We will extract the number of lines, create a tokens
list, and calculate the number of words. It's now three easy lines of code. Much simpler!
lines = text.count('\n') # one newline char per line
tokens = text.split()
wc = len(tokens)
print(f"lines: {lines}\nwords: {wc}")
Token normalization
The code responsible for cleaning tokens can stay the same.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r').lower()
    if cleaned_token:
        cleaned.append(cleaned_token)
We can also use a simple list comprehension to get the same results.
cleaned = [token.strip(',.;:\\()\'\"\t\n\r').lower() for token in tokens if token.strip(',.;:\\()\'\"\t\n\r').lower()]
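As an aside, if calling strip and lower twice per token bothers you, the assignment expression (the := "walrus" operator, available since Python 3.8) lets the comprehension clean each token once and reuse the result. This is just an alternative sketch; the rest of the post doesn't depend on it.
cleaned = [clean for token in tokens
           if (clean := token.strip(',.;:\\()\'\"\t\n\r').lower())]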
Counting occurrences
Next, we can simplify the code counting token occurrences.
We'll use the get
method on the dictionary. This method can take an optional second argument: the value to return when the key is not found. For example, occurrences.get(clean_token, 0)
means "either get the value associated with the clean_token
key or return 0
if the key's not found".
occurrences = {}
for clean_token in cleaned:
    occurrences[clean_token] = occurrences.get(clean_token, 0) + 1
To recap, either add one to the existing number of occurrences for each cleaned token or set it to one.
Sorting by occurrence count
Before refactoring, we were using two distinct steps to do this:
Create a sorted list of tokens with sorted_tokens.
Create a sorted list of token-occurrence pairs with sorted_occurrence_pairs.
We can tighten the code into one single step.
sorted_frequency_pairs = sorted(occurrences.items(), key=lambda x: x[1], reverse=True)
The items
method, when you call it on a dictionary (like occurrences
), returns a list-like object. It's not an actual list, but for all intents and purposes, you can treat it like one.
In our case, each item is a (token, count) tuple. We index into it to extract the occurrence count and use it as the key for the sorted
function. In the code, this is key=lambda x: x[1]
.
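On the same toy dictionary we used earlier, the one-step version looks like this:
>>> occurrences = {'whale': 3, 'the': 10, 'ahab': 5}
>>> sorted(occurrences.items(), key=lambda x: x[1], reverse=True)
[('the', 10), ('ahab', 5), ('whale', 3)]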
Displaying the graph
Finally, here's the code for output.
twenty_rel_occurrences = [(token, freq, round(1000*freq/wc)) for token, freq in sorted_frequency_pairs[0:20]]
print("The twenty most frequent words are...")
for word, freq, bar_length in twenty_rel_occurrences:
print(f"{word:<5}: {freq:<6}|{'*'*bar_length}")
We can repeat a string by multiplying it by a number, like in "*" * bar_length
.
Saving results to a file
We started with files, so we'll end with files. Since we spent so much time creating a nicely formatted output, we'll want to keep it for later.
Specifying an output file
We have to ask the user for a file name to save our results in a file. You've already seen how to add a command line argument before. Let's adapt our code to make space for another one.
There's no change when creating the parser.
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
Here's the first command-line argument we created.
parser.add_argument("input_file",
type=str,
help="input file")
Now, we will do something very similar to get the output filename.
parser.add_argument("-o",
type=str,
required=False,
help="output file")
The behavior of this argument is different from the previous one. First, this will be optional, as the required
flag indicates. The "-o"
is a "prefix" for our filename. This is so that our parser knows that the string following it is the output arg.
You would call the program like this:
% python3 my_wc.py moby_dick.txt -o moby_report.txt
moby_dick.txt
is the input file, and moby_report.txt
is the output file.
As you can see, to add a new command line argument, you call the add_argument
method of the parser.
Finally, call parse_args
.
args = parser.parse_args()
Saving results
When referencing the input or output file, use args.input_file
or args.o
. In our program, that would be:
with open(args.input_file, "r") as f:
    text = f.read()
We can create a report and save it later if needed:
report = f"lines: {lines}\nwords: {wc}\n"
report = report + "The twenty most frequent words are...\n"
for word, freq, bar_length in twenty_freqs:
report = report + f"{word:<5}: {freq:<6}|{'*'*bar_length}\n"
If the user specified an output file, save the report
to it; else, print it to the screen.
if args.o is not None:
    with open(args.o, 'w') as f:
        f.write(report)
else:
    print(report)
We use "w"
as an argument to the open
function because we want to open the file in "write" mode. This contrasts with how we used open in "read" mode earlier.
The write
method takes as an argument the data to write. Here, f.write(report)
means you write the report
to f
.
It's all done! We now have a working command-line program that has a tangible impact on the world (the world = your hard drive).
Recap
Files as streams
Python treats files as streams of data. The data can be bytes (not covered in this post) or characters.
A stream is like a conveyor belt. You access data in sequence rather than all at once.
The general idea is that you don't jump around inside a file's content. You only access it little by little, in order. The seek method lets you get around this limitation. We didn't need it for our project, and I didn't want to overcomplicate the post, but you can read about it in the Python documentation.
Streaming avoids memory problems when accessing large files. A good real-life example is when streaming video over an internet connection. You want to start watching your content as soon as possible rather than after a long download.
Common stream-like interfaces are files, network connections, etc.
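As a small sketch of stream-style reading, here's how you could go through a large file a fixed number of characters at a time instead of all at once (the chunk size is arbitrary):
total_chars = 0
with open("moby_dick.txt") as f:
    while True:
        chunk = f.read(4096)   # at most 4096 characters per call
        if not chunk:          # empty string means end of file
            break
        total_chars += len(chunk)
print(total_chars)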
Opening files
You must open files before using their contents.
The open
function takes the filename as a string argument. For example, open("my_file.txt")
.
Among several possible optional arguments, open takes a file mode argument:
"r" to read a file
"w" to write to a file (it overwrites existing contents!)
"a" to append data to the end of a file
"x" to create a file
"b" to open a file containing binary data
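For example, here's the difference between "w" and "a" on a throwaway file (the filename is just for illustration):
with open("demo.txt", "w") as f:   # "w" starts from an empty file
    f.write("first line\n")
with open("demo.txt", "a") as f:   # "a" keeps what's already there
    f.write("second line\n")
# demo.txt now contains both lines; opening with "w" again would erase them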
After you open a file, you get a file object. This file object is usable only for the file mode you specified. If you opened it in read mode, you can't write to it.
You need to close each file you have opened. The basic way to do that is with the close
method, but there is a better way. Use the with open(...) as f:
idiom. You won't need to remember to close the file.
Reading from files
Once you've opened a file in read mode, there are several ways to read its contents.
The .readline() method reads a single line, including its trailing \n character.
Looping: for line in f: goes over each line in order, provided you named the file object f.
The .readlines() method returns a list of all lines.
The .read() method returns the whole file as a string.
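Here's a quick comparison on our two-line test file from the beginning of the post:
with open('input_file.txt') as f:
    all_lines = f.readlines()   # ['Hello, world!\n', 'This is a test input!\n']

with open('input_file.txt') as f:
    whole_text = f.read()       # 'Hello, world!\nThis is a test input!\n'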
Which method you use depends on the particular use case. In this post, we first used a looping construct. When the loop's body grew too large, we switched to reading the whole file at once and moved the logic out of the file-handling block.
Writing to files
Writing to files is straightforward: use the write
method on a file object. It takes the string you want to write to the file as an argument.
If you combine several lines, make sure each one ends with a newline character ('\n'). Otherwise, they will run together into one single enormous line.
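A small sketch of the difference (the filenames are made up for the example):
words = ["whale", "ahab", "sea"]

with open("run_on.txt", "w") as f:
    for word in words:
        f.write(word)          # run_on.txt contains "whaleahabsea"

with open("one_per_line.txt", "w") as f:
    for word in words:
        f.write(word + "\n")   # one word per line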
Command-line arguments
A command-line program is a program you launch inside and from a terminal window. It works in text mode. Explaining the command line in detail would be too big of a tangent here, but plenty of introductions exist if you want to know more.
We used the argparse
module to pass arguments to the program through the command line. It's part of Python's standard library, so you don't need to install anything.
Here's the three-step workflow we used:
Create a parser.
Add arguments to the parser.
Store the arguments in a variable.
We create a parser using the argparse.ArgumentParser
function. During this step, you can specify info about your program, such as the program's name or a description.
Add arguments to the parser, one by one, with the add_argument
method. Its first argument is the argument name (for example, in our program, "input_file"
and "-o"
). Optional arguments usually start with -
and are not required by default. You can make them required by setting the required=True
flag.
Finally, we call the parse_args
method on the parser and store the results in a variable. If we name this variable args
, we can access specific command-line arguments by their name. For example, args.arg_1
, args.arg_2
, args.arg_3
, etc.
Text tokenization
We call tokenization the act of subdividing a string into useable substrings. In our example, we created a list of words with the split
method. This method takes a separator string as an argument. For example, split(' ')
splits on spaces, split('\t')
splits on tabs, etc. If you don't provide any argument, you get the default behavior of splitting on all whitespace.
It can be useful to clean up your tokens after splitting. For example, by stripping them of unwanted characters with strip
or converting to lowercase with lower
.
Exercises
Count characters: Adapt the program to count how many characters are in the file. The number should include whitespace and punctuation. Print that number.
Show average words per line: Compute the average number of words per line and print it. Calculate the result using the line count and the total number of words.
Display the longest line: Find the longest line in the file (highest number of characters) and print it.
Add a --top N argument: Change the program to accept an optional command-line argument. It will hold the number of most frequent words to display in the histogram and final report. Good argument names to use are --top or -t. Refer to the argparse docs to learn how command-line arguments can have a default value.
Ignore common short words: Add a filter to ignore words shorter than 3 characters. How does this change the top 20 list?
Include punctuation in frequency counts: What if you don’t clean the tokens? Try printing a top-20 list without stripping or lowercasing. How messy does it get?
Filter out common “stop words”: Create a small list of common words (like “the”, “and”, “a”, “of”, etc.) and exclude them from your top-20 list. How does that affect the results?
Compare two files: Write a program version that takes two input files. It will compare which words are the most frequent in those files.
Show a reverse histogram: Print the bar chart in descending rows with the most frequent word at the bottom. Can you do it by reversing your loop?
Add a search feature: Add a new argument like
--search some_word
that prints how often a particular word appears in the text.