Part 6: Build a word counter with Python


Up to now, our programs have interacted with the world through user input and printing to the screen. But what if you want your data to live on after you shut down your computer?
Today, we'll learn to read data from files and write to them. We'll also learn about command line arguments.
As always, I've structured the post around a standalone project. We'll first recreate a classic computer program to count the words and lines in a text file. If you use a Linux or Mac OS computer, you'll find this program by typing wc
in your terminal.
Next, we'll go one step further and find the number of occurrences of the 20 most used words in the file. We'll even visualize that with a simple text-based bar chart!
Finally, we'll save our results as a report in a file.
What's a file?
The question seems obvious. Everyone who has looked at a computer has heard about files. However, we need a more precise mental model since we will be working with them.
Files are ways to store data. You can close all your programs and shut down your computer. Your files will still be there when you reboot.
This is in contrast to data structures internal to your programs. Let's say you have a bunch of variables inside your program. Their "state" (i.e., their data) will disappear after your program exits.
Files are how you make data permanent.
In Python, files are "streams" of data. A stream is like a conveyor belt: you get the data sequentially. This contrasts with most other data structures, where you load everything at once in the RAM. For example, when you work with a string or a list, the whole thing is accessible. You don't have to go through list items 0 to 10 to access the eleventh item. Files are different. By default, you read or write to a file little by little (byte by byte, character by character, line by line, etc).
You might ask yourself why this matters. Sometimes, files are too big to load at once. For example, if you want to stream a 4K movie, you don't need the whole film, only the next images and sounds. You can watch the movie as it downloads instead of waiting. That's the gist of why we use streams.
First steps with files
First, let's create a small text file to try some simple file manipulation.
-----BEGIN input_file.txt-------
Hello, world!
This is a test input!
-----END input_file.txt---------
We interact with files through Python's built-in open
function. It takes many arguments, but we'll use two in our code:
a filename: a string, e.g.,
"my_filename.txt"
an optional mode: a code indicating how you're opening the file. For example,
'r'
is for 'read mode', 'w'
is for 'write mode', etc.
Let's see how we can open our little test file. Launch Python in the directory (folder) where you put your file.
>>> f = open('input_file.txt')
>>> f
<_io.TextIOWrapper name='input_file.txt' mode='r' encoding='UTF-8'>
We opened our file and stored the result in the f
variable. When we examine f
, we see a somewhat opaque TextIOWrapper
type. f
is a file object. By default, Python created it in "read" mode. Several reading methods exist. For example, f.readline()
reads a file line by line.
>>> f.readline()
'Hello, world!\n'
>>> f.readline()
'This is a test input!\n'
When we reach the end of the file, f.readline()
returns an empty string.
>>> f.readline()
''
As you can see, calling readline
repeatedly gets you successive lines. This illustrates the streaming nature of files.
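If you want to read a whole file this way, you can keep calling readline until it hands back the empty string. Here's a small sketch using our test file:
f = open('input_file.txt')
line = f.readline()
while line:              # the empty string at end-of-file stops the loop
    print(line, end='')  # each line already ends with '\n'
    line = f.readline()
f.close()                # more on closing files just below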
We close a file object using the close
method.
>>> f.close()
So now we have a workflow to deal with files:
opening a file
doing something to it (e.g., reading lines)
closing the file
This approach has flaws. The most glaring one is that humans are messy. They always forget to clean up after themselves and won't close their files after using them.
While this may seem benign, it will waste your system resources or cause errors. Speaking of errors, what if your program fails before you close the file? The close
method might never run.
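For the record, here's roughly what you'd have to write by hand to guarantee the file closes even if an error happens in the middle (just a sketch to show the problem):
f = open('input_file.txt')
try:
    first_line = f.readline()
finally:
    f.close()  # runs whether or not an error occurred above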
The standard Python approach to dealing with files is the "with open as" idiom. Look at how I read the two first lines in our input_file.txt
.
with open('input_file.txt') as f:
    f.readline()
    f.readline()
The first line means we create an f
variable holding a file object. The file closes automatically as soon as the indented block of code ends.
This will be our bread-and-butter approach to file handling from now on.
A simple word counter
I'll first create a testbed toy program, and we'll refine it later. It'll
go over each line of the file
split it into a list of words
count the words in the list
sum up each line's word count
"""
my_wc -- A word counter program
"""
wc = 0

with open("input_file.txt", "r") as f:
    for line in f:
        word_list = line.split(' ')
        word_list_len = len(word_list)
        wc += word_list_len

print(wc)
Let's go through this line by line.
First, we create a word count variable, wc
. Since we haven't started counting, we initialize it with 0.
The next line is in the "with open as" idiom we saw earlier. I hard-coded the filename for now. The "r"
argument means we're opening the file in "read-only" mode. This parameter is not 100% necessary, as "r"
is the implied default for open
. Keeping code explicit is a good idea, so I included it here.
The body of the open block is a for loop. When looping over a text file, the code iterates line by line.
The next three lines each do one simple thing:
Split the line into words using the space character as a delimiter. This creates the word_list list.
Measure the number of words. This is the length of word_list, which I call word_list_len.
Add the number of words to the counter.
Now that we've seen how it works, we can make the loop body more concise:
wc += len(line.split(' '))
This example shows you can read a file without using the readline
method. It's enough to loop through it and call the split
method for each line.
Command-line arguments
I coded the name of the file in the source code. This is fine when exploring a simple concept, but it makes the program inflexible. You want to be able to use it with whatever text file you have.
Enter command-line arguments.
Let's say you have named the program my_wc.py
. You would invoke it from the command line like this:
% python3 my_wc.py
Command line arguments are extra arguments you pass to a program when you invoke it. Like this:
% python3 my_wc.py input_file.txt
Here, input_file.txt
is the argument.
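If you're curious about what the program itself receives, the arguments arrive as a plain list of strings. Here's a tiny illustrative script (we won't use this approach in our program; it's just to show the mechanism) using the built-in sys module:
# show_args.py -- print whatever arguments the program was given
import sys

print(sys.argv)  # e.g. ['show_args.py', 'input_file.txt']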
There are several different possible solutions for command-line arguments. I'll show you the recommended standard.
The argparse module
The argparse
module is part of Python's built-in modules.
It has a 3-part workflow:
Create a "parser."
Add arguments to your parser
Use the parser
I'll take time to define parsers, as you'll probably come across them later. Sometimes, a program has to make sense of some text. For example, the Python interpreter has to understand the programs' structure to run them. A parser might not seem useful if we have only one command line argument (the filename) in our program. We'll add some more arguments later, though.
Importing argparse and creating the parser
import argparse
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
When creating a parser, we only make some scaffolding. ArgumentParser's arguments are metadata for the program, e.g., the program's name, a description, etc. You can check out all the possible arguments in the argparse documentation.
Adding an argument to the parser
Now that we've got scaffolding in place in the form of the parser variable, we'll add a new argument.
parser.add_argument("filename",
type=str,
help="filename")
This means we now run our program from the command line like this.
% python3 my_wc.py myfilename.txt
Python will understand that myfilename.txt
is our parser's "filename"
argument. type=str
means that we'll treat the argument as a string. You can find the full list of parameters for the add_argument
method in the argparse documentation.
Using the parser
Now that our parser is in place, we will use it.
args = parser.parse_args()

wc = 0
with open(args.filename, "r") as f:
    for line in f:
        word_list = line.split(' ')
        word_list_len = len(word_list)
        wc += word_list_len

print(wc)
We call the parse_args
method on our parser. The results are the arguments the user gave when launching the program. We store them in the args variable. For example, if you launch the program on our test file, args.filename
is now "input_file.txt"
.
Let's test it.
% python3 my_wc.py input_file.txt
7
A nice side effect of using argparse is that it generates help pages.
% python3 my_wc.py -h
usage: my_wc.py [-h] filename
word count and other file manipulation
positional arguments:
filename filename
options:
-h, --help show this help message and exit
Using real-life data
We have used a small two-line text file. It's now time to pitch our code against real-world data. We'll see if our code is slow or if there's any strange case we haven't considered.
I'll use the text of Moby Dick, which you can download from Project Gutenberg.
If you're on Linux or Mac OS, you can download the ebook from the command line like this:
% curl https://www.gutenberg.org/cache/epub/2701/pg2701.txt -o moby_dick.txt
Our program is trying to mimic the Unix wc
utility. If you're using a Linux or Mac OS computer, you can double-check our program's results. For example, the two commands below should give a similar number of words.
% wc moby_dick.txt
22314 215838 1276288 moby_dick.txt
wc
outputs several numbers. Here, you must look at the second number for the word count (215838). Now to our program:
% python3 my_wc.py moby_dick.txt
220156
Huh? We end up with a higher word count than our reference implementation.
Can you find the bug?
While writing the program, I had a nagging feeling that something was wrong and would come back to bite me. You see, I split each line with word_list = line.split(' ')
. I used spaces as a separator. Consider the example below. First, I create a string containing whitespace (spaces and tabs) and a few lone letters.
>>> whitespace_string = "\t \t \t a a a"
The number of 'words' (I'm using a loose definition of a word here) is three: the three a's. But if we split the string using spaces as separators, we get a different picture. Tabs are on the same level as letters!
>>> split_1 = whitespace_string.split(' ')
>>> split_1
['\t', '\t', '\t', 'a', 'a', 'a']
We need to use all and every whitespace as a separator:
>>> split_2 = whitespace_string.split()
>>> split_2
['a', 'a', 'a']
The split
method without arguments does a good job. Let's include it in the program: word_list = line.split()
.
Try it. Now it works!
This bug shows the importance of good-quality testing material. We didn't catch the bug when we only had a homebrew two-line test file. Things changed when we used the whole Moby Dick text.
Adding a line counter
We'll add line counting since we now have a stable word counter.
We don't have to change much. When reading the file, we'll only add a count variable and increment it at each loop iteration.
lines = 0
wc = 0

with open(args.filename, "r") as f:
    for line in f:
        lines += 1
        word_list = line.split()
        word_list_len = len(word_list)
        wc += word_list_len

print(f"lines: {lines}\nwords: {wc}")
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
Counting the most frequent words
Generating a token list
Our program is already useful, but I want to go beyond that. I want to plot the most frequent words in a text. For that, we must first turn our input text into discrete words. The jargon term for this is tokenization. A token is a unit of text resulting from splitting a sequence of text. Tokens can be words, subwords, characters, etc. In our case, tokens are words. For example, the sentence "The green cat eats the purple rat." gives the token list ["The", "green", "cat", "eats", "the", "purple", "rat."]
.
Here are the relevant parts of our code.
tokens = []

with open(args.filename, "r") as f:
    for line in f:
        lines += 1
        word_list = line.split()
        word_list_len = len(word_list)
        wc += word_list_len
        tokens += word_list
And here's what the tokens list looks like if we call our program on its own source code. (the program output below is cropped)
['"""', 'wc', 'word', 'frequency', '"""', 'import', 'argparse',
///...///
Cleaning the tokens
As you can see, some tokens don't look like words at all ('"""', for example). This poses two problems:
we get non-words into the list ('"""')
some words could have several forms (capitalized vs non-capitalized, for example).
We want to purge the tokens
list from unwanted characters.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r')
    if cleaned_token:
        cleaned.append(cleaned_token)
This block of code
goes through each token in the list
strips it of unwanted characters
checks if there are still some characters left after stripping (if cleaned_token:)
if so, adds the clean token to the cleaned list
The strip
method returns a copy of the string with any leading and trailing characters listed in its argument string removed. Here, we remove commas, periods, colons, semicolons, backslashes, parentheses, single and double quotes, and whitespace characters (tab, newline, carriage return).
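Note that strip only touches the beginning and the end of a string; characters in the middle stay where they are. A quick check in the interpreter (with made-up tokens):
>>> '"whale,'.strip(',.;:\\()\'\"\t\n\r')
'whale'
>>> "isn't".strip(',.;:\\()\'\"\t\n\r')
"isn't"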
Below is the list of cleaned tokens from calling our program using its source code.
['wc', 'word', 'frequency', 'import', 'argparse', 'import', 'pprint', 'parser',
///...///
Counting occurrences for each token
Now, our list of tokens is somewhat clean. Next, we'll sort them. For that, we need a sorting value. This value is the number of occurrences of each token.
Let's create a dictionary, linking each token to its number of occurrences. To do this, we will:
go through each token in
cleaned
check whether there's already a corresponding entry in the dictionary
either increment an existing value or create a new entry and set its value to 1.
occurrences = {}
for clean_token in cleaned:
    if occurrences.get(clean_token):
        occurrences[clean_token] += 1
    else:
        occurrences[clean_token] = 1
The get
method of a dictionary (like in occurrences.get(clean_token)
) returns the value associated with a key, or None if the key isn't in the dictionary.
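Here's the behavior in the interpreter, with a made-up dictionary:
>>> occurrences = {'whale': 3}
>>> occurrences.get('whale')
3
>>> print(occurrences.get('ahab'))  # missing key: get returns None
None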
Here's a sample of our dictionary when we feed Moby Dick to the program:
///...///
'rates': 1,
'replenish': 1,
'reservoir': 3,
'good?': 1,
'hazards': 1,
'victory': 1,
'on—one': 1,
'serving': 2,
'pulsations': 1,
'truthfully': 1,
'obliterated': 1,
///...///
We chose a dictionary since token-occurrence pairs map well to a key-value structure. Tokens are the keys, and occurrences are the values. In the next section, we'll sort tokens by frequency of appearance. Because dictionaries aren't designed for sorting, we'll have to use lists again.
Sorting tokens by number of occurrences
We'll sort tokens in two distinct steps:
Create a list of sorted tokens without the token count.
Create a sorted list of token-count tuples.
Here's the first step:
sorted_tokens = sorted(occurrences, key=occurrences.get, reverse=True)
We start from the dictionary of occurrences
and use the values as a sorting key with occurrences.get
. This means that the tokens will appear in order of frequency, from most to least frequent.
Now, we'll use the sorted token list to associate each token with their count. The general structure would be [(token_string, occurrences), ...]
.
Here's the code:
sorted_occurrence_pairs = [(token, occurrences[token]) for token in sorted_tokens]
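To make the two steps concrete, here's what they do to a small made-up dictionary:
>>> occurrences = {'whale': 3, 'the': 10, 'ahab': 5}
>>> sorted_tokens = sorted(occurrences, key=occurrences.get, reverse=True)
>>> sorted_tokens
['the', 'ahab', 'whale']
>>> [(token, occurrences[token]) for token in sorted_tokens]
[('the', 10), ('ahab', 5), ('whale', 3)]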
Let's see how our code fares against real-life data with formatted output. To fit the output in a small terminal window, I want to see only the first twenty most frequent tokens.
print("The twenty most frequent words are...")
for word, occ in sorted_occurrence_pairs[0:20]:
print(f"{word:<30} appeared {occ} times")
Remember that you can use a special-purpose mini-language to format strings. {word:<30}
means that the word
variable will be left-aligned in a 30-character-wide field. Calling the program on the Moby Dick text file gives some reasonable-looking output.
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
The twenty most frequent words are...
the appeared 13872 times
of appeared 6671 times
and appeared 6070 times
to appeared 4582 times
a appeared 4550 times
in appeared 3963 times
that appeared 2829 times
his appeared 2456 times
it appeared 2023 times
I appeared 1840 times
with appeared 1707 times
is appeared 1700 times
was appeared 1627 times
as appeared 1608 times
he appeared 1522 times
for appeared 1419 times
all appeared 1412 times
this appeared 1278 times
at appeared 1238 times
by appeared 1160 times
Correcting a small bug during token normalization
Oops! I realized there's an "I" token in the last output.
I had forgotten to turn all the text to lowercase. This is important because we want a word <-> token equivalence. Let me explain with an example. Now, our code views "Cat", "cAt", "cat", "CAT", etc. as different tokens, each with its own occurrence count.
I was too lax during the normalization process. Normalization is when you turn tokens into a standard form.
It's a simple error to fix: when we create the list of cleaned
tokens, call the lower
method on tokens.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r').lower()
    if cleaned_token:
        cleaned.append(cleaned_token)
Re-run the program to see if it yields different results. For example, how often does the word ‘the’ appear now?
Creating the word frequency histogram
We want to display the most frequent words in graphical form. A good first step is to draft a mockup. We'll have the most frequent words on top of each other, with a number and a bar to their right.
word_1 : count_1 | ****************
word_2 : count_2 | ***********
word_3 : count_3 | *******
... ... ...
word_20 : count_20 | *
Generating word frequencies
We won't be able to use raw word counts for the bar lengths.
For example, in our copy of Moby Dick, the word "the" appears 14,521 times. A bar that long won't fit on a screen, so we will reduce the counts to frequencies.
twenty_freqs = [(token, occ/wc) for token, occ in sorted_occurrence_pairs[0:20]]
twenty_freqs
takes the twenty most frequent tokens and their occurrences. We divide each occurrence count by the total word count and bundle the result with the token. For our Moby Dick example, every frequency ends up being a small fraction below one.
[('the', 0.06727730983422753),
('of', 0.031092764017457537),
('and', 0.02968430026223371),
('a', 0.02160879919198658),
('to', 0.02154393572957496),
('in', 0.019380275947701517),
('that', 0.013375772570168367),
('his', 0.011642991502886424),
('it', 0.01057737747755261),
('i', 0.008524912202670522),
('with', 0.008163530054948619),
('but', 0.008066234861331184),
('as', 0.007992105190003615),
('is', 0.007955040354339828),
('he', 0.007922608623134018),
('was', 0.00758439199770198),
('for', 0.007473197490710626),
('all', 0.006773598717556686),
('this', 0.006412216569834784),
('at', 0.006111064780066532)]
You can't draw a chart bar from an unwieldy number like 0.007955040354339828. We'll turn the numbers into reasonable integers.
A low-effort transformation on our data is multiplying it by 1000 and rounding the result.
twenty_freqs = [(token, round(1000*occ/wc)) for token, occ in sorted_occurrence_pairs[0:20]]
Drawing a chart
The twenty_freqs
list of pairs doesn't give us all the info we need for display. Below is a rework that will be more useful.
twenty_freqs = [(token, occ, round(1000*occ/wc)) for token, occ in sorted_occurrence_pairs[0:20]]
As you see, we now have access to token
strings, their number of occurrences, and a bar length (round(1000*occ/wc)
).
def bar(bar_length):
    bar_string = ''
    for i in range(0, bar_length):
        bar_string = bar_string + "*"
    return bar_string

print("The twenty most frequent words are...")
for word, occ, bar_length in twenty_freqs:
    print(f"{word:<5}: {occ}|", end='')
    print(bar(bar_length))
Here's what the output looks like.
% python3 my_wc.py moby_dick.txt
lines: 22314
words: 215838
The twenty most frequent words are...
the : 14521|*******************************************************************
of : 6711|*******************************
and : 6407|******************************
a : 4664|**********************
to : 4650|**********************
in : 4183|*******************
that : 2887|*************
his : 2513|************
it : 2283|***********
i : 1840|*********
with : 1762|********
but : 1741|********
as : 1725|********
is : 1717|********
he : 1710|********
was : 1637|********
for : 1613|*******
all : 1462|*******
this : 1384|******
at : 1319|******
Refactoring the program
Our program works, but it's an ugly mess. It's time to refactor the code. Before we move on, it is a good idea to make sure we all have the same version of the program.
Command line arguments code
No real change for now.
import argparse
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
parser.add_argument("filename",
                    type=str,
                    help="filename")
args = parser.parse_args()
Reading the file
We did a lot of work inside the open
block. I like to do as little as possible while the file is in use. Our old code wasn't that bad, and our next version might even be less efficient, but I like keeping concerns separate. On one side, we do file I/O and only file I/O (reading inside an open block). On the other side, we process the data outside the open block.
with open(args.filename, "r") as f:
    text = f.read()
The read
method reads the whole file at once, unlike readline
, which reads only the next line.
Processing the data
Now, we have the text as a string in the text
variable. We will extract the number of lines, create a tokens
list, and calculate the number of words. It's now three easy lines of code. Much simpler!
lines = text.count('\n') # one newline char per line
tokens = text.split()
wc = len(tokens)
print(f"lines: {lines}\nwords: {wc}")
Token normalization
The code responsible for cleaning tokens can stay the same.
cleaned = []
for token in tokens:
    cleaned_token = token.strip(',.;:\\()\'\"\t\n\r').lower()
    if cleaned_token:
        cleaned.append(cleaned_token)
We can also use a simple list comprehension to get the same results.
cleaned = [token.strip(',.;:\\()\'\"\t\n\r').lower() for token in tokens if token.strip(',.;:\\()\'\"\t\n\r').lower()]
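As an aside, if calling strip and lower twice per token bothers you, the assignment expression (the := "walrus" operator, available since Python 3.8) lets the comprehension clean each token once and reuse the result. This is just an alternative sketch; the rest of the post doesn't depend on it.
cleaned = [clean for token in tokens
           if (clean := token.strip(',.;:\\()\'\"\t\n\r').lower())]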
Counting occurrences
Next, we can simplify the code counting token occurrences.
We'll use the get
method on the dictionary. This method can take an optional second argument: the value to return when the key is not found. For example, occurrences.get(clean_token, 0)
means "either get the value associated with the clean_token
key or return 0
if the key's not found".
occurrences = {}
for clean_token in cleaned:
    occurrences[clean_token] = occurrences.get(clean_token, 0) + 1
To recap, either add one to the existing number of occurrences for each cleaned token or set it to one.
Sorting by occurrence count
Before refactoring, we were using two distinct steps to do this:
Create a sorted list of tokens with sorted_tokens.
Create a sorted list of token-occurrence pairs with sorted_occurrence_pairs.
We can tighten the code into one single step.
sorted_frequency_pairs = sorted(occurrences.items(), key=lambda x: x[1], reverse=True)
The items
method, when you call it on a dictionary (like occurrences
), returns a list-like object. It's not an actual list, but for all intents and purposes, you can treat it like one.
In our case, each item is a (token, count) tuple. We index into it to extract the occurrence count and use it as the key for the sorted
function. In the code, this is key=lambda x: x[1]
.
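On the same toy dictionary we used earlier, the one-step version looks like this:
>>> occurrences = {'whale': 3, 'the': 10, 'ahab': 5}
>>> sorted(occurrences.items(), key=lambda x: x[1], reverse=True)
[('the', 10), ('ahab', 5), ('whale', 3)]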
Displaying the graph
Finally, here's the code for output.
twenty_rel_occurrences = [(token, freq, round(1000*freq/wc)) for token, freq in sorted_frequency_pairs[0:20]]
print("The twenty most frequent words are...")
for word, freq, bar_length in twenty_rel_occurrences:
print(f"{word:<5}: {freq:<6}|{'*'*bar_length}")
We can repeat a string by multiplying it by a number, like in "*" * bar_length
.
Saving results to a file
We started with files, so we'll end with files. Since we spent so much time creating a nicely formatted output, we'll want to keep it for later.
Specifying an output file
We have to ask the user for a file name to save our results in a file. You've already seen how to add a command line argument before. Let's adapt our code to make space for another one.
There's no change when creating the parser.
parser = argparse.ArgumentParser(
    prog="my_wc.py",
    description="word count and other file manipulation")
Here's the first command-line argument we created.
parser.add_argument("input_file",
type=str,
help="input file")
Now, we will do something very similar to get the output filename.
parser.add_argument("-o",
type=str,
required=False,
help="output file")
The behavior of this argument is different from the previous one. First, this will be optional, as the required
flag indicates. The "-o"
is a "prefix" for our filename. This is so that our parser knows that the string following it is the output arg.
You would call the program like this:
% python3 my_wc.py moby_dick.txt -o moby_report.txt
moby_dick.txt
is the input file, and moby_report.txt
is the output file.
As you can see, to add a new command line argument, you call the add_argument
method of the parser.
Finally, call parse_args
.
args = parser.parse_args()
Saving results
When referencing the input or output file, use args.input_file
or args.o
. In our program, that would be:
with open(args.input_file, "r") as f:
    text = f.read()
We can create a report and save it later if needed:
report = f"lines: {lines}\nwords: {wc}\n"
report = report + "The twenty most frequent words are...\n"
for word, freq, bar_length in twenty_freqs:
report = report + f"{word:<5}: {freq:<6}|{'*'*bar_length}\n"
If the user specified an output file, save the report
to it; else, print it to the screen.
if args.o is not None:
    with open(args.o, 'w') as f:
        f.write(report)
else:
    print(report)
We use "w"
as an argument to the open
function because we want to open the file in "write" mode. This contrasts with how we used open in "read" mode earlier.
The write
method takes as an argument the data to write. Here, f.write(report)
means you write the report
to f
.
It's all done! We now have a working command-line program that has a tangible impact on the world (the world = your hard drive).
Recap
Files as streams
Python treats files as streams of data. The data can be bytes (not covered in this post) or characters.
A stream is like a conveyor belt. You access data in sequence rather than all at once.
The general idea is that you don't jump around inside a file's content. You only access it little by little, in order. The seek method lets you get around this limitation. We didn't need it for our project, and I didn't want to overcomplicate the post, but you can read about it in the Python documentation.
Streaming avoids memory problems when accessing large files. A good real-life example is when streaming video over an internet connection. You want to start watching your content as soon as possible rather than after a long download.
Common stream-like interfaces are files, network connections, etc.
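As a small sketch of stream-style reading, here's how you could go through a large file a fixed number of characters at a time instead of all at once (the chunk size is arbitrary):
total_chars = 0
with open("moby_dick.txt") as f:
    while True:
        chunk = f.read(4096)   # at most 4096 characters per call
        if not chunk:          # empty string means end of file
            break
        total_chars += len(chunk)
print(total_chars)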
Opening files
You must open files before using their contents.
The open
function takes the filename as a string argument. For example, open("my_file.txt")
.
Among several possible optional arguments, open takes a file mode argument:
"r" to read a file
"w" to write to a file (it overwrites existing contents!)
"a" to append data to the end of a file
"x" to create a file
"b" to open a file containing binary data
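For example, here's the difference between "w" and "a" on a throwaway file (the filename is just for illustration):
with open("demo.txt", "w") as f:   # "w" starts from an empty file
    f.write("first line\n")
with open("demo.txt", "a") as f:   # "a" keeps what's already there
    f.write("second line\n")
# demo.txt now contains both lines; opening with "w" again would erase them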
After you open a file, you get a file object. This file object is usable only for the file mode you specified. If you opened it in read mode, you can't write to it.
You need to close each file you have opened. The basic way to do that is with the close
method, but there is a better way. Use the with open(...) as f:
idiom. You won't need to remember to close the file.
Reading from files
Once you've opened a file in read mode, there are several ways to read its contents.
The .readline() method reads a single line, including its trailing \n character.
Looping: for line in f: goes over each line in order, provided you named the file object f.
The .readlines() method returns a list of all lines.
The .read() method returns the whole file as a string.
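Here's a quick comparison on our two-line test file from the beginning of the post:
with open('input_file.txt') as f:
    all_lines = f.readlines()   # ['Hello, world!\n', 'This is a test input!\n']

with open('input_file.txt') as f:
    whole_text = f.read()       # 'Hello, world!\nThis is a test input!\n'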
Which method you use depends on the particular use case. In this post, we first used a looping construct. When the loop's body grew too large, we switched to reading the whole file at once and moved the logic out of the file-handling block.
Writing to files
Writing to files is straightforward: use the write
method on a file object. It takes the string you want to write to the file as an argument.
If you combine several lines, make sure each one ends with a newline character ('\n'). Otherwise, they will run together into one single enormous line.
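A small sketch of the difference (the filenames are made up for the example):
words = ["whale", "ahab", "sea"]

with open("run_on.txt", "w") as f:
    for word in words:
        f.write(word)          # run_on.txt contains "whaleahabsea"

with open("one_per_line.txt", "w") as f:
    for word in words:
        f.write(word + "\n")   # one word per line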
Command-line arguments
A command-line program is a program you launch inside and from a terminal window. It works in text mode. Explaining the command line in detail would be too big of a tangent here, but plenty of introductions exist if you want to know more.
We used the argparse
module to pass arguments to the program through the command line. It's part of Python's standard library, so you don't need to install anything.
Here's the three-step workflow we used:
Create a parser.
Add arguments to the parser.
Store the arguments in a variable.
We create a parser using the argparse.ArgumentParser
function. During this step, you can specify info about your program, such as the program's name or a description.
Add arguments to the parser, one by one, with the add_argument
method. Its first argument is the argument name (for example, in our program, "input_file"
and "-o"
). Optional arguments usually start with -
and are not required by default. You can make them required by setting the required=True
flag.
Finally, we call the parse_args
method on the parser and store the results in a variable. If we name this variable args
, we can access specific command-line arguments by their name. For example, args.arg_1
, args.arg_2
, args.arg_3
, etc.
Text tokenization
We call tokenization the act of subdividing a string into useable substrings. In our example, we created a list of words with the split
method. This method takes a separator string as an argument. For example, split(' ')
splits on spaces, split('\t')
splits on tabs, etc. If you don't provide any argument, you get the default behavior of splitting on all whitespace.
It can be useful to clean up your tokens after splitting. For example, by stripping them of unwanted characters with strip
or converting to lowercase with lower
.
Exercises
Count characters: Adapt the program to count how many characters are in the file. The number should include whitespace and punctuation. Print that number.
Show average words per line: Compute the average number of words per line and print it. Calculate the result using the line count and the total number of words.
Display the longest line: Find the longest line in the file (highest number of characters) and print it.
Add a --top N argument: Change the program to accept an optional command-line argument. It will hold the number of most frequent words to display in the histogram and final report. Good argument names to use are --top or -t. Refer to the argparse docs to learn how command-line arguments can have a default value.
Ignore common short words: Add a filter to ignore words shorter than 3 characters. How does this change the top 20 list?
Include punctuation in frequency counts: What if you don’t clean the tokens? Try printing a top-20 list without stripping or lowercasing. How messy does it get?
Filter out common “stop words”: Create a small list of common words (like “the”, “and”, “a”, “of”, etc.) and exclude them from your top-20 list. How does that affect the results?
Compare two files: Write a program version that takes two input files. It will compare which words are the most frequent in those files.
Show a reverse histogram: Print the bar chart in descending rows with the most frequent word at the bottom. Can you do it by reversing your loop?
Add a search feature: Add a new argument like
--search some_word
that prints how often a particular word appears in the text.