Build a Summarization POC in 5 Minutes

Kevin

Let's build a summarization tool in five minutes. Seriously.

We are at a point where it's simple to get started with AI/ML in your projects; cloud providers and open source make it possible. Hugging Face lets you quickly try pre-trained models for a particular use case.

We'll use BART from Meta via the transformers library from Hugging Face (follow the gif above).

Let's set up a Python virtual environment. Run the following command in your terminal:

python3 -m venv .venv

Activate the virtual environment with:

source .venv/bin/activate

Install the required dependencies by running:

pip install torch transformers

Note: you may need to use python instead of python3, and on Windows you may need to activate your environment with .venv\Scripts\activate instead.
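If you are not sure the activation worked, a quick check from Python can help. The sketch below relies on the standard library: inside a venv, sys.prefix differs from sys.base_prefix. The function name in_virtualenv is just a name picked for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints False, activate the environment again before running pip install.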

Create a file called app.py and paste the script we copied from Hugging Face (included below for reference).

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

Run the script.

python3 app.py

You should see output similar to the following:

$ python3 app.py
[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

Great, the sample script works on our local computer 🙂. You now have summarization at your fingertips!
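As the output above shows, the pipeline returns a list containing one dictionary with a summary_text key. If you only want the summary string, a small helper along these lines should do. This is a minimal sketch: extract_summary is a hypothetical name, and sample_output is a placeholder standing in for a real pipeline result.

```python
def extract_summary(output):
    """Return the summary string from a summarization pipeline result."""
    # The pipeline result is a list of dicts; an empty list means no summary.
    if not output:
        raise ValueError("The pipeline returned no results")
    return output[0]["summary_text"]


if __name__ == "__main__":
    # Placeholder with the same shape as summarizer(ARTICLE, ...) output.
    sample_output = [{"summary_text": "A short placeholder summary."}]
    print(extract_summary(sample_output))
```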

Now what if someone has a file they'd like to summarize? We can support that quite easily. Let's add a quick CLI so you can show this to someone and not feel terrible.

Install a few more dependencies for the CLI and file processing.

pip install click PyPDF2 python-docx

The updated script below uses the click package to accept command-line arguments. Click is a package for creating CLIs in a composable way with as little code as necessary. In this case, we'll accept either a --text option with plain text or a --file option with a path to a document to summarize. For now, the script handles only PDF files, Word documents (.docx), and plain text files (.txt). The max_length and min_length arguments bound the summary to between 30 and 140 tokens (not words), and the script prints the result to the console.
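The choice between the two inputs can be sketched on its own, without click or the model. This is an illustration only: choose_input is a hypothetical helper, and it assumes the same rule the script applies, namely that --text wins when both options are given.

```python
from typing import Optional


def choose_input(file: Optional[str], text: Optional[str]) -> str:
    """Decide which input source to use: 'text' or 'file'."""
    # Plain text takes priority, mirroring `text if text else ...`.
    if text:
        return "text"
    if file:
        return "file"
    raise ValueError("Provide either --text or --file")


if __name__ == "__main__":
    print(choose_input(file=None, text="Some text"))    # text
    print(choose_input(file="notes.txt", text=None))    # file
```

Raising an error when neither option is set gives the user a clearer message than letting the file reader fail later.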

from transformers import pipeline
from pathlib import Path
import click
import PyPDF2
import docx
import os


@click.command()
@click.option("--file", default=None, help="A file path to a document to summarize")
@click.option("--text", default=None, help="Text to summarize.")
def run_summarizer(file, text):
    """Simple program to summarize text"""
    try:
        input_text = text if text else input_file_to_text(file)

        summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

        output = summarizer(input_text, max_length=140, min_length=30, do_sample=False)

        if not output:
            raise IndexError("No output generated")

        result = output[0]["summary_text"]

        click.echo(f"\nSummary:\n{result}\n")
    except Exception as e:
        click.echo(f"There was an error generating a summary: {str(e)}")


def input_file_to_text(filepath) -> str:
    """
    Convert a file to text
    """
    if not filepath:
        raise ValueError("No file provided")
    if not os.path.exists(filepath):
        raise ValueError("File does not exist")

    file_extension = Path(filepath).suffix.lower()

    if file_extension == ".pdf":
        return pdf_to_text(filepath)
    elif file_extension == ".docx":
        return docx_to_text(filepath)
    elif file_extension == ".txt":
        return txt_to_text(filepath)
    else:
        raise ValueError(f"Unsupported file format: {file_extension}")


def pdf_to_text(file_path):
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
    return text


def docx_to_text(file_path):
    doc = docx.Document(file_path)
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text


def txt_to_text(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


if __name__ == "__main__":
    run_summarizer()
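One thing the script above does not handle is long documents. As far as I know, facebook/bart-large-cnn accepts at most 1024 input tokens, so a large file may fail or get cut off. A simple workaround is to split the text into smaller pieces and summarize each one. The sketch below splits on words rather than tokens, so the 400-word default is a conservative guess, not a measured limit, and chunk_words is a hypothetical helper name.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words.
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    print(chunk_words("one two three four five", max_words=2))
    # ['one two', 'three four', 'five']
```

Each chunk could then be passed to the summarizer in turn and the partial summaries joined together.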

We can test the changes with a file. I used the first page of one of my favorite books; you can download the file from GitHub here.

Run the script with the --file flag and a path to your file.

python3 app.py --file candide-1.pdf

If you used the same file, you should see similar output.

$ python3 app.py --file candide-1.pdf

Summary:
Candide was brought up in the castle of the most noble Baronof Thunder–ten–tronckh. He was the son of the Baron’s sister, by avery good sort of a gentleman of the neighborhood, whom that younglady refused to marry, because he could produce no more thanthreescore and eleven quarterings in his arms.

You'll notice the summarization is not perfect. This is where the real work begins. A POC like this facilitates starting conversations, asking questions, and collaborating with others to determine what is useful and valuable to have as features of a product or solution.
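One visible flaw in the output above is run-together words such as "Baronof" and "avery". A plausible cause, though not verified here, is the line breaks in the text extracted from the PDF. Collapsing all whitespace into single spaces before summarizing is a cheap thing to try. The helper name normalize_whitespace is hypothetical.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_whitespace("Baron\nof  Thunder"))  # Baron of Thunder
```

This does not rejoin words that were hyphenated across lines, so it is a first step rather than a full cleanup.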

Any feedback is welcome to help me improve these articles 😊.
