Building a Word Frequency Counter in Java: A Step-by-Step Guide
Introduction
The "Word Frequency Counter" is a Java project that allows you to analyze a given text and determine the frequency of each word it contains. This project serves as an excellent introduction to Java programming concepts, including string manipulation, arrays, and basic data structures.
Problem Statement
When working with textual data, understanding word frequency can be crucial for various applications, from text analysis to content optimization. The problem we aim to solve in this project is to create a Java application that takes a text input, analyzes it, and displays the frequency of each unique word present in the text.
Algorithm
The Word Frequency Counter project involves several key steps:
User Input: We prompt the user to enter text. The user can enter multiple lines, which we concatenate into a single string, and types 'exit' on a line of its own to finish.
Tokenization: We strip punctuation and digits with a regular expression so that only letters and word separators remain, convert the text to lowercase, and split it on runs of whitespace into individual words.
Word Frequency Count: We count the occurrences of each unique word in the input text. We maintain two parallel arrays, one for unique words and one for word counts, so that the count at index i belongs to the word at index i.
Sorting: To display the word frequencies in descending order, we implement a basic exchange sort. We iterate through the counts and, whenever a later count is larger, swap both the counts and the corresponding words so that the two arrays stay aligned.
Display: Finally, we display the word frequencies, showing each word and its count in descending order.
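The tokenization step is the easiest one to get subtly wrong, so it can help to try it in isolation. The following is a minimal sketch rather than the article's exact code: the class name TokenizeSketch and the method tokenize are illustrative assumptions, and the pattern keeps all whitespace (not just spaces) so that tab-separated words do not run together.

```java
import java.util.Arrays;

class TokenizeSketch {
    // Keep only ASCII letters and whitespace, lowercase the result,
    // and split on runs of whitespace.
    static String[] tokenize(String text) {
        String cleaned = text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase().trim();
        // Splitting an empty string would yield one empty token,
        // so return an empty array instead.
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Hello, world! Hello.")));
        // prints [hello, world, hello]
    }
}
```

Note that this approach also removes apostrophes and digits, so "don't" becomes "dont" and a token such as "2024" disappears entirely. That is usually acceptable for a first version, but worth knowing when reading the results.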
Pseudocode
Initialize an empty string inputText
While user input is not 'exit':
    Read a line of text from the user and append it to inputText
Tokenize inputText into an array of words and remove punctuation
Initialize arrays: uniqueWords, wordCounts, uniqueWordCount = 0
For each word in words:
    If word is not in uniqueWords:
        Add word to uniqueWords
        Initialize wordCounts[uniqueWordCount] to 1
        Increment uniqueWordCount
    Else:
        Increment the corresponding word count in wordCounts
Sort uniqueWords and wordCounts in descending order based on wordCounts
Display the sorted word frequencies
Implemented Code
import java.util.Scanner;

public class WordFrequencyCounter {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        System.out.println("Word Frequency Counter");
        System.out.println("Enter text (or type 'exit' to quit):");

        // Read lines until the user types 'exit' or the input stream ends
        String inputText = "";
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line.equalsIgnoreCase("exit")) {
                break;
            }
            inputText += line + " ";
        }
        if (inputText.trim().isEmpty()) {
            System.out.println("No input provided. Exiting...");
            return;
        }

        // Remove everything except letters and whitespace, then lowercase.
        // trim() avoids an empty first token when the text starts with whitespace.
        String cleanedText = inputText.replaceAll("[^a-zA-Z\\s]", "").toLowerCase().trim();
        if (cleanedText.isEmpty()) {
            System.out.println("No words found in the input. Exiting...");
            return;
        }

        // Tokenize the cleaned text into words
        String[] words = cleanedText.split("\\s+");

        // Count word frequencies using two parallel arrays:
        // wordCounts[i] holds the count for uniqueWords[i]
        String[] uniqueWords = new String[words.length];
        int[] wordCounts = new int[words.length];
        int uniqueWordCount = 0;
        for (String word : words) {
            boolean isUnique = true;
            for (int i = 0; i < uniqueWordCount; i++) {
                if (uniqueWords[i].equals(word)) {
                    wordCounts[i]++;
                    isUnique = false;
                    break;
                }
            }
            if (isUnique) {
                uniqueWords[uniqueWordCount] = word;
                wordCounts[uniqueWordCount] = 1;
                uniqueWordCount++;
            }
        }

        // Sort word frequencies in descending order
        for (int i = 0; i < uniqueWordCount - 1; i++) {
            for (int j = i + 1; j < uniqueWordCount; j++) {
                if (wordCounts[i] < wordCounts[j]) {
                    // Swap word frequencies
                    int tempCount = wordCounts[i];
                    wordCounts[i] = wordCounts[j];
                    wordCounts[j] = tempCount;
                    // Swap corresponding words so both arrays stay aligned
                    String tempWord = uniqueWords[i];
                    uniqueWords[i] = uniqueWords[j];
                    uniqueWords[j] = tempWord;
                }
            }
        }

        // Display word frequencies
        System.out.println("\nFrequency of each word:");
        for (int i = 0; i < uniqueWordCount; i++) {
            System.out.println("- " + uniqueWords[i] + ": " + wordCounts[i]);
        }
    }
}
Dry Run
To help you understand how the program works, let's perform a dry run with a sample input:
Input
This is a sample text. It contains several words. This text is used for testing the Word Frequency Counter.
Preprocessing: The input text is converted to lowercase and punctuation is removed.
this is a sample text it contains several words this text is used for testing the word frequency counter
Tokenization: The preprocessed text is split into individual words (tokens).
['this', 'is', 'a', 'sample', 'text', 'it', 'contains', 'several', 'words', 'this', 'text', 'is', 'used', 'for', 'testing', 'the', 'word', 'frequency', 'counter']
Counting Word Frequencies: The program counts the frequency of each word.
{'this': 2, 'is': 2, 'a': 1, 'sample': 1, 'text': 2, 'it': 1, 'contains': 1, 'several': 1, 'words': 1, 'used': 1, 'for': 1, 'testing': 1, 'the': 1, 'word': 1, 'frequency': 1, 'counter': 1}
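Displaying the Results: The program prints every unique word with its count, most frequent first. Words with equal counts do not necessarily keep their order of first appearance, because the swap-based sort is not stable (here "text" is swapped with "a", which moves "a" behind "sample").
- this: 2
- is: 2
- text: 2
- sample: 1
- a: 1
- it: 1
- contains: 1
- several: 1
- words: 1
- used: 1
- for: 1
- testing: 1
- the: 1
- word: 1
- frequency: 1
- counter: 1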
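The counts in this dry run can be cross-checked with a short, separate program. The sketch below deliberately uses a different technique from the article's parallel arrays: a LinkedHashMap for counting and List.sort for ordering. The class name DryRunCheck and its method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DryRunCheck {
    // Count words with a LinkedHashMap so keys keep first-appearance order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String cleaned = text.replaceAll("[^a-zA-Z\\s]", "").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            return counts;
        }
        for (String word : cleaned.split("\\s+")) {
            // Insert 1 for a new word, otherwise add 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Sort entries by count, highest first.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        String sample = "This is a sample text. It contains several words. "
                + "This text is used for testing the Word Frequency Counter.";
        for (Map.Entry<String, Integer> entry : sortByCount(countWords(sample))) {
            System.out.println("- " + entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because List.sort is stable, tied words stay in first-appearance order here. The swap-based sort in the array version gives no such guarantee, so the order among words with equal counts can differ between the two programs even though the counts themselves match.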
Conclusion
The Word Frequency Counter project demonstrates how to analyze text using Java programming fundamentals. It offers valuable insights into string manipulation, arrays, and basic data structures. You can further enhance this project by adding features like excluding common words or analyzing larger text datasets.
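As one possible starting point for excluding common words, the sketch below filters tokens against a small stop-word set before counting. The class StopWordFilter, the method removeStopWords, and the word list itself are hypothetical and kept tiny for illustration; a practical list would be much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilter {
    // Hypothetical, deliberately short stop-word list (an assumption, not a standard list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "it", "for", "of", "and"));

    // Return the tokens that are not in STOP_WORDS, preserving their order.
    static String[] removeStopWords(String[] words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] words = {"this", "is", "a", "sample", "text"};
        System.out.println(Arrays.toString(removeStopWords(words)));
        // prints [this, sample, text]
    }
}
```

In the counter, a call such as words = StopWordFilter.removeStopWords(words) would go between tokenization and the counting loop. The tokens are already lowercase at that point, so the set only needs lowercase entries.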
By following the algorithm and pseudocode, you can adapt and expand this project to suit your needs, making it a versatile tool for text analysis and exploration in the Java programming world.
Written by Mahesh Kamalakar