A Data Science Case Study: Petition Topic & Priority Analysis

Executive Summary

In Task 1, I built a petition topic classifier using TF-IDF and metadata features. After data cleaning, entity extraction, and resampling, a linear SVM achieved strong performance, meeting the misclassification threshold. Task 2 focused on predicting petition importance: I manually labelled over 100 samples and trained multiple models on text and numeric features, with XGBoost emerging as the top performer and outperforming the baseline. Ethical risks such as representation and framing hazards were also considered. Both tasks produced technically sound results while highlighting the subjectivity and ethical complexity of modelling social and civic data.

Data Exploration & Assessment

I began by exploring the dataset to understand the frequency and distribution of the petition topics, the quality of the ‘petition_text’ field, and the binary feature ‘has_entity’. The classes are imbalanced: the largest topic is ‘environment and animal welfare’ (2,289 petitions) and the smallest is ‘london’ (60). Each topic also originally appeared in two forms, lower case and sentence case, so the same label was tracked twice. The ‘petition_text’ field contained empty values as well as ‘empty_test_petition’ placeholder rows. This exploratory analysis shaped my preprocessing steps.
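
To make these checks concrete, here is a minimal sketch of the exploration, assuming the data sits in a CSV with the column names used in this post (the file name is illustrative):

```python
import pandas as pd

# Load the petitions dataset (file name is illustrative).
df = pd.read_csv("petitions.csv")

# Topic distribution: exposes the imbalance between the largest topic
# ('environment and animal welfare') and the smallest ('london'), and the
# lower-case / sentence-case duplicates of each label.
print(df["petition_topic"].value_counts())

# Quality of the text field: empty values and 'empty_test_petition' rows.
print(df["petition_text"].isna().sum())
print((df["petition_text"] == "empty_test_petition").sum())

# Distribution of the entity flag.
print(df["has_entity"].value_counts())
```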

Figure: Distribution of the petition topics.

Figure: Original topic categories before transformation.

Data Cleaning & Splitting

After the data exploration, I cleaned the dataset by removing null values, duplicated rows, and unnecessary symbols, and by dropping the ‘empty_test_petition’ rows from the petition_text field. I then split the data into training and test sets (80/20) so that model evaluation reflects performance on unseen data; comparing training and test accuracy also helped monitor overfitting.
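
A minimal sketch of these steps, continuing from the DataFrame above (the symbol-stripping regex and random seed are my own illustrative choices, not necessarily those used in the project):

```python
from sklearn.model_selection import train_test_split

# Remove null rows, duplicates, and the placeholder petitions.
df = df.dropna().drop_duplicates()
df = df[df["petition_text"] != "empty_test_petition"]

# Strip unnecessary symbols, keeping word characters and whitespace.
df["petition_text"] = df["petition_text"].str.replace(r"[^\w\s]", " ", regex=True)

# Merge the lower-case / sentence-case topic variants found earlier.
df["petition_topic"] = df["petition_topic"].str.lower()

# 80/20 train/test split; stratifying keeps the topic mix comparable.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["petition_topic"]
)
```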

To address class imbalance, Task 1 used random over-sampling, because some topics had too few samples for synthetic methods to work reliably, while Task 2 used SMOTE.
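
A sketch of both strategies using imbalanced-learn; here `X_train_vec` and `y_train` stand for the numeric feature matrix and labels produced by the encoding step described in the next section, and the Task 2 variable names are placeholders:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Task 1: random over-sampling duplicates existing minority-topic rows,
# which still works when a topic has only a handful of samples.
ros = RandomOverSampler(random_state=42)
X_bal, y_bal = ros.fit_resample(X_train_vec, y_train)

# Task 2: SMOTE instead synthesises new minority-class points by
# interpolating between neighbours, which needs enough samples per class.
smote = SMOTE(random_state=42)
X_bal2, y_bal2 = smote.fit_resample(X_train_vec_task2, y_train_task2)
```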

Data Encoding

The text from ‘petition_text’ was transformed with TF-IDF vectorisation to capture the relative importance of words, while the ‘has_entity’ column was split into three binary columns so that each entity signal is explicit and easy for the model to digest. The petition_topic categories were also encoded. This let the model combine the nuances of the petition narratives with clear entity signals and labels, which improved its predictive performance.
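
A sketch of the Task 1 encoding; the exact format of ‘has_entity’ is not shown in this post, so the split below assumes it names the entity types present in each petition:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# TF-IDF weighs each word by how distinctive it is across the corpus.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_text = vectorizer.fit_transform(train_df["petition_text"])

# Split 'has_entity' into three explicit binary columns (assumes the
# column lists entity types, e.g. "event;date"; adjust to the real format).
for ent in ("event", "date", "person"):
    train_df[f"has_{ent}"] = (
        train_df["has_entity"].astype(str).str.contains(ent, case=False).astype(int)
    )

# Combine the sparse text features with the dense entity flags.
X_train_vec = hstack(
    [X_text, train_df[["has_event", "has_date", "has_person"]].values]
)

# Encode the topic labels as integers.
y_train = LabelEncoder().fit_transform(train_df["petition_topic"])
```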

Task 2 reused the same building blocks: label encoding, TF-IDF vectorisation, and feature engineering. In addition, target variable mapping was used to convert petition_importance into numeric labels, as summarised in the table below.

| Column Name | Preprocessing Technique | Rationale |
| --- | --- | --- |
| petition_topic | Label encoding | Converts categorical labels into numbers. |
| petition_text | TF-IDF vectorisation | Captures how important each word is within the text. |
| has_entity (has_event) | Feature engineering | Splits the entity flag into explicit columns, giving the algorithm more detailed information. |
| has_entity (has_date) | Feature engineering | As above; clear and straightforward for the model to use. |
| has_entity (has_person) | Feature engineering | As above. |
| petition_importance | Target variable mapping | Makes the target compatible with the algorithm's classifiers. |
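
For the target variable mapping, a minimal sketch; the importance label names themselves are an assumption for illustration, since the post does not list them:

```python
# Map ordinal importance labels to integers the classifiers accept.
# Label names are assumed for illustration, not taken from the project.
importance_map = {"low": 0, "medium": 1, "high": 2}
df["petition_importance"] = df["petition_importance"].map(importance_map)
```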

Here is the link to my GitHub repo.
