A Data Science Case Study: Petition Topic & Priority Analysis

Executive Summary

In Task 1, I built a petition topic classifier using TF-IDF and metadata features. After data cleaning, entity extraction, and resampling, a linear SVM achieved strong performance, meeting the misclassification threshold. Task 2 focused on predicting petition importance: I manually labelled over 100 samples and trained multiple models on text and numeric features, with XGBoost emerging as the top performer and outperforming the baseline. Ethical risks such as representation and framing hazards were also considered. Both tasks produced technically sound results while highlighting the subjectivity and ethical complexity of modelling social and civic data.

Data Exploration & Assessment

I began by exploring the dataset to understand the frequency and distribution of the petition topics, the quality of the ‘petition_text’ field, and the binary feature ‘has_entity’. The classes are imbalanced: the largest topic is ‘environment and animal welfare’ (2,289 petitions) and the smallest is ‘london’ (60). Each topic also originally appeared in two forms, lower case and sentence case, so the same label was tracked twice. The ‘petition_text’ field contained empty values as well as ‘empty_test_petition’ placeholder rows. This exploratory analysis shaped my preprocessing steps.
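
To make these checks concrete, here is a minimal sketch of the exploration, assuming the data sits in a CSV with the column names used in this post (the file name is illustrative):

```python
import pandas as pd

# Load the petitions dataset (file name is illustrative).
df = pd.read_csv("petitions.csv")

# Topic distribution: exposes the imbalance between the largest topic
# ('environment and animal welfare') and the smallest ('london'), and the
# lower-case / sentence-case duplicates of each label.
print(df["petition_topic"].value_counts())

# Quality of the text field: empty values and 'empty_test_petition' rows.
print(df["petition_text"].isna().sum())
print((df["petition_text"] == "empty_test_petition").sum())

# Distribution of the entity flag.
print(df["has_entity"].value_counts())
```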

Figure: Distribution of the petition topics.

Figure: Original topic categories before transformation.

Data Cleaning & Splitting

After the data exploration, I cleaned the dataset by removing null values, duplicated rows, and unnecessary symbols, and by dropping the ‘empty_test_petition’ rows from the petition_text field. I then split the data into training and test sets (80/20) so that model evaluation reflects performance on unseen data; comparing training and test accuracy also helped monitor overfitting.
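
A minimal sketch of these steps, continuing from the DataFrame above (the symbol-stripping regex and random seed are my own illustrative choices, not necessarily those used in the project):

```python
from sklearn.model_selection import train_test_split

# Remove null rows, duplicates, and the placeholder petitions.
df = df.dropna().drop_duplicates()
df = df[df["petition_text"] != "empty_test_petition"]

# Strip unnecessary symbols, keeping word characters and whitespace.
df["petition_text"] = df["petition_text"].str.replace(r"[^\w\s]", " ", regex=True)

# Merge the lower-case / sentence-case topic variants found earlier.
df["petition_topic"] = df["petition_topic"].str.lower()

# 80/20 train/test split; stratifying keeps the topic mix comparable.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["petition_topic"]
)
```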

To address class imbalance, Task 1 used random over-sampling, because some topics had too few samples for synthetic methods to work reliably, while Task 2 used SMOTE.
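
A sketch of both strategies using imbalanced-learn; here `X_train_vec` and `y_train` stand for the numeric feature matrix and labels produced by the encoding step described in the next section, and the Task 2 variable names are placeholders:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Task 1: random over-sampling duplicates existing minority-topic rows,
# which still works when a topic has only a handful of samples.
ros = RandomOverSampler(random_state=42)
X_bal, y_bal = ros.fit_resample(X_train_vec, y_train)

# Task 2: SMOTE instead synthesises new minority-class points by
# interpolating between neighbours, which needs enough samples per class.
smote = SMOTE(random_state=42)
X_bal2, y_bal2 = smote.fit_resample(X_train_vec_task2, y_train_task2)
```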

Data Encoding

The text from ‘petition_text’ was transformed with TF-IDF vectorisation to capture the relative importance of words, while the ‘has_entity’ column was split into three binary columns so that each entity signal is explicit and easy for the model to digest. The petition_topic categories were also encoded. This let the model combine the nuances of the petition narratives with clear entity signals and labels, which improved its predictive performance.
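
A sketch of the Task 1 encoding; the exact format of ‘has_entity’ is not shown in this post, so the split below assumes it names the entity types present in each petition:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# TF-IDF weighs each word by how distinctive it is across the corpus.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_text = vectorizer.fit_transform(train_df["petition_text"])

# Split 'has_entity' into three explicit binary columns (assumes the
# column lists entity types, e.g. "event;date"; adjust to the real format).
for ent in ("event", "date", "person"):
    train_df[f"has_{ent}"] = (
        train_df["has_entity"].astype(str).str.contains(ent, case=False).astype(int)
    )

# Combine the sparse text features with the dense entity flags.
X_train_vec = hstack(
    [X_text, train_df[["has_event", "has_date", "has_person"]].values]
)

# Encode the topic labels as integers.
y_train = LabelEncoder().fit_transform(train_df["petition_topic"])
```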

Task 2 reused the same building blocks: label encoding, TF-IDF vectorisation, and feature engineering. In addition, target variable mapping was used to convert petition_importance into numeric labels, as summarised in the table below.

| Column Name | Preprocessing Technique | Rationale |
| --- | --- | --- |
| petition_topic | Label encoding | Converts categorical labels into numbers. |
| petition_text | TF-IDF vectorisation | Captures how important each word is within the text. |
| has_entity (has_event) | Feature engineering | Splits the entity flag into explicit columns, giving the algorithm more detailed information. |
| has_entity (has_date) | Feature engineering | As above; clear and straightforward for the model to use. |
| has_entity (has_person) | Feature engineering | As above. |
| petition_importance | Target variable mapping | Makes the target compatible with the algorithm's classifiers. |
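
For the target variable mapping, a minimal sketch; the importance label names themselves are an assumption for illustration, since the post does not list them:

```python
# Map ordinal importance labels to integers the classifiers accept.
# Label names are assumed for illustration, not taken from the project.
importance_map = {"low": 0, "medium": 1, "high": 2}
df["petition_importance"] = df["petition_importance"].map(importance_map)
```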

Here is the link to my GitHub repo.
