A Data Science Case Study: Petition Topic & Priority Analysis


Executive Summary
In Task 1, I built a petition topic classifier using TF-IDF and metadata features. After data cleaning, entity extraction, and resampling, a linear SVM achieved strong performance, meeting the misclassification threshold. Task 2 focused on predicting petition importance: I manually labelled over 100 samples and trained multiple models on text and numeric features, with XGBoost emerging as the top performer and outperforming the baseline. Ethical risks such as representation and framing hazards were also considered. Both tasks produced technically sound results while highlighting the subjectivity and ethical complexity of modelling social and civic data.
Data Exploration & Assessment
I began by exploring the dataset to understand the frequency distribution of the petition topics, the quality of the ‘petition_text’ field, and the ‘has_entity’ feature. The classes are imbalanced: the most frequent topic is ‘environment and animal welfare’ (2,289 petitions) and the least frequent is ‘london’ (60). Each topic also originally appeared in two forms, lower case and sentence case, so the same category was tracked twice. The ‘petition_text’ field contained empty values as well as the placeholder string ‘empty_test_petition’. This exploratory analysis shaped my preprocessing steps; a sketch of these checks follows the figure below.
Figure: distribution of petition topics (original categories before transformation).
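A minimal sketch of these exploratory checks, assuming the data lives in a CSV with the column names used above (the file name petitions.csv is illustrative):

```python
import pandas as pd

# Load the petitions dataset (file name is illustrative).
df = pd.read_csv("petitions.csv")

# Class distribution: shows the imbalance between topics such as
# 'environment and animal welfare' (largest) and 'london' (smallest).
print(df["petition_topic"].value_counts())

# Case inconsistency: the same topic appears in lower case and sentence
# case, so counting on a lower-cased copy collapses the duplicates.
print(df["petition_topic"].str.lower().value_counts())

# Text quality: empty values and the 'empty_test_petition' placeholder.
print(df["petition_text"].isna().sum())
print((df["petition_text"] == "empty_test_petition").sum())
```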
Data Cleaning & Splitting
After the data exploration, I cleaned the dataset by removing null values, duplicate rows, and unnecessary symbols, and by dropping the ‘empty_test_petition’ placeholder from the petition_text field. I then split the data into training and test sets (80/20) so that model evaluation reflects real-world performance; comparing training and testing accuracies against this split also helps monitor overfitting. A sketch of these steps appears below.
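The following sketch illustrates the cleaning and splitting steps under the same assumptions as above; the symbol-stripping regex and the random_state are illustrative choices, not necessarily the exact ones used:

```python
from sklearn.model_selection import train_test_split

# Drop rows with missing text or topic, then drop exact duplicates.
df = df.dropna(subset=["petition_text", "petition_topic"]).drop_duplicates()

# Remove the 'empty_test_petition' placeholder rows.
df = df[df["petition_text"] != "empty_test_petition"]

# Strip unnecessary symbols (illustrative regex: keep word chars and spaces).
df["petition_text"] = df["petition_text"].str.replace(r"[^\w\s]", " ", regex=True)

# Normalise topic case so lower-case and sentence-case variants merge.
df["petition_topic"] = df["petition_topic"].str.lower().str.strip()

# 80/20 split, stratified so rare topics appear in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    df["petition_text"], df["petition_topic"],
    test_size=0.2, stratify=df["petition_topic"], random_state=42,
)
```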
For resampling, Task 1 used random over-sampling because some topics had too few samples for SMOTE’s nearest-neighbour interpolation to work reliably, while Task 2 used SMOTE to address its class imbalance. In both cases resampling was applied to the training set only, as in the sketch below.
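A hedged sketch of both strategies using imbalanced-learn; here X_train_features stands for the encoded training matrix built in the next section, and y_train_importance is a placeholder name for the Task 2 target:

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Task 1: random over-sampling simply duplicates minority-topic rows,
# which still works when a class has only a handful of samples.
ros = RandomOverSampler(random_state=42)
X_task1, y_task1 = ros.fit_resample(X_train_features, y_train)

# Task 2: SMOTE synthesises new minority samples by interpolating between
# nearest neighbours (y_train_importance is a placeholder for the Task 2 target).
smote = SMOTE(random_state=42)
X_task2, y_task2 = smote.fit_resample(X_train_features, y_train_importance)
```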
Data Encoding
The text from ‘petition_text’ was transformed using TF-IDF vectorization to capture the relative importance of words, while the ‘has_entity’ column was split into three binary columns so that each entity signal is exposed to the model directly. The petition_topic categories were also label-encoded. This let the model combine the nuances of the petition narratives with clear entity signals and labels, improving its predictive performance; a sketch follows.
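The sketch below shows one way to combine TF-IDF text features with the entity flags. The storage format of has_entity is not specified in the write-up, so the delimited-string split shown is a hypothetical reconstruction:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical split of has_entity into three binary flags, assuming it
# stores a delimited string such as "event;person" (format not given).
for name in ["event", "date", "person"]:
    df[f"has_{name}"] = df["has_entity"].astype(str).str.contains(name).astype(int)

# Fit TF-IDF on the training text only so no test-set vocabulary leaks in;
# max_features is an illustrative cap.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_text_train = vectorizer.fit_transform(X_train)
X_text_test = vectorizer.transform(X_test)

# Concatenate the sparse TF-IDF matrix with the dense entity flags.
entity_cols = ["has_event", "has_date", "has_person"]
X_train_features = hstack([X_text_train, df.loc[X_train.index, entity_cols].values])
X_test_features = hstack([X_text_test, df.loc[X_test.index, entity_cols].values])
```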
For Task 2, I reused label encoding, TF-IDF vectorization, and the engineered entity features; in addition, target variable mapping was used to encode petition_importance. The table below summarises the techniques, and a sketch of the target encoding follows it.
| Column Name | Preprocessing Technique | Rationale |
| --- | --- | --- |
| petition_topic | Label encoding | Converts the categorical topic labels into numbers. |
| petition_text | TF-IDF vectorization | Captures word importance in the text. |
| has_entity (has_event, has_date, has_person) | Feature engineering | Splits the entity flag into three clear, straightforward signals for the model to use. |
| petition_importance | Target variable mapping | Makes the target compatible with the algorithms’ classifiers. |
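To make the target encodings in the table concrete, here is a minimal sketch; the importance class names (low/medium/high) are illustrative, since the actual labels are not given in the write-up:

```python
from sklearn.preprocessing import LabelEncoder

# Task 1 target: encode the topic strings as integers.
topic_encoder = LabelEncoder()
y_train_enc = topic_encoder.fit_transform(y_train)
y_test_enc = topic_encoder.transform(y_test)   # reuse the fitted encoder

# Task 2 target: explicit mapping of the manually labelled importance
# classes to integers (class names here are illustrative).
importance_map = {"low": 0, "medium": 1, "high": 2}
df["petition_importance_enc"] = df["petition_importance"].map(importance_map)
```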
Here is the link to my GitHub repo.
Written by Renisa Mangal-King
Passionate and driven data science student with a focus on leveraging data analytics and engineering to solve complex business problems, particularly within the fintech sector. With a strong foundation in modern data stacks, statistical modeling, and machine learning, I am dedicated to transforming raw data into actionable insights that drive growth and innovation.