Task 2: Key Insight Extraction


Ethical discussion
Labelling petition importance raises ethical risks related to fairness, privacy, and value-laden judgments. Using the frameworks, the following risks were identified:
Reinforces Existing Bias: If certain groups are underrepresented in the dataset, the model may prioritise the groups with the most data points, producing inequitable outcomes for those with fewer.
Lacks Community Involvement: Labels may be deemed “unimportant” without the affected people being contacted or consulted. Excluding relevant parties can overlook key perspectives, leading to incomplete or biased prioritisation.
Uncertain Accuracy of Source Data: Underrepresentation often stems from incomplete or unverified data (e.g., missing data from certain demographics). This label applies because it highlights the risk of relying on unreliable data.
Data labelling
I manually labelled over 100 petitions using a set of criteria to separate what is ‘important’ from what is ‘not_important’. For instance, petitions that referenced national impact or frequently debated topics were labelled ‘important’. Petitions with unclear objectives, self-serving aims, or little public relevance were marked ‘not_important’. Some petitions were context-dependent, and their importance could shift depending on the time or audience; this could introduce labelling bias.
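As a rough illustration of how such criteria could be operationalised (the keyword cues below are hypothetical, not the actual annotation guidelines), a provisional label could be suggested before manual review:

```python
# Hypothetical keyword cues for 'important' petitions; the real labelling
# was done manually against the criteria described above.
IMPORTANT_CUES = {"national", "nhs", "parliament", "economy"}

def suggest_label(text: str) -> str:
    """Suggest a provisional label for a petition, pending manual review."""
    lowered = text.lower()
    if any(cue in lowered for cue in IMPORTANT_CUES):
        return "important"
    return "not_important"

print(suggest_label("Increase NHS funding across the country"))  # important
```

Even with cues like these, a human still has to review borderline cases, which is where the labelling bias noted above creeps in.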
Model building and evaluation
To complete the petition importance classification task, I built and evaluated four supervised learning models: Random Forest, XGBoost, Logistic Regression, and Linear Support Vector Machine. All models were trained on a feature set combining TF-IDF representations of the petition text with engineered features including has_event, has_date, has_person, deviation_across_regions, and text_length. These were chosen based on their potential relevance to the petition’s public resonance and structural characteristics. Due to the small size of the labelled dataset, I applied SMOTE oversampling to the training set to balance the classes before training.
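A minimal sketch of that feature pipeline, with toy petitions and hand-filled engineered features (the real values would come from upstream extraction steps); note that plain direct fitting stands in here for the imblearn SMOTE step the original pipeline used:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy petitions; in the real pipeline the engineered columns
# (has_event, has_date, has_person, deviation_across_regions, text_length)
# are produced by earlier extraction steps.
texts = [
    "Increase NHS funding nationwide",
    "Rename my local park",
    "Ban single-use plastics across the UK",
    "Give me a parking permit",
]
engineered = np.array([
    [1, 0, 0, 0.8, 31],
    [0, 0, 0, 0.1, 20],
    [1, 0, 0, 0.7, 37],
    [0, 0, 0, 0.0, 24],
], dtype=float)
y = np.array([1, 0, 1, 0])  # 1 = important, 0 = not_important

# Combine the TF-IDF text representation with the engineered columns.
tfidf = TfidfVectorizer()
X = hstack([tfidf.fit_transform(texts), csr_matrix(engineered)])

# The original applied SMOTE (from imblearn) to balance the training classes;
# this toy set is already balanced, so we fit directly.
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
```

The same feature matrix can be reused for each of the four classifiers, which keeps the model comparison fair.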
The performance of each model was evaluated using the macro-averaged F1-score to account for the class imbalance while weighting both classes equally. The results are summarised below:
| Model | Macro F1-Score |
| --- | --- |
| Random Forest | 0.638 |
| XGBoost | 0.642 |
| Logistic Regression | 0.589 |
| Support Vector Machine | 0.608 |
All the models outperformed the majority-class baseline, with Random Forest and XGBoost yielding the best scores. Confusion matrices indicate that most misclassifications occurred on borderline cases.
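For reference, the macro F1-score and confusion matrix can be computed with scikit-learn; the labels below are made up purely to show the calculation:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical gold labels vs. model predictions on a held-out split
# (1 = important, 0 = not_important).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Macro-averaging computes F1 per class, then takes the unweighted mean,
# so both classes count equally regardless of any imbalance.
macro_f1 = f1_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted
print(round(macro_f1, 2))  # 0.75
```

Reading the matrix row by row makes the borderline cases visible: off-diagonal counts are exactly the misclassifications discussed above.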
Yes, it is feasible to automatically identify petition importance. But due to the task’s inherent ambiguity and potential biases, the model is best viewed as an exploratory prototype, and any deployment would require more stakeholder input.
| Hyperparameter | Value | Description |
| --- | --- | --- |
| use_label_encoder | False | Disables the internal label encoder |
| eval_metric | 'logloss' | Specifies the loss function |
| max_depth | 6 | Max depth of each tree |
| n_estimators | 200 | Number of trees to fit |
| random_state | 42 | Ensures reproducibility of results |
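The settings above correspond to the following XGBClassifier configuration, sketched here as a parameter dict so it can be inspected without the xgboost package installed:

```python
# Hyperparameters from the table above, as keyword arguments for
# xgboost.XGBClassifier.
xgb_params = {
    "use_label_encoder": False,  # skip the deprecated internal label encoder
    "eval_metric": "logloss",    # log loss as the training evaluation metric
    "max_depth": 6,              # maximum depth of each boosted tree
    "n_estimators": 200,         # number of boosting rounds (trees)
    "random_state": 42,          # reproducible results
}

# model = xgboost.XGBClassifier(**xgb_params)  # requires `pip install xgboost`
```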
Conclusion
All models scored above 0.50, with XGBoost achieving the highest macro F1-score of 0.642. This confirms that the models could distinguish between ‘important’ and ‘not_important’ petitions. Balanced training with SMOTE and feature engineering contributed to the stable performance.
This task is achievable by using classifiers and linguistic features. However, the labelling remains a challenge; the model's success relies heavily on how labels were defined and interpreted during annotation.
Things that could be improved are:
Exploring richer NLP features with BERT or sentence-transformers for deeper semantic understanding, and incorporating external signals that could boost prediction, such as regional relevance or the number of signatures received.
XGBoost performed well on average, but the confusion matrix showed that ambiguity persists. Since importance depends on context and how people perceive it, the final predictions should be used with care and always reviewed through a human lens.
Confusion Matrix
Written by Renisa Mangal-King