Task 1 Topic Modelling & Classification


Model building
For the classification task, I experimented with several algorithms, including Logistic regression and Multinomial Naïve Bayes. Linear SVM was chosen because of its ability to handle outliers, the data set has a clear separation of the petition topics (classes), and it achieves high predictive accuracy.
Hyperparameter | Value | Description |
Penalty | ‘l2’ | Default parameter used to prevent overfitting |
loss | ‘squared_hinge’ | Default parameter for LinearSVC, commonly used for classification tasks |
C | 1.0 | Default parameter: Inverse of regularization strength |
class_weight | None | |
max_iter | 1000 | Ensures coverage for larger datasets |
dual | True | Solves the dual optimization problem |
random_state | Not Set |
Model evaluation
The final model was evaluated using standard metrics along with the confusion matrix. The test accuracy was 90%, meeting the client’s first requirement. Additionally, a detailed analysis of the confusion matrix showed that in at least five of the seven classes, less than 13% of the petitions were misclassified as unrelated topics. Importantly, the ‘uk government and devolution’ category exhibited an error rate of 6.29%, thereby satisfying the critical threshold for that class. To ensure that the model’s performance is continuously tracked in a balanced manner across all classes.
Confusion Matrix
Confusion Matrix Table
Topic | Total | Correct | Misclassified | Misclassified % |
Culture, sport, and media | 203 | 186 | 17 | 8.37 |
Economy, labour, and welfare | 238 | 201 | 37 | 15.54 |
Education | 231 | 214 | 17 | 7.36 |
Environment and animal welfare | 458 | 435 | 23 | 5.02 |
Health and social care | 408 | 352 | 56 | 13.73 |
London | 12 | 10 | 2 | 16.67 |
UK government and devolution | 143 | 134 | 9 | 6.29 |
Total | 1693 | 1532 | 161 | 9.51 |
Performance Metrics
Metric | Value | Definition |
Accuracy | 0.904903 | 90.5% of all the petitions were correctly classified into their respective topics. |
Precision | 0.905007 | When the model predicted a topic correctly about 90.5% of the time. |
Recall | 0.904903 | The model successfully identifies 90.5% of the true topic labels. |
F1 Score | 0.904519 | It is a balanced measure combining precision and recall. |
Conculsion
The Linear SVM model demonstrated consistent and accurate classification across all seven petition topics, achieving a macro F1-score of 0.90 and an overall accuracy 90%. It satisfied all client performance thresholds, with the “UK government and devolution” achieving a recall of 0.94, staying well within the 9% misclassification limit.
While performance was strong overall, minor deviations occurred in topics like “culture, sport and media” and “London”, suggesting subtle semantic overlap or data sparsity. Future enhancements could include:
Expanding the TF-IDF vectorizer
Applying Class-specific augmentation to improve underrepresented topics.
Investigate BERT-based embedding to better capture semantic nuances beyond shallow lexical features.
These refinements offer a path toward an even higher robust and interpretable model in real-world deployment.
Subscribe to my newsletter
Read articles from Renisa Mangal-King directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Renisa Mangal-King
Renisa Mangal-King
Passionate and driven data science student with a focus on leveraging data analytics and engineering to solve complex business problems, particularly within the fintech sector. With a strong foundation in modern data stacks, statistical modeling, and machine learning, I am dedicated to transforming raw data into actionable insights that drive growth and innovation.