Task 1: Topic Modelling & Classification

Model building

For the classification task, I experimented with several algorithms, including logistic regression and multinomial Naïve Bayes. I chose a linear SVM for three reasons: it is relatively robust to outliers, the petition topics (classes) are clearly separated in the data set, and it achieved high predictive accuracy.

| Hyperparameter | Value | Description |
| --- | --- | --- |
| penalty | 'l2' | Default parameter used to prevent overfitting |
| loss | 'squared_hinge' | Default parameter for LinearSVC, commonly used for classification tasks |
| C | 1.0 | Default parameter: inverse of regularization strength |
| class_weight | None | |
| max_iter | 1000 | Maximum number of solver iterations, allowing convergence on larger datasets |
| dual | True | Solves the dual optimization problem |
| random_state | Not set | |
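To make the configuration above concrete, here is a minimal sketch of how a `LinearSVC` with these hyperparameters might be wired into a TF-IDF pipeline in scikit-learn. The toy petitions, the labels, and the default `TfidfVectorizer()` settings are assumptions for illustration only; the actual preprocessing and training data are not shown in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy petitions (NOT the real data set), two per topic.
texts = [
    "ban fox hunting and protect wildlife",
    "protect wildlife habitats from pollution",
    "increase funding for schools and teachers",
    "reduce class sizes in schools",
]
labels = [
    "Environment and animal welfare",
    "Environment and animal welfare",
    "Education",
    "Education",
]

# Hyperparameters copied from the table; random_state is left unset.
model = make_pipeline(
    TfidfVectorizer(),  # assumed default settings
    LinearSVC(
        penalty="l2",
        loss="squared_hinge",
        C=1.0,
        class_weight=None,
        max_iter=1000,
        dual=True,
    ),
)
model.fit(texts, labels)

# Classify a new, unseen petition title.
prediction = model.predict(["protect wildlife"])[0]
print(prediction)
```

Because `random_state` is not set, the learned coefficients can differ very slightly between runs; fixing it would make results reproducible.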

Model evaluation

The final model was evaluated using standard metrics along with the confusion matrix. The test accuracy was 90%, meeting the client’s first requirement. A per-class breakdown of the confusion matrix showed that in four of the seven classes (culture, sport, and media; education; environment and animal welfare; UK government and devolution), fewer than 13% of the petitions were misclassified, while economy, labour, and welfare; health and social care; and London were above that level. Importantly, the ‘uk government and devolution’ category had an error rate of 6.29%, satisfying the critical threshold for that class. To ensure that the model’s performance is tracked in a balanced manner across all classes, the per-class results and the aggregate metrics are reported below.

Confusion Matrix

Confusion Matrix Table

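| Topic | Total | Correct | Misclassified | Misclassified % |
| --- | --- | --- | --- | --- |
| Culture, sport, and media | 203 | 186 | 17 | 8.37 |
| Economy, labour, and welfare | 238 | 201 | 37 | 15.55 |
| Education | 231 | 214 | 17 | 7.36 |
| Environment and animal welfare | 458 | 435 | 23 | 5.02 |
| Health and social care | 408 | 352 | 56 | 13.73 |
| London | 12 | 10 | 2 | 16.67 |
| UK government and devolution | 143 | 134 | 9 | 6.29 |
| Total | 1693 | 1532 | 161 | 9.51 |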
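The percentages in this table can be re-derived from the totals and correct counts alone. The short sketch below does that and checks each class against the 13% level and the 9% limit for "UK government and devolution" quoted in this post; the variable and function names are illustrative, not taken from the original analysis.

```python
# (total, correct) per topic, copied from the table above.
counts = {
    "Culture, sport, and media": (203, 186),
    "Economy, labour, and welfare": (238, 201),
    "Education": (231, 214),
    "Environment and animal welfare": (458, 435),
    "Health and social care": (408, 352),
    "London": (12, 10),
    "UK government and devolution": (143, 134),
}


def misclassified_pct(total, correct):
    """Share of petitions in a class that were NOT predicted correctly, in percent."""
    return 100.0 * (total - correct) / total


rates = {topic: misclassified_pct(t, c) for topic, (t, c) in counts.items()}

# Classes whose misclassification rate is below the 13% level.
under_13 = [topic for topic, rate in rates.items() if rate < 13.0]

# Overall error rate across all test petitions.
grand_total = sum(t for t, _ in counts.values())
grand_correct = sum(c for _, c in counts.values())
overall = misclassified_pct(grand_total, grand_correct)

# The critical class must stay under the 9% limit.
uk_ok = rates["UK government and devolution"] < 9.0

for topic, rate in rates.items():
    print(f"{topic}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

Note that per-class recall is simply 100% minus the misclassification rate, so the same counts also yield the recall figures.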

Performance Metrics

| Metric | Value | Definition |
| --- | --- | --- |
| Accuracy | 0.904903 | 90.5% of all petitions were correctly classified into their respective topics. |
| Precision | 0.905007 | When the model predicted a topic, it was correct about 90.5% of the time. |
| Recall | 0.904903 | The model successfully identified 90.5% of the true topic labels. |
| F1 Score | 0.904519 | A balanced measure combining precision and recall (their harmonic mean). |
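The recall reported above equals the accuracy to six decimal places, which is what support-weighted averaging produces, so these figures appear to be weighted averages over the seven classes (an inference, not something stated in the original analysis). The sketch below shows how such weighted metrics can be computed from a confusion matrix; the 3x3 matrix is made up for illustration and is not the model's real confusion matrix.

```python
def weighted_metrics(cm):
    """Support-weighted precision, recall, and F1 from a square confusion matrix.

    cm[i][j] is the number of items with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total

    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                          # true items in class i
        predicted = sum(cm[r][i] for r in range(n))   # items predicted as class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                      # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion matrix (rows = true class, columns = predicted class).
example_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
metrics = weighted_metrics(example_cm)
print(metrics)
```

With weighted averaging, each class's recall is multiplied by its share of the test set, so the supports cancel and the weighted recall always collapses to overall accuracy.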

Conclusion

The Linear SVM model demonstrated consistent and accurate classification across the seven petition topics, achieving an F1-score of 0.90 and an overall accuracy of 90%. It met the client's accuracy requirement, and the “UK government and devolution” class achieved a recall of 0.94, keeping its misclassification rate well within the 9% limit.

While performance was strong overall, minor deviations occurred in topics like “culture, sport and media” and “London”, suggesting subtle semantic overlap or data sparsity. Future enhancements could include:

Expanding the TF-IDF vectorizer.

Applying class-specific augmentation to improve underrepresented topics.

Investigating BERT-based embeddings to better capture semantic nuances beyond shallow lexical features.

These refinements offer a path toward an even more robust and interpretable model in real-world deployment.
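As one possible reading of "expanding the TF-IDF vectorizer", the sketch below adds bigrams to the feature set so that short phrases become features alongside single words. The `ngram_range=(1, 2)` setting and the two example titles are assumptions chosen for illustration; the post does not specify which vectorizer settings would be changed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical petition titles, used only to show the effect on the vocabulary.
docs = ["Ban fox hunting", "Fund local schools"]

# Baseline: single words only (the scikit-learn default).
unigram_vectorizer = TfidfVectorizer()
unigram_vectorizer.fit(docs)

# Expanded: single words plus two-word phrases (assumed setting).
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(docs)

unigram_vocab = set(unigram_vectorizer.vocabulary_)
bigram_vocab = set(bigram_vectorizer.vocabulary_)

# Features that exist only in the expanded vocabulary.
new_features = sorted(bigram_vocab - unigram_vocab)
print(new_features)
```

A larger vocabulary can help separate topics that share individual words, at the cost of a sparser, higher-dimensional feature matrix.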

Written by

Renisa Mangal-King

Passionate and driven data science student with a focus on leveraging data analytics and engineering to solve complex business problems, particularly within the fintech sector. With a strong foundation in modern data stacks, statistical modeling, and machine learning, I am dedicated to transforming raw data into actionable insights that drive growth and innovation.