Classifying Community Rule Violations with Transformers: A Walkthrough of My Jigsaw Kaggle Solution

Table of contents
- 📌 TL;DR
- 🧠 Problem Overview
- ✅ Input:
- 🎯 Output:
- 📁 Dataset & Files
- Exploratory Data Analysis
- 📊 Label Distribution
- 📝 Text Lengths
- 🛠️ Model Pipeline Overview
- 🔹 Model: microsoft/deberta-v3-base
- 🔹 Input Format
- 🔹 Tokenization
- 🔹 Custom Dataset
- 🧠 Model Architecture
- 🔁 Cross-Validation & Training
- Training Loop
- Example Results per Fold:
- 🧪 Test Inference
- 📈 Ideas for Improvement
- 📦 Conclusion
- 🔗 Useful Links
Author: Gauri Patil
Kaggle Notebook: Link | July 2025
📌 TL;DR
In this post, I’ll walk you through my solution for the Jigsaw Agile Community Rules Classification Kaggle competition. The challenge is to predict whether a community post violates a specific rule, given both the rule and the post body. I used a transformer-based model (DeBERTa v3), 5-fold cross-validation and some light EDA to build a competitive baseline submission.
🧠 Problem Overview
Online communities have rules: things like no spam, no personal attacks, or no legal advice. This competition asks:
*Given a rule and a post, does the post violate that rule?*
Unlike classic text classification, this is a pairwise alignment problem: you’re not just classifying a post, but whether it violates a specific rule.
✅ Input:
- `rule`: A moderation rule (e.g., “No spam or self-promotion”)
- `body`: The actual community post
- `subreddit`: Optional metadata
🎯 Output:
- Binary label: `1` if the post violates the rule, `0` otherwise
📁 Dataset & Files
The dataset has ~8,000 labeled samples and a separate test set.
- `train.csv`: Labeled training data
- `test.csv`: Unlabeled test data
- `sample_submission.csv`: Submission format
Here’s a preview of the training data:
| row_id | body | rule | rule_violation |
| --- | --- | --- | --- |
| 0 | Click here | No Advertising… | 0 |
| 1 | SD Stream… | No Advertising… | 0 |
| 2 | Try appealing | No Legal Advice | 1 |
Exploratory Data Analysis
📊 Label Distribution
The dataset is imbalanced, with more non-violations than violations:

```python
import seaborn as sns

sns.countplot(x='rule_violation', data=df)
```
📝 Text Lengths
- Most rules are ~10–15 words long
- Comments vary widely, but many fall in the 10–50 word range
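A quick word-count check reproduces these numbers (a minimal sketch, assuming `df` is the loaded `train.csv`):

```python
# Word counts for rules and post bodies
df["rule_len"] = df["rule"].str.split().str.len()
df["body_len"] = df["body"].str.split().str.len()
print(df[["rule_len", "body_len"]].describe())
```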
🛠️ Model Pipeline Overview
🔹 Model: microsoft/deberta-v3-base
DeBERTa-v3 is a strong transformer architecture, known for its disentangled attention mechanism and robust performance on low-resource datasets.
🔹 Input Format
I merged the `rule` and `body` into a single string:

```python
df["text"] = df["rule"] + " [SEP] " + df["body"]
```
This way, the model treats the pair as a contextualized prompt, similar to NLI tasks.
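As an aside, the Hugging Face tokenizer can also build the pair itself via its two-argument form, which inserts the model’s own separator tokens between the two segments (a sketch of that alternative, not what my notebook does):

```python
# Let the tokenizer handle segment separation instead of a manual " [SEP] "
enc = tokenizer(df["rule"].tolist(), df["body"].tolist(),
                padding='max_length', truncation=True, max_length=MAX_LEN)
```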
🔹 Tokenization
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=False)
```
🔹 Custom Dataset
```python
from torch.utils.data import Dataset

class JigsawDataset(Dataset):
    def __getitem__(self, idx):
        ...
        enc = tokenizer(text, padding='max_length', truncation=True, max_length=MAX_LEN)
        ...
```
🧠 Model Architecture
```python
import torch.nn as nn
from transformers import AutoModel

class JigsawModel(nn.Module):
    def __init__(self, model_path):
        super().__init__()
        self.base = AutoModel.from_pretrained(model_path)
        self.drop = nn.Dropout(0.2)
        self.out = nn.Linear(self.base.config.hidden_size, 1)
```
We take the [CLS] token’s hidden state (the first position), apply dropout, and use a single linear layer for the binary output.
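The forward pass is elided above; matching that description, a minimal version (a method added inside `JigsawModel`, not the exact competition code) could look like this:

```python
    def forward(self, input_ids, attention_mask):
        hidden = self.base(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]                            # first token = [CLS]
        return self.out(self.drop(cls)).squeeze(-1)   # one raw logit per sample
```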
🔁 Cross-Validation & Training
I used 5-Fold Stratified CV to ensure balanced label distribution in each fold:
```python
from sklearn.model_selection import StratifiedKFold

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```
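The splits are then consumed per fold, stratifying on the label (a short sketch; `df` is the training frame):

```python
for fold, (train_idx, val_idx) in enumerate(folds.split(df, df["rule_violation"])):
    train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
    # build datasets and loaders, then train one model per fold
```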
Training Loop
Each fold trains for 3 epochs, with a cosine scheduler and `BCEWithLogitsLoss`. We track AUC for validation:

```python
from sklearn.metrics import roc_auc_score

criterion = nn.BCEWithLogitsLoss()
auc = roc_auc_score(targets, preds)
```
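Condensed, one fold’s inner loop looks roughly like this (a sketch: `train_loader`, `optimizer`, `scheduler`, and `device` are assumed names, not taken from the notebook):

```python
for epoch in range(3):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"].to(device),
                       batch["attention_mask"].to(device))
        loss = criterion(logits, batch["label"].float().to(device))
        loss.backward()
        optimizer.step()
        scheduler.step()  # cosine schedule steps once per batch
```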
Example Results per Fold:
| Fold | Best AUC |
| --- | --- |
| 1 | 0.8591 |
| 2 | 0.8526 |
| 3 | 0.8941 |
| 4 | 0.8660 |
| 5 | 0.8346 |

Mean best AUC across folds: ~0.861.
🧪 Test Inference
For each fold, we:
- Load the best model
- Run predictions on the test set
- Average the predictions for the final output

```python
final_preds = np.mean(test_preds, axis=0)
sample["rule_violation"] = final_preds
sample.to_csv("submission.csv", index=False)
```
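Expanded, the per-fold part might look like this (a sketch: the checkpoint filenames and `test_loader` are assumed, and the sigmoid turns logits into probabilities before averaging):

```python
import numpy as np
import torch

test_preds = []
for fold in range(5):
    model.load_state_dict(torch.load(f"model_fold{fold}.pt"))  # hypothetical filename
    model.eval()
    fold_preds = []
    with torch.no_grad():
        for batch in test_loader:
            logits = model(batch["input_ids"].to(device),
                           batch["attention_mask"].to(device))
            fold_preds.append(torch.sigmoid(logits).cpu().numpy())
    test_preds.append(np.concatenate(fold_preds))
```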
📈 Ideas for Improvement
While this gives us a solid baseline, there’s room for optimization:
- Threshold tuning: Instead of using 0.5 for binary decisions, tune the threshold based on F1 or AUC (see the sketch after this list)
- Better input formatting: Use special tokens like `<rule>` and `<comment>`, or structure the input as prompts
- Multi-task learning: Use rule categories or subreddit as auxiliary tasks
- Use of examples: Fine-tune with the provided `positive_example_1` and `negative_example_2` columns
- Advanced ensembling: Try model snapshot averaging, rank averaging, or even a DeBERTa + RoBERTa combo
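For the first idea, a simple grid search over thresholds on out-of-fold predictions is enough (a sketch; `y_val` and `val_preds` are assumed to be validation labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.linspace(0.1, 0.9, 81)
best_t = max(thresholds, key=lambda t: f1_score(y_val, val_preds > t))
```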
📦 Conclusion
This problem was both fun and nuanced, requiring a mix of text matching, rule reasoning, and robust validation. Using DeBERTa-v3 with clean cross-validation gave a strong, reliable starting point for leaderboard submissions.
💬 Let me know if you want the full notebook or have questions about ensembling, pseudo-labeling, or model interpretation!
🔗 Useful Links
Thanks for reading! If you liked this post, follow me for more Kaggle deep dives and NLP content. ✨