Resume Classification Using Machine Learning and NLP (UpdatedResumeDataSet.csv – Kaggle)


🔍 Project Overview
In this project, I built a Resume Classification System using Machine Learning and Natural Language Processing (NLP) techniques. The goal was to automatically classify resumes into categories like IT, HR, Healthcare, etc., based on their content, which is useful for automating resume screening in HR and recruitment.
The dataset used was the UpdatedResumeDataSet.csv from Kaggle.
→ Key Concepts Applied
Text Preprocessing (Cleaning)
Label Encoding (categorical conversion)
TF-IDF Vectorization
Train-test split
Model Training (KNN, SVC, RandomForest)
Resume Prediction Function
Model Evaluation
📂 Dataset Description
The dataset contains two columns:
Resume: Free-form text containing the full resume content.
Category: The job field the resume falls under (e.g., IT, HR, Sales).
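A quick way to inspect the data is to load it with pandas and count resumes per category. This is a minimal sketch, assuming the CSV headers are named `Category` and `Resume`; the inline sample rows are made up so the snippet runs without the Kaggle file.

```python
import io
import pandas as pd

# Made-up stand-in rows; with the real file, use pd.read_csv("UpdatedResumeDataSet.csv").
sample_csv = io.StringIO(
    "Category,Resume\n"
    "HR,Handled recruitment and employee onboarding\n"
    "Data Science,Built ML models in Python\n"
    "HR,Managed payroll and HR policies\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)                       # (rows, columns)
print(df["Category"].value_counts())  # number of resumes per category
category_counts = df["Category"].value_counts().to_dict()
```

Checking the category counts early also shows whether some classes are much rarer than others, which matters when splitting the data later.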
→ Step 1: Data Cleaning & Preprocessing
I created a cleanResume() function to:
Remove URLs, punctuation, stopwords, and special characters
Convert all text to lowercase
Tokenize and lemmatize words
This step is critical to ensure that the data is clean and uniform for vectorization.
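To make the cleaning steps concrete, here is a minimal sketch of what such a function could look like. It is not the exact cleanResume() from the project: the name `clean_resume_sketch` and the tiny stopword set are assumptions for illustration, and lemmatization is left out because it needs an external resource such as NLTK's WordNet data.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "for", "at", "on", "i"}

def clean_resume_sketch(text: str) -> str:
    """Lowercase, strip URLs and non-letter characters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)               # remove punctuation, digits, symbols
    tokens = text.split()                                # simple whitespace tokenization
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean_resume_sketch("Skilled in Python & SQL! Portfolio: https://example.com/me")
print(cleaned)
```

Note that stripping every non-letter character also turns skill names like "C++" into "c", so a production cleaner may want to protect such terms before applying the regex.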
→ Step 2: Label Encoding
I used LabelEncoder() from sklearn.preprocessing to convert resume categories (strings) into numerical values that machine learning models can work with.
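As a small illustration of what the encoder does, the sketch below fits it on a few made-up category labels. The variable names `le` and `encoded_labels` follow the ones used later in this post; the sample labels are assumptions.

```python
from sklearn.preprocessing import LabelEncoder

categories = ["HR", "Data Science", "HR", "Sales"]  # made-up sample labels

le = LabelEncoder()
encoded_labels = le.fit_transform(categories)

print(list(le.classes_))             # classes are stored in sorted order
print(list(encoded_labels))          # integer code for each input label
print(le.inverse_transform([2])[0])  # map a code back to its category name
```

Keeping the fitted encoder around matters, because `inverse_transform` is what turns a model's numeric prediction back into a readable category.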
📊 Step 3: Vectorization (TF-IDF)
I converted the cleaned text into a numerical format using TF-IDF (Term Frequency – Inverse Document Frequency), which gives more weight to words that are distinctive for a resume and less weight to words that appear in most resumes.
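from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 5,000 most frequent terms across the corpus as features
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(cleaned_resumes)  # cleaned_resumes: the resumes after cleanResume()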
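To see the weighting in action, here is a small sketch on three made-up, already-cleaned snippets. The names `docs`, `vectorizer`, and `X_demo` are placeholders chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # made-up, already-cleaned resume snippets
    "python developer machine learning",
    "hr recruiter hiring",
    "python data analyst",
]

vectorizer = TfidfVectorizer()
X_demo = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_           # term -> column index
idf = vectorizer.idf_                    # learned inverse document frequencies

print(X_demo.shape)
# "python" occurs in two documents and "hr" in one, so "python" gets the lower IDF.
print(idf[vocab["python"]] < idf[vocab["hr"]])
```

The more documents a term appears in, the lower its IDF, which is how frequent but uninformative words end up with less influence on the classifier.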
→ Step 4: Splitting Dataset
Split the data into training and testing sets:
from sklearn.model_selection import train_test_split

# Hold out 20% of the resumes for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, encoded_labels, test_size=0.2, random_state=42)
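One variation worth considering: fitting TF-IDF on the full dataset before splitting lets the vocabulary and IDF weights see the test resumes. A minimal sketch of the alternative, which splits the raw text first and fits the vectorizer only on the training part, is shown here with made-up data. The `stratify` argument is an addition that keeps category proportions the same in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 10 short "resumes" per category.
texts = ["python java sql developer"] * 10 + ["recruitment payroll onboarding"] * 10
labels = [0] * 10 + [1] * 10

text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

tfidf_demo = TfidfVectorizer(max_features=5000)
X_train = tfidf_demo.fit_transform(text_train)  # learn vocabulary and IDF from training text only
X_test = tfidf_demo.transform(text_test)        # reuse the fitted vocabulary on the test text

print(X_train.shape, X_test.shape)
```

On a small dataset the difference in accuracy may be minor, but this ordering gives a more honest estimate of how the model handles unseen resumes.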
🤖 Step 5: Model Training
I trained three different models:
KNeighborsClassifier
SVC (Support Vector Classifier)
RandomForestClassifier
I then evaluated each model's accuracy on the test set.
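The training and comparison loop can be sketched as follows. The corpus here is a made-up stand-in with three categories, the models use scikit-learn's default hyperparameters (an assumption, since the settings used in the project are not listed), and the printed scores say nothing about results on the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Made-up stand-in corpus: three categories, 10 short "resumes" each.
texts = (["python java sql developer software"] * 10
         + ["recruitment payroll onboarding employee relations"] * 10
         + ["sales revenue client targets negotiation"] * 10)
labels = ["IT"] * 10 + ["HR"] * 10 + ["Sales"] * 10

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

models = {
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                                    # train on the training split
    scores[name] = accuracy_score(y_test, model.predict(X_test))   # accuracy on held-out data
    print(f"{name}: {scores[name]:.2f}")
```

Keeping the models in a dictionary makes it easy to add another classifier or swap accuracy for a per-class metric such as `classification_report`.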
→ Prediction Function
Here's the function that takes a resume and returns its predicted category:
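def pred(input_resume):
    """Return the predicted category name for a raw resume string."""
    cleaned_text = cleanResume(input_resume)
    # Vectorize with the already-fitted TF-IDF (transform, not fit_transform)
    vectorized_text = tfidf.transform([cleaned_text]).toarray()
    predicted_category = svc_model.predict(vectorized_text)
    # Map the numeric label back to its category name
    predicted_category_name = le.inverse_transform(predicted_category)
    return predicted_category_name[0]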
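Since `pred` depends on four objects defined elsewhere (`cleanResume`, `tfidf`, `svc_model`, and `le`), here is a self-contained sketch that wires them together on made-up data. The stand-in cleaner only lowercases text, and the tiny corpus is an assumption for illustration, not the Kaggle dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

def cleanResume(text):
    # Stand-in cleaner (assumption): the real function does much more than lowercase.
    return text.lower()

# Made-up training corpus: three categories, 10 short "resumes" each.
texts = (["python java sql developer software"] * 10
         + ["recruitment payroll onboarding employee relations"] * 10
         + ["sales revenue client targets negotiation"] * 10)
categories = ["IT"] * 10 + ["HR"] * 10 + ["Sales"] * 10

le = LabelEncoder()
y = le.fit_transform(categories)

tfidf = TfidfVectorizer()
# Dense arrays are used for training too, to match the .toarray() call in pred.
X = tfidf.fit_transform([cleanResume(t) for t in texts]).toarray()

svc_model = SVC().fit(X, y)

def pred(input_resume):
    cleaned_text = cleanResume(input_resume)
    vectorized_text = tfidf.transform([cleaned_text]).toarray()
    predicted_category = svc_model.predict(vectorized_text)
    return le.inverse_transform(predicted_category)[0]

result = pred("Experienced Python developer with SQL and Java skills")
print(result)
```

Words the vectorizer never saw during training (here "experienced" or "skills") are simply ignored at prediction time, so the result depends only on the overlap with the training vocabulary.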
→ Final Result
I deployed a prediction system that classifies resumes into fields like:
IT
HR
Health & Fitness
Sales
Teaching
And many more...
I tested it on real resume samples, and the SVC model returned the correct categories!
🔧 Tools & Technologies Used
Python
Scikit-learn
Pandas, NumPy
TF-IDF Vectorizer
LabelEncoder
ML Models: KNN, SVC, Random Forest
→ What I Learned
How to clean and preprocess real-world text data
Importance of choosing the right vectorization and ML model
How to build an end-to-end NLP pipeline for classification
Model saving and deployment using pickle
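For the pickle step, one possible approach is to save the vectorizer, model, and label encoder together, since prediction needs all three. This is a minimal sketch on made-up data; the file name `resume_classifier.pkl` and the dictionary keys are hypothetical choices, not the ones used in the project.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Made-up training data so the snippet runs on its own.
texts = ["python java sql developer"] * 5 + ["recruitment payroll onboarding"] * 5
categories = ["IT"] * 5 + ["HR"] * 5

le = LabelEncoder()
tfidf = TfidfVectorizer()
svc_model = SVC().fit(tfidf.fit_transform(texts), le.fit_transform(categories))

# Bundle everything prediction needs into one file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "resume_classifier.pkl")
with open(path, "wb") as f:
    pickle.dump({"tfidf": tfidf, "model": svc_model, "encoder": le}, f)

# Later, e.g. inside a web app, load the bundle and predict.
with open(path, "rb") as f:
    bundle = pickle.load(f)

features = bundle["tfidf"].transform(["python developer"])
restored_label = bundle["encoder"].inverse_transform(bundle["model"].predict(features))[0]
print(restored_label)
```

Pickled scikit-learn objects should be loaded with the same library version that saved them, and pickle files from untrusted sources should never be loaded.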
→ What’s Next?
Create a web app using Streamlit or Flask for interactive predictions
Add more data and fine-tune hyperparameters for better accuracy
Explore deep learning methods (e.g., BERT) for improved results
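For the hyperparameter tuning idea in the list above, a grid search over a TF-IDF plus SVC pipeline is one possible starting point. The sketch below uses made-up data, and the parameter grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up stand-in corpus: three categories, 10 short "resumes" each.
texts = (["python java sql developer software"] * 10
         + ["recruitment payroll onboarding employee relations"] * 10
         + ["sales revenue client targets negotiation"] * 10)
labels = ["IT"] * 10 + ["HR"] * 10 + ["Sales"] * 10

# make_pipeline names each step after its lowercased class name ("svc" here).
pipeline = make_pipeline(TfidfVectorizer(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)  # the vectorizer is refit inside every cross-validation fold

print(search.best_params_)
print(round(search.best_score_, 2))
```

Putting the vectorizer inside the pipeline means each fold fits TF-IDF only on its own training portion, so the cross-validated score is not inflated by the validation text.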
→ Let’s Connect
If you're working on a similar project or want to collaborate on NLP/ML, feel free to connect!
#MachineLearning #NLP #ResumeClassifier #Python #TFIDF #Kaggle #Sklearn #AI #Hashnode #MLProjects
Written by

Lokesh Patidar
Hey, I'm Lokesh Patidar! I'm a 2nd-year student at SATI Vidisha, passionate about AI, Machine Learning, Full-Stack Development, and DSA. What I'm Learning: Currently exploring Machine Learning 🤖 Completed DSA & Frontend Development 🌐 Now exploring Backend Development 💡 Interests: I love solving problems, building projects, and integrating AI into real-world applications. Excited to contribute to tech communities and share my learning journey! 📌 Follow my blog for insights on AI, ML, and Full-Stack projects!