🔍 Project Overview

In this project, I built a Resume Classification System using Machine Learning and Natural Language Processing (NLP) techniques. The goal was to automatically classify resumes into categories such as IT, HR, and Healthcare based on their content, which makes it a useful building block for automating HR and recruitment screening.

The dataset used was the UpdatedResumeDataSet.csv from Kaggle.

→Key Concepts Applied

Text Preprocessing (Cleaning)
Label Encoding (categorical conversion)
TF-IDF Vectorization
Train-test split
Model Training (KNN, SVC, RandomForest)
Resume Prediction Function
Model Evaluation

📂 Dataset Description

The dataset contains two columns:

Resume: Free-form text data containing the full resume content.
Category: The job field the resume falls under (e.g., IT, HR, Sales, etc.)

→Step 1: Data Cleaning & Preprocessing

I created a cleanResume() function to:

Remove URLs, punctuation, stopwords, and special characters
Convert all text to lowercase
Tokenize and lemmatize words

This step is critical to ensure that the data is clean and uniform for vectorization.
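
To make the cleaning steps above concrete, here is a minimal sketch of what such a function might look like. It is not the actual cleanResume() from this project: the name clean_resume_sketch, the tiny stopword set, and the choice to drop digits are all assumptions for illustration, and lemmatization is left out so the example runs with the standard library only.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "at", "for", "on", "is"}


def clean_resume_sketch(text: str) -> str:
    """Lowercase the text, strip URLs and non-letters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits (assumption), special characters
    tokens = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


sample = "Experienced Python developer with 5+ years at https://example.com, skilled in NLP & ML!"
cleaned_sample = clean_resume_sketch(sample)
print(cleaned_sample)
```

A lemmatizer would slot in after the stopword filter, mapping each token to its base form before the tokens are joined back together.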

→Step 2: Label Encoding

I used LabelEncoder() from sklearn.preprocessing to convert the resume categories (strings) into integer labels that machine learning models can work with.
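
A small example of what the encoder does, using made-up category labels rather than the real dataset. LabelEncoder sorts the unique labels and assigns each one an integer, and inverse_transform maps integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; the real dataset has many more categories.
categories = ["IT", "HR", "Sales", "IT"]

le = LabelEncoder()
encoded_labels = le.fit_transform(categories)

print(list(le.classes_))             # unique labels in sorted order
print(encoded_labels.tolist())       # integer id for each input label
print(le.inverse_transform([2])[0])  # map an id back to its label
```

The fitted encoder has to be kept around, because it is the only record of which integer corresponds to which category.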

📊 Step 3: Vectorization (TF-IDF)

I converted the cleaned text into numerical features using TF-IDF (Term Frequency – Inverse Document Frequency). It weights a word by how often it appears in a resume and down-weights words that appear across many resumes, so common terms do not dominate the features.

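from sklearn.feature_extraction.text import TfidfVectorizer
# Keep only the 5,000 most frequent terms across the corpus as features
tfidf = TfidfVectorizer(max_features=5000)
# cleaned_resumes: the resume texts after cleanResume() has been applied
X = tfidf.fit_transform(cleaned_resumes)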
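
To see the down-weighting of common terms in action, here is a toy example with three invented documents (not taken from the dataset). The word "python" appears in every document, so its inverse document frequency is lower than that of "recruiter", which appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents for illustration only.
docs = [
    "python developer machine learning",
    "python developer web backend",
    "python recruiter hiring onboarding",
]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# vocabulary_ maps each term to its column index; idf_ holds the learned IDF weights.
idf_python = vec.idf_[vec.vocabulary_["python"]]
idf_recruiter = vec.idf_[vec.vocabulary_["recruiter"]]

print(X_toy.shape)
print(f"idf(python)={idf_python:.3f}, idf(recruiter)={idf_recruiter:.3f}")
```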

→Step 4: Splitting Dataset

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split
# Hold out 20% of the resumes for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, encoded_labels, test_size=0.2, random_state=42)
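
One option worth considering when there are many categories is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses synthetic arrays, and passing stratify is a suggestion rather than something the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features, two balanced classes.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # preserve the class ratio in both splits
)

print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```

Without stratify, a rare category can end up entirely in one split, which makes the test accuracy harder to interpret.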

🤖 Step 5: Model Training

I trained three different models:

KNeighborsClassifier
SVC (Support Vector Classifier)
RandomForestClassifier

I then evaluated each model's accuracy on the test set.
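
A sketch of how the three models could be trained and compared in one loop. The toy corpus, the hyperparameters (n_neighbors=3, a linear kernel, 50 trees), and the variable names are assumptions for illustration, not the settings used in this project. The sketch also fits the vectorizer on the training texts only, which is a design choice that keeps test-set vocabulary out of the features.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Invented mini-corpus: 4 short "resumes" per category.
texts = [
    "python developer building machine learning models",
    "java backend engineer designing microservices",
    "software engineer with cloud and devops experience",
    "data engineer writing python and sql pipelines",
    "recruiter handling hiring and employee onboarding",
    "hr manager responsible for payroll and employee relations",
    "talent acquisition specialist screening candidates",
    "hr generalist managing benefits and hiring policies",
    "sales executive closing retail deals with customers",
    "account manager growing revenue and customer relationships",
    "sales representative meeting quarterly revenue targets",
    "business development manager generating sales leads",
]
labels = ["IT"] * 4 + ["HR"] * 4 + ["Sales"] * 4

# Split the raw texts first, then fit TF-IDF on the training portion only.
txt_tr, txt_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
vec = TfidfVectorizer()
X_tr = vec.fit_transform(txt_tr)
X_te = vec.transform(txt_te)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVC": SVC(kernel="linear"),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy={scores[name]:.2f}")
```

With a corpus this small the numbers mean very little; the point is only the shape of the comparison loop.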

→ Prediction Function

Here's the function that takes a resume and returns its predicted category:

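def pred(input_resume):
    # Apply the same cleaning used on the training data
    cleaned_text = cleanResume(input_resume)
    # Reuse the fitted TF-IDF vectorizer (transform only, never refit here)
    vectorized_text = tfidf.transform([cleaned_text]).toarray()
    predicted_category = svc_model.predict(vectorized_text)
    # Map the numeric label back to its category name
    predicted_category_name = le.inverse_transform(predicted_category)
    return predicted_category_name[0]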
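
The function above depends on the fitted vectorizer, model, and label encoder being available as globals. One alternative arrangement, shown here as a sketch on invented data and not as the code used in this project, is to bundle the vectorizer and classifier into a scikit-learn Pipeline so they always travel together. The name predict_category and the toy texts are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Invented training examples for illustration only.
texts = [
    "python developer building machine learning models",
    "java backend engineer designing microservices",
    "recruiter handling hiring and employee onboarding",
    "hr manager responsible for payroll and employee relations",
    "sales executive closing retail deals with customers",
    "sales representative meeting quarterly revenue targets",
]
labels = ["IT", "IT", "HR", "HR", "Sales", "Sales"]

le = LabelEncoder()
y = le.fit_transform(labels)

# The pipeline vectorizes and classifies in a single predict() call.
pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipe.fit(texts, y)


def predict_category(resume_text: str) -> str:
    """Return the predicted category name for one resume string."""
    label_id = pipe.predict([resume_text])
    return le.inverse_transform(label_id)[0]


result = predict_category("backend developer with python experience")
print(result)
```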

→Final Result

I deployed a prediction system that classifies resumes into fields like:

IT
HR
Health & Fitness
Sales
Teaching
And many more...

I tested it on real resume samples, and the SVC model returned the correct category for them!

🔧 Tools & Technologies Used

Python
Scikit-learn
Pandas, NumPy
TF-IDF Vectorizer
LabelEncoder
ML Models: KNN, SVC, Random Forest

→What I Learned

How to clean and preprocess real-world text data
Importance of choosing the right vectorization and ML model
How to build an end-to-end NLP pipeline for classification
Model saving and deployment using pickle
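
For the pickle step, the key detail is that the fitted vectorizer has to be saved together with the model, since new resumes must be transformed with the same vocabulary. Below is a minimal round-trip sketch on invented data; the dictionary keys and variable names are assumptions. It uses pickle.dumps and pickle.loads in memory, where a real project would typically write to a file with pickle.dump.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Invented training examples for illustration only.
texts = [
    "python developer building machine learning models",
    "java backend engineer designing microservices",
    "recruiter handling hiring and employee onboarding",
    "hr manager responsible for payroll and employee relations",
    "sales executive closing retail deals with customers",
    "sales representative meeting quarterly revenue targets",
]
labels = ["IT", "IT", "HR", "HR", "Sales", "Sales"]

tfidf_toy = TfidfVectorizer()
svc_toy = SVC(kernel="linear").fit(tfidf_toy.fit_transform(texts), labels)

# Serialize the vectorizer and model together, then restore them.
blob = pickle.dumps({"tfidf": tfidf_toy, "model": svc_toy})
restored = pickle.loads(blob)

new_resume = ["sales manager closing retail deals"]
before = svc_toy.predict(tfidf_toy.transform(new_resume))
after = restored["model"].predict(restored["tfidf"].transform(new_resume))
print(before[0], after[0])
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.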

→ What’s Next?

Create a web app using Streamlit or Flask for interactive predictions
Add more data and fine-tune hyperparameters for better accuracy
Explore deep learning methods (e.g., BERT) for improved results

→Let’s Connect

If you're working on a similar project or want to collaborate on NLP/ML, feel free to connect!

#MachineLearning #NLP #ResumeClassifier #Python #TFIDF #Kaggle #Sklearn #AI #Hashnode #MLProjects

Resume Classification Using Machine Learning and NLP (UpdatedResumeDataSet.csv – Kaggle)