Resume Classification Using Machine Learning and NLP (UpdatedResumeDataSet.csv – Kaggle)

Lokesh Patidar

🔍 Project Overview

In this project, I built a Resume Classification System using Machine Learning and Natural Language Processing (NLP) techniques. The goal was to automatically classify resumes into categories such as IT, HR, and Healthcare based on their content, which makes it a useful tool for HR and recruitment automation.

The dataset is UpdatedResumeDataSet.csv from Kaggle.


→ Key Concepts Applied

  • Text Preprocessing (Cleaning)

  • Label Encoding (categorical conversion)

  • TF-IDF Vectorization

  • Train-test split

  • Model Training (KNN, SVC, RandomForest)

  • Resume Prediction Function

  • Model Evaluation


📂 Dataset Description

The dataset contains two columns:

  • Resume: Free-form text data containing the full resume content.

  • Category: The job field the resume falls under (e.g., IT, HR, Sales).


→ Step 1: Data Cleaning & Preprocessing

I created a cleanResume() function to:

  • Remove URLs, punctuation, stopwords, and special characters

  • Convert all text to lowercase

  • Tokenize and lemmatize words

This step is critical to ensure that the data is clean and uniform for vectorization.
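As a rough illustration of the steps listed above, here is a minimal sketch of what such a cleaning function might look like. It is not the exact cleanResume() from the project: the name clean_resume_sketch and the tiny STOPWORDS set are placeholders, and lemmatization is left out to keep the example dependency-free (a lemmatizer such as NLTK's WordNetLemmatizer could be applied to the tokens before joining them).

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would
# likely use a fuller list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "for", "at", "on", "is", "i"}

def clean_resume_sketch(text):
    """Lowercase, strip URLs and special characters, drop stopwords."""
    text = text.lower()
    # Remove URLs before touching punctuation, so they disappear whole
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Replace anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Whitespace tokenization, then stopword filtering
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_resume_sketch("Skilled in Python & SQL! See https://example.com/cv")
print(cleaned)  # skilled python sql see
```

One side effect of replacing every special character with a space is that skill names such as "C++" or "Node.js" get split or truncated, so a real cleaning function may want to protect a few of those terms first.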


→ Step 2: Label Encoding

I used LabelEncoder from sklearn.preprocessing to convert the resume categories (strings) into integer labels that machine learning models can work with.
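For illustration, here is a small sketch of label encoding on a handful of made-up category values (the categories list is hypothetical, not taken from the dataset). LabelEncoder assigns integers in sorted order of the class names, and inverse_transform maps them back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration
categories = ["HR", "Data Science", "HR", "Sales"]

le = LabelEncoder()
encoded_labels = le.fit_transform(categories)

print(list(le.classes_))      # ['Data Science', 'HR', 'Sales']
print(list(encoded_labels))   # [1, 0, 1, 2]

# Map an encoded label back to its category name
print(le.inverse_transform([2])[0])  # Sales
```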


📊 Step 3: Vectorization (TF-IDF)

I converted the cleaned text into numerical features using TF-IDF (Term Frequency – Inverse Document Frequency). It gives more weight to words that are frequent in a given resume but rare across the whole corpus, so terms that appear in almost every resume do not dominate.

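from sklearn.feature_extraction.text import TfidfVectorizer

# Keep at most the 5,000 most frequent terms as features
tfidf = TfidfVectorizer(max_features=5000)
# cleaned_resumes: the resume texts after cleanResume() has been applied
X = tfidf.fit_transform(cleaned_resumes)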
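
To make the weighting concrete, here is a small sketch on three made-up documents (docs and the variable names are assumptions for illustration). The word "python" appears in two of the three documents, so within the first document it ends up with a lower weight than "sql", which appears only there.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "resumes" (hypothetical)
docs = [
    "python sql machine learning",
    "recruitment payroll employee relations",
    "python django rest api",
]

vectorizer = TfidfVectorizer(max_features=5000)
X_demo = vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

print(X_demo.shape)  # (3, 11): 3 documents, 11 distinct terms

# Compare the weights of a shared term and a unique term in document 0
python_weight = X_demo[0, vectorizer.vocabulary_["python"]]
sql_weight = X_demo[0, vectorizer.vocabulary_["sql"]]
print(sql_weight > python_weight)  # True
```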

→ Step 4: Splitting Dataset

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split

# Hold out 20% of the resumes for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, encoded_labels, test_size=0.2, random_state=42)
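
One variation worth considering: the snippets above fit the vectorizer on all resumes before splitting, so vocabulary and document frequencies from the test set influence the features. The sketch below splits the raw text first and fits TF-IDF on the training portion only. The toy texts and labels, and the stratify argument, are assumptions for illustration and not part of the original project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 5 "IT" and 5 "HR" resumes
texts = [
    "python developer sql", "java backend engineer", "cloud devops engineer",
    "data analyst python", "software tester automation",
    "recruitment specialist hiring", "payroll and benefits", "employee relations manager",
    "talent acquisition recruiter", "hr generalist onboarding",
]
labels = ["IT"] * 5 + ["HR"] * 5

# Split the raw text first; stratify keeps class proportions similar in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

vec = TfidfVectorizer(max_features=5000)
X_train = vec.fit_transform(text_train)  # learn vocabulary and IDF from training text only
X_test = vec.transform(text_test)        # reuse the fitted vocabulary, no refitting

print(X_train.shape[0], X_test.shape[0])  # 8 2
```

Terms that occur only in the test resumes are simply ignored by transform, which mirrors what happens when a new, unseen resume is classified later.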

🤖 Step 5: Model Training

I trained three different models:

  1. KNeighborsClassifier

  2. SVC (Support Vector Classifier)

  3. RandomForestClassifier

I then evaluated each model's accuracy on the test set.
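
A comparison like this can be sketched as a simple loop over the three classifiers. The toy data and the hyperparameters (n_neighbors=3, the default SVC kernel, 50 trees) are assumptions for illustration; the post does not state which settings were used, and scores on data this small say nothing about real performance.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy training and test data (hypothetical)
train_texts = [
    "python developer sql", "java backend engineer",
    "recruitment hiring payroll", "employee relations onboarding",
    "sales target client", "retail sales customer",
]
train_labels = ["IT", "IT", "HR", "HR", "Sales", "Sales"]
test_texts = ["python engineer", "payroll hiring", "customer sales"]
test_labels = ["IT", "HR", "Sales"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVC": SVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model on the same features and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```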


→ Prediction Function

Here's the function that takes a resume and returns its predicted category:

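def pred(input_resume):
    # Apply the same cleaning that was used on the training data
    cleaned_text = cleanResume(input_resume)
    # Vectorize with the already-fitted TF-IDF vectorizer (transform, not fit)
    vectorized_text = tfidf.transform([cleaned_text]).toarray()
    # Predict the encoded label, then map it back to the category name
    predicted_category = svc_model.predict(vectorized_text)
    predicted_category_name = le.inverse_transform(predicted_category)
    return predicted_category_name[0]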
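
To try pred() in isolation, its four dependencies (cleanResume, tfidf, svc_model, le) need to exist first. The sketch below builds toy stand-ins for them; the training texts, the simplified cleanResume, and the linear kernel are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

def cleanResume(text):
    # Simplified stand-in (assumption): lowercase only
    return text.lower()

# Toy training data (hypothetical)
train_texts = [
    "python java sql developer",
    "software engineer python cloud",
    "recruitment payroll employee onboarding",
    "hr manager recruitment training",
]
train_categories = ["IT", "IT", "HR", "HR"]

le = LabelEncoder()
y = le.fit_transform(train_categories)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform([cleanResume(t) for t in train_texts])

svc_model = SVC(kernel="linear").fit(X, y)

def pred(input_resume):
    cleaned_text = cleanResume(input_resume)
    vectorized_text = tfidf.transform([cleaned_text]).toarray()
    predicted_category = svc_model.predict(vectorized_text)
    return le.inverse_transform(predicted_category)[0]

result = pred("Python developer with SQL experience")
print(result)
```

The important detail is that pred() reuses the fitted tfidf and le objects rather than creating new ones, so the new resume is mapped into the same feature space and label set the model was trained on.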

→ Final Result

I deployed a prediction system that classifies resumes into fields like:

  • IT

  • HR

  • Health & Fitness

  • Sales

  • Teaching

  • And many more...

I tested it on real resume samples, and the SVC model returned the expected categories.


🔧 Tools & Technologies Used

  • Python

  • Scikit-learn

  • Pandas, NumPy

  • TF-IDF Vectorizer

  • LabelEncoder

  • ML Models: KNN, SVC, Random Forest


→ What I Learned

  • How to clean and preprocess real-world text data

  • The importance of choosing the right vectorization method and ML model

  • How to build an end-to-end NLP pipeline for classification

  • Model saving and deployment using pickle
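
As a sketch of the pickle step, the fitted vectorizer and model can be saved together and reloaded later. The file name resume_classifier.pkl, the dictionary layout, and the toy training data are assumptions for illustration, not the project's actual artifacts.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy model (hypothetical data) standing in for the trained classifier
texts = ["python sql developer", "java cloud engineer",
         "recruitment payroll hiring", "employee onboarding training"]
labels = ["IT", "IT", "HR", "HR"]

tfidf = TfidfVectorizer()
svc_model = SVC(kernel="linear").fit(tfidf.fit_transform(texts), labels)

# Save the vectorizer and the model together so they stay in sync
path = os.path.join(tempfile.mkdtemp(), "resume_classifier.pkl")
with open(path, "wb") as f:
    pickle.dump({"tfidf": tfidf, "model": svc_model}, f)

# Later (for example inside a web app): load and predict
with open(path, "rb") as f:
    restored = pickle.load(f)

sample = ["python sql developer"]
original_pred = svc_model.predict(tfidf.transform(sample))
restored_pred = restored["model"].predict(restored["tfidf"].transform(sample))
print(list(original_pred) == list(restored_pred))  # True
```

Pickle files should only be loaded from trusted sources, and they are generally expected to be loaded with the same scikit-learn version that created them.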


→ What’s Next?

  • Create a web app using Streamlit or Flask for interactive predictions

  • Add more data and fine-tune hyperparameters for better accuracy

  • Explore deep learning methods (e.g., BERT) for improved results


→ Let’s Connect

If you're working on a similar project or want to collaborate on NLP/ML, feel free to connect!

LinkedIn


#MachineLearning #NLP #ResumeClassifier #Python #TFIDF #Kaggle #Sklearn #AI #Hashnode #MLProjects

