From Raw to Refined: Preparing Telecom Data for AI Projects

Cleaning and Preprocessing Telecom & User Data for AI
Ever tried teaching a toddler calculus? That’s what it's like asking an AI to learn from raw, messy data. You won’t be getting predictions, you’ll be getting chaos.
So before any AI can predict customer churn like a clairvoyant, we’ve got some heavy lifting to do. Our job? Give our dataset the full teen-movie makeover: wash off the dirt, fix the missing bits, iron out the wrinkles, and make sure every feature shows up to the AI party like it just walked out of a dramatic montage scene in a 2000s high school rom-com.
Predicting who is about to break up with you (a.k.a. churn)
Welcome to the world of churn prediction, where telecom companies are in full-on panic mode trying to stop customers from ghosting them.
The dataset we’re using is a real-world collection of customer profiles, including:
Demographics (like gender, age group)
Services they’re using (internet, streaming, phone lines)
Payment info and billing amounts
And whether they’ve said “boy, bye” to the company
We are going to clean and prepare the data so that it can be used to train an AI to predict whether a customer is about to leave, based on their usage patterns and behaviour.
So in this blog, we’ll:
Take raw, messy customer data
Clean and transform it so machines can read it
Analyze patterns that affect churn
And prep everything for modeling (coming in Blog #2)
1.The Grand Unveiling: Peeking into the Data
We start by loading the Telco Customer Churn dataset, a charming mess of 7043 rows and 21 columns, featuring everything from customer ID and billing details to whether someone has churned.
We look at:
head() : the first few rows, like the opening act of a concert.
info() : gives us data types and non-null counts, so missing values stand out.
describe() : gives stats on numeric columns (mean, std, max, etc.)
Takeaway? Most of our data is stored as objects (i.e., strings), and one column (TotalCharges) is supposed to be numeric... but isn’t. Classic.
2.Zeros and Missing Charges
Some customers have a tenure of zero (they just joined), but somehow they already have a TotalCharges amount filled in. This is like someone saying they just entered a restaurant and already paid for dessert.
So we:
Convert TotalCharges to numeric using errors='coerce' to catch weird non-numeric entries.
Set those suspicious charge values to zero. If your tenure is 0, your charges better be too.
You see, AI models are like picky eaters. They only want numbers. So we convert columns like TotalCharges (which might be read as strings because of a few rogue spaces or nulls) into proper numeric types. This is like making sure your actor learns the script before filming starts. The suspicious TotalCharges values? Turns out, they weren't real values at all, just pretending to be.
So next step, we call in our trusty friend fillna() to handle the blanks like a pro.
3.Making Categorical Data Behave
Let’s talk about the divas of our dataset, the ‘object’ columns. These are the text-based columns that describe your customers with words instead of numbers. You know, the drama queens like ‘Yes’, ‘No’, ‘Month-to-month’, ‘Male’, ‘Female’, and so on.
Humans? We love words.
AI? Not so much.
To an AI model, ‘Yes’ and ‘No’ are just arbitrary noise. ‘Male’ and ‘Female’ might as well be ‘Penguin’ and ‘DragonFruit’. Without numbers, it can’t find any pattern. So our next task is to transform the words into numeric data.
3.1 Label Encoding:
Label Encoding is a straightforward transformation we use when the category has only two possible values.
We’re talking simple binaries like Yes/No or Male/Female
In Label Encoding, we assign:
‘Yes’ → 1, ‘No’ → 0
‘Female’ → 1, ‘Male’ → 0
Why? Because that’s all the model needs to get the message: one thing means "on", the other means "off". Think of it like giving the AI a tiny cheat sheet instead of a novel.
3.2 One-Hot Encoding:
One-Hot Encoding is for when categories demand their own stage: the high-maintenance columns, the ones with more than two categories. These are your 'InternetService', 'Contract', and 'PaymentMethod' kind of columns. Label Encoding wouldn’t work here — because assigning numbers like 0, 1, 2 would trick the model into thinking there's an order or ranking. (Spoiler: there isn't.)
So what do we do?
We pull out the big guns: One-Hot Encoding. This technique creates a whole new column for each category — turning one column into several boolean flags (0 or 1). Each row gets a '1' in the column that matches its category, and '0' in all others.
Example:
If a customer has a ‘Two Year’ contract, they’ll get:
Contract_Month-to-month: 0
Contract_One year: 0
Contract_Two year: 1
It’s like casting each category in its own role in a Broadway ensemble. No one’s the main character but they all get a moment to shine.
We also have to drop the deadweight: columns that don’t serve a purpose while training the AI model, like customerID.
df = df.drop('customerID', axis=1)
4.Exploratory Data Analysis (EDA)
Before you can build a model, you gotta meet your data like you're on a first date. You wouldn’t marry someone without asking questions first, right? Same goes for datasets.
EDA, Exploratory Data Analysis, is that first date.
It’s when you take your data out for a coffee, ask some deep questions, get a sense of its vibes, red flags, weird habits, and possible secrets it’s hiding behind its columns.
In short?
It’s where you look before you leap into modeling. Now that the dataset speaks AI, it’s time to look for patterns. Who’s churning? Who’s sticking around?
Because if you skip it, your model could:
Learn nonsense relationships (“Oh, customers with 2 in their ID churn more!”)
Choke on missing values like it’s lactose intolerant
Think ‘Yes’ > ‘No’ because you fed it unencoded strings
Be biased, skewed, or just plain wrong
EDA is your early-warning system. It tells you:
What’s wrong with your data
What’s interesting about it
What might help your model learn better
And it helps you plan your next moves like a boss.
4.1 Class Imbalance Check
This is major when you’re predicting churn (or anything else binary).
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()
And we realize: Oh wow, like 73% of people didn’t churn. That’s a huge imbalance.
That means the model could get lazy and just predict “no churn” for everyone and still get decent accuracy. We'll need to balance this later during modeling.
4.2 Relationships Between Features
Now comes the spicy part: exploring relationships.
We ask:
Does tenure impact churn?
Do people with higher monthly charges churn more?
Does the contract type make a difference?
So essentially, What makes a customer stick around versus ghost the service provider?
4.2.1 Tenure Vs Churn
“It’s not you, it’s... oh wait, maybe it is you.”
sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure vs Churn')
plt.show()
We’re plotting the number of months (tenure) a customer has been with the company, separated by whether they churned or not.
What we see:
People who left generally had a shorter tenure.
The longer someone has stayed, the less likely they are to churn.
Makes sense, right?
If I’ve been using the same Wi-Fi for 3 years, chances are I’ve either become numb to the pain or I’m on a long-term contract. Either way, I’m not leaving.
4.2.2 Monthly Charges vs. Churn and TotalCharges vs. Churn:
“Breakups are expensive.”
sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges vs Churn')
plt.show()
sns.boxplot(x='Churn', y='TotalCharges', data=df)
plt.title('Total Charges vs Churn')
plt.show()
This boxplot shows us how much people pay per month and how that relates to whether they churned.
What we see:
Customers with higher monthly charges are more likely to leave.
There’s a sharp rise in churn in the higher charge range.
The moral? People hate paying more than they feel something’s worth especially when Netflix isn’t loading during the season finale.
4.2.3 Contract Type vs. Churn:
“Commitment issues? We see you.”
Before we plot, we first reverse-engineer the one-hot encoded contract columns to get a readable version and then plot it.
df['ContractType'] = df[[
'Contract_Month-to-month',
'Contract_One year',
'Contract_Two year'
]].idxmax(axis=1).str.replace('Contract_', '')
sns.countplot(x='ContractType', hue='Churn', data=df)
plt.title('Churn by Contract Type')
plt.show()
We’re checking what kind of contracts customers had and how often they churned.
What we see:
Month-to-month customers churn like it’s a trend.
Two-year contracts? Very little churn.
Contracts = commitment.
The less committed the contract, the more likely the customer is to peace out. (Relationship advice, anyone?)
4.2.4 Internet Service Type vs. Churn:
“No internet? No reason to stay.”
Same trick as before: first reverse-engineer the one-hot encoded contract columns to get a readable version and then plot it.
df['InternetType'] = df[[
'InternetService_DSL',
'InternetService_Fiber optic',
'InternetService_No'
]].idxmax(axis=1).str.replace('InternetService_', '')
sns.countplot(x='InternetType', hue='Churn', data=df)
plt.title('Churn by Internet Service Type')
plt.show()
What we see:
Customers with fiber optic internet churn more. Possibly because it’s more expensive.
People with no internet service don’t churn, probably because there’s not much to churn from.
Hypothesis?
People using basic DSL are content, fiber optic users might feel it’s not worth the cost, and non-users are probably just keeping the phone line active.
4.2.5 Payment Method vs. Churn:
“How you pay = how likely you’ll stay.”
You know the deal by now: reverse-engineer the one-hot encoded data, then plot.
df['PaymentType'] = df[[
'PaymentMethod_Electronic check',
'PaymentMethod_Mailed check',
'PaymentMethod_Bank transfer (automatic)',
'PaymentMethod_Credit card (automatic)'
]].idxmax(axis=1).str.replace('PaymentMethod_', '')
sns.countplot(x='PaymentType', hue='Churn', data=df)
plt.title('Churn by Payment Method')
plt.xticks(rotation=30)
plt.show()
We’re asking: does your payment method influence your loyalty?
What we see:
Electronic check users churn a lot more.
Auto payments (bank or credit) are associated with lower churn.
Insight: Friction in the payment process might push people away. Autopay = autopilot = they forget they’re even paying.
Our Dataset's Red Flags
| Feature | High Churn Risk Group |
| --- | --- |
| Tenure | Short-term customers |
| Monthly Charges | High spenders |
| Contract Type | Month-to-month |
| Internet Type | Fiber optic users |
| Payment Method | Electronic check |
4.3 Correlation Heatmap
Before we send our AI model off to make predictions, let’s eavesdrop on the relationships between features. Who’s secretly flirting with Churn behind the scenes?
Enter: The Correlation Heatmap
Think of it like this: we’ve invited all our features to a party. The correlation heatmap tells us who’s vibing (positively correlated), who’s got beef (negatively correlated), and who’s just awkwardly ignoring each other (no correlation).
The values range from:
+1: besties - as one goes up, so does the other
-1: sworn enemies - as one goes up, the other goes down
0: total strangers
Colors help:
Red: strong positive bond
Blue: strong negative bond
White-ish: meh, neutral
But we care about Churn. So who’s whispering secrets in Churn’s ear?
plt.figure(figsize=(14,10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=True)
plt.title('Correlation Heatmap')
plt.show()
What the Heatmap Shows
Churn is:
Negatively correlated with:
Tenure (−0.35): People who stick around longer tend not to churn. Obvious, but good to confirm.
Contract_Two year and Contract_One year: Long contracts = less churn.
OnlineSecurity (−0.28) & TechSupport (−0.26): Customers with support/security are way less likely to leave.
Positively correlated with:
MonthlyCharges (+0.19): Higher bills lead to higher churn. No surprises there.
Contract_Month-to-month: These folks churn more, because they can.
Other Things Worth Noting:
Tenure and TotalCharges: Very positively correlated (0.83) → Makes sense: longer tenure, more total money.
OnlineSecurity, TechSupport, and DeviceProtection are strongly related to each other → They could be bundled services.
InternetService_Fiber optic has a positive correlation with churn, while InternetService_No has a negative correlation → People with fiber seem to leave more, maybe due to expectations vs reality?
This heatmap is your AI intuition cheat code:
It tells you which features actually matter for churn prediction.
Helps you avoid collinearity traps in modeling (don’t feed the model redundant info).
Gives you ammo for feature selection, hypothesis testing, and feature engineering.
Conclusion:
We started this journey with a messy pile of raw customer data, kinda like the "before" scene in every teen movie. But after some heavy-duty cleaning, encoding, and a glam makeover (thanks to EDA), our dataset is finally red-carpet ready for machine learning.
We’ve:
Ironed out missing values and fixed shady datatypes.
Trained our categorical rebels (objects) to behave using encoding.
Dug deep with EDA to spot what really drives customers to leave.
Found out long-term users and 2-year contract folks are loyal ride-or-dies.
Confirmed that high bills and no tech support are a recipe for churn drama.
And we visualized it all with heatmaps and boxplots that spilled all the tea.
Now, our data’s not just cleaned, it’s insightful. It knows who's ghosting and why. So the next time we build a churn prediction model, we’ll have more than just numbers we’ll have context, correlations, and clues.
In the next blog, we’ll move from exploration to prediction. We'll train a range of models to see how well we can forecast customer churn, evaluate their performance, and even interpret which features most influence churn behavior.
P.S. If you’re curious about the code behind all this magic, I’ve got it all up on [GitHub].
Go snoop, fork, or run it yourself!