From Raw to Refined: Preparing Telecom Data for AI Projects

Thejaswini Arun
11 min read

Cleaning and Preprocessing Telecom & User Data for AI

Ever tried teaching a toddler calculus? That’s what it's like asking an AI to learn from raw, messy data. You won’t be getting predictions, you’ll be getting chaos.

So before any AI can predict customer churn like a clairvoyant, we’ve got some heavy lifting to do. Our job? Give our dataset the full teen-movie makeover: wash off the dirt, fix the missing bits, iron out the wrinkles, and make sure every feature shows up to the AI party like it just walked out of a dramatic montage scene in a 2000s high school rom-com.

Predicting who is about to break up with you (a.k.a. Churn)

Welcome to the world of churn prediction, where telecom companies are in full-on panic mode trying to stop customers from ghosting them.

The dataset we’re using is a real-world collection of customer profiles, including:

  • Demographics (like gender, age group)

  • Services they’re using (internet, streaming, phone lines)

  • Payment info and billing amounts

  • And whether they’ve said “boy, bye” to the company

We are going to clean and prepare the data so that it can be used to train an AI to predict whether a customer is about to leave, based on their usage patterns and behaviour.

So in this blog, we’ll:

  • Take raw, messy customer data

  • Clean and transform it so machines can read it

  • Analyze patterns that affect churn

  • And prep everything for modeling (coming in Blog #2)

1. The Grand Unveiling: Peeking into the Data

We start by loading the Telco Customer Churn dataset, a charming mess of 7,043 rows and 21 columns, featuring everything from customer ID and billing details to whether someone has churned.

We look at:

  • head(): the first few rows, like the opening act of a concert.

  • info(): data types and non-null counts, so missing values have nowhere to hide.

  • describe(): summary stats for the numeric columns (mean, std, max, etc.).
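Here’s a minimal sketch of that first look in a notebook (the CSV filename is an assumption; point it at wherever your copy of the dataset lives):

import pandas as pd

# Load the Telco churn data (filename is a placeholder, use your local path)
df = pd.read_csv('Telco-Customer-Churn.csv')

df.head()       # the opening act: first five rows
df.info()       # column dtypes and non-null counts
df.describe()   # mean, std, min, max for numeric columns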

Takeaway? Most of our data is stored as objects (i.e., strings), and one column (TotalCharges) is supposed to be numeric... but isn’t. Classic.

2. Zeros and Missing Charges

Some customers have a tenure of zero (they just joined), but somehow they already have a TotalCharges amount filled in. This is like someone saying they just entered a restaurant and already paid for dessert.

So we:

  • Convert TotalCharges to numeric using errors='coerce' to catch weird non-numeric entries.

  • Set those suspicious charge values to zero. If your tenure is 0, your charges better be too.

You see, AI models are like picky eaters. They only want numbers. So we convert columns like TotalCharges (which might be read as strings because of a few rogue spaces or nulls) into proper numeric types. This is like making sure your actor learns the script before filming starts. The suspicious TotalCharges values? Turns out, they weren’t real values at all, just pretending to be.

So next step, we call in our trusty friend fillna() to handle the blanks like a pro.
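A minimal sketch of both steps together (assuming pandas is imported as pd and the dataframe lives in df):

# Coerce TotalCharges to numeric; blanks and rogue spaces become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# A tenure of 0 means no billing history yet, so those NaNs become 0
df['TotalCharges'] = df['TotalCharges'].fillna(0)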

3. Making Categorical Data Behave

Let’s talk about the divas of our dataset: the ‘object’ columns. These are the text-based columns that describe your customers with words instead of numbers. You know, the drama queens like ‘Yes’, ‘No’, ‘Month-to-month’, ‘Male’, ‘Female’, and so on.

Humans? We love words.
AI? Not so much.

To an AI model, ‘Yes’ and ‘No’ are just arbitrary noise. ‘Male’ and ‘Female’ might as well be ‘Penguin’ and ‘DragonFruit’. Without numbers, it can’t find any pattern. So our next task is to transform the words into numeric data.

3.1 Label Encoding:

Label Encoding is a straightforward transformation we use when a category has only two possible values.

We’re talking simple binaries like Yes/No or Male/Female.

In Label Encoding, we assign:

  • ‘Yes’ → 1, ‘No’ → 0

  • ‘Female’ → 1, ‘Male’ → 0

Why? Because that’s all the model needs to get the message: one thing means "on", the other means "off". Think of it like giving the AI a tiny cheat sheet instead of a novel.
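Here’s a tiny sketch of that cheat sheet in pandas (the column names are the usual ones from the Telco dataset; adjust if yours differ):

# Two-category columns get the cheat-sheet treatment: one value is 1, the other 0
binary_map = {'Yes': 1, 'No': 0, 'Female': 1, 'Male': 0}
for col in ['gender', 'Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']:
    df[col] = df[col].map(binary_map)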

3.2 One-Hot Encoding:

One-Hot Encoding is for when categories demand their own stage: the high-maintenance columns, the ones with more than two categories. These are your 'InternetService', 'Contract', 'PaymentMethod' kind of columns. Label Encoding wouldn’t work here, because assigning numbers like 0, 1, 2 would trick the model into thinking there's an order or ranking. (Spoiler: there isn't.)

So what do we do?

We pull out the big guns: One-Hot Encoding. This technique creates a whole new column for each category — turning one column into several boolean flags (0 or 1). Each row gets a '1' in the column that matches its category, and '0' in all others.

Example:
If a customer has a ‘Two Year’ contract, they’ll get:

  • Contract_Month-to-month: 0

  • Contract_One year: 0

  • Contract_Two year: 1

It’s like casting each category in its own role in a Broadway ensemble. No one’s the main character but they all get a moment to shine.
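In pandas, pd.get_dummies() casts the whole ensemble in one line. A minimal sketch, assuming these are the multi-category columns we care about:

# One-hot encode the multi-category divas: one 0/1 column per category
df = pd.get_dummies(df, columns=['InternetService', 'Contract', 'PaymentMethod'], dtype=int)

Note we keep every dummy column (no drop_first) because the EDA section below reverse-engineers them back into readable labels.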

We also have to drop the deadweight: columns that don’t serve any purpose while training the AI model, like customer ID or customer name.

df = df.drop('customerID', axis=1)

4. Exploratory Data Analysis (EDA)

Before you can build a model, you gotta meet your data like you're on a first date. You wouldn’t marry someone without asking questions first, right? Same goes for datasets.

EDA, or Exploratory Data Analysis, is that first date.
It’s when you take your data out for a coffee, ask some deep questions, get a sense of its vibes, red flags, weird habits, and possible secrets it’s hiding behind its columns.

In short?
It’s where you look before you leap into modeling. Now that the dataset speaks AI, it’s time to look for patterns. Who’s churning? Who’s sticking around?

Because if you skip it, your model could:

  • Learn nonsense relationships (“Oh, customers with 2 in their ID churn more!”)

  • Choke on missing values like it’s lactose intolerant

  • Think ‘Yes’ > ‘No’ because you fed it unencoded strings

  • Be biased, skewed, or just plain wrong

EDA is your early-warning system. It tells you:

  • What’s wrong with your data

  • What’s interesting about it

  • What might help your model learn better

And it helps you plan your next moves like a boss.

4.1 Class Imbalance Check

This is major when you’re predicting churn (or anything else binary).

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()

And we realize: Oh wow, like 73% of people didn’t churn. That’s a huge imbalance.

That means the model could get lazy and just predict “no churn” for everyone and still get decent accuracy. We’ll need to balance this later during modeling.
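For the exact numbers behind the plot, one line spills the tea:

# Share of non-churners vs. churners; confirms the roughly 73/27 split
print(df['Churn'].value_counts(normalize=True))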

4.2 Relationships Between Features

Now comes the spicy part: exploring relationships.

We ask:

  • Does tenure impact churn?

  • Do people with higher monthly charges churn more?

  • Does the contract type make a difference?

So essentially: what makes a customer stick around versus ghost the service provider?


4.2.1 Tenure Vs Churn

“It’s not you, it’s... oh wait, maybe it is you.”

sns.boxplot(x='Churn', y='tenure', data=df)
plt.title('Tenure vs Churn')
plt.show()

We’re plotting the number of months (tenure) a customer has been with the company, separated by whether they churned or not.

What we see:

  • People who left generally had a shorter tenure.

  • The longer someone has stayed, the less likely they are to churn.

Makes sense, right?
If I’ve been using the same Wi-Fi for 3 years, chances are I’ve either become numb to the pain or I’m on a long-term contract. Either way, I’m not leaving.


4.2.2 Monthly Charges vs. Churn and TotalCharges vs. Churn:

“Breakups are expensive.”

sns.boxplot(x='Churn', y='MonthlyCharges', data=df)
plt.title('Monthly Charges vs Churn')
plt.show()
sns.boxplot(x='Churn', y='TotalCharges', data=df)
plt.title('Total Charges vs Churn')
plt.show()

This boxplot shows us how much people pay per month and how that relates to whether they churned.

What we see:

  • Customers with higher monthly charges are more likely to leave.

  • There’s a sharp rise in churn in the higher charge range.

The moral? People hate paying more than they feel something’s worth, especially when Netflix isn’t loading during the season finale.


4.2.3 Contract Type vs. Churn:

“Commitment issues? We see you.”

Before we plot, we first reverse-engineer the one-hot encoded contract columns to get a readable version and then plot it.

df['ContractType'] = df[[
    'Contract_Month-to-month',
    'Contract_One year',
    'Contract_Two year'
]].idxmax(axis=1).str.replace('Contract_', '')

sns.countplot(x='ContractType', hue='Churn', data=df)
plt.title('Churn by Contract Type')
plt.show()

We’re checking what kind of contracts customers had and how often they churned.

What we see:

  • Month-to-month customers churn like it’s a trend.

  • Two-year contracts? Very little churn.

Contracts = commitment.
The less committed the contract, the more likely the customer is to peace out. (Relationship advice, anyone?)


4.2.4 Internet Service Type vs. Churn:

“No internet? No reason to stay.”

Same trick as before: first reverse-engineer the one-hot encoded internet service columns to get a readable version, then plot it.

df['InternetType'] = df[[
    'InternetService_DSL',
    'InternetService_Fiber optic',
    'InternetService_No'
]].idxmax(axis=1).str.replace('InternetService_', '')

sns.countplot(x='InternetType', hue='Churn', data=df)
plt.title('Churn by Internet Service Type')
plt.show()

What we see:

  • Customers with fiber optic internet churn more. Possibly because it’s more expensive.

  • People with no internet service don’t churn, probably because there’s not much to churn from.

Hypothesis?
People using basic DSL are content, fiber optic users might feel it’s not worth the cost, and non-users are probably just keeping the phone line active.


4.2.5 Payment Method vs. Churn:

“How you pay = how likely you’ll stay.”

You know the deal by now: reverse-engineer the one-hot encoded data, then plot.


df['PaymentType'] = df[[
    'PaymentMethod_Electronic check',
    'PaymentMethod_Mailed check',
    'PaymentMethod_Bank transfer (automatic)',
    'PaymentMethod_Credit card (automatic)'
]].idxmax(axis=1).str.replace('PaymentMethod_', '')

sns.countplot(x='PaymentType', hue='Churn', data=df)
plt.title('Churn by Payment Method')
plt.xticks(rotation=30)
plt.show()

We’re asking: does your payment method influence your loyalty?

What we see:

  • Electronic check users churn a lot more.

  • Auto payments (bank or credit) are associated with lower churn.

Insight: Friction in the payment process might push people away. Autopay = autopilot = they forget they’re even paying.


Our Dataset's Red Flags

Feature          | High Churn Risk Group
-----------------|----------------------
Tenure           | Short-term customers
Monthly Charges  | High spenders
Contract Type    | Month-to-month
Internet Type    | Fiber optic users
Payment Method   | Electronic check

4.3 Correlation Heatmap

Before we send our AI model off to make predictions, let’s eavesdrop on the relationships between features. Who’s secretly flirting with Churn behind the scenes?

Enter: The Correlation Heatmap

Think of it like this: we’ve invited all our features to a party. The correlation heatmap tells us who’s vibing (positively correlated), who’s got beef (negatively correlated), and who’s just awkwardly ignoring each other (no correlation).

The values range from:

  • +1: besties - as one goes up, so does the other

  • -1: sworn enemies - as one goes up, the other goes down

  • 0: total strangers

Colors help:

  • Red: strong positive bond

  • Blue: strong negative bond

  • White-ish: meh, neutral

But we care about Churn. So who’s whispering secrets in Churn’s ear?

plt.figure(figsize=(14,10))
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm', annot=True)
plt.title('Correlation Heatmap')
plt.show()

What the Heatmap Shows

Churn is:

Negatively correlated with:

  • Tenure (−0.35): People who stick around longer tend not to churn. Obvious, but good to confirm.

  • Contract_Two year and Contract_One year: Long contracts = less churn.

  • OnlineSecurity (−0.28) & TechSupport (−0.26): Customers with support/security are way less likely to leave.

Positively correlated with:

  • MonthlyCharges (+0.19): Higher bills lead to higher churn. No surprises there.

  • Contract_Month-to-month: These folks churn more, because they can.

Other Things Worth Noting:

  • Tenure and TotalCharges: very positively correlated (0.83) → makes sense: longer tenure, more total money.

  • OnlineSecurity, TechSupport, and DeviceProtection are strongly related to each other → they could be bundled services.

  • InternetService_Fiber optic has a positive correlation with churn, while InternetService_No has a negative correlation → people with fiber seem to leave more, maybe due to expectations vs reality?

This heatmap is your AI intuition cheat code:

  • It tells you which features actually matter for churn prediction.

  • Helps you avoid collinearity traps in modeling (don’t feed the model redundant info).

  • Gives you ammo for feature selection, hypothesis testing, and feature engineering.
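And if squinting at a 14-by-10 grid isn’t your thing, you can pull Churn’s column out of the correlation matrix and rank the gossip directly (a small sketch):

# Rank every numeric feature by its correlation with Churn
churn_corr = df.corr(numeric_only=True)['Churn'].sort_values(ascending=False)
print(churn_corr)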

Conclusion:

We started this journey with a messy pile of raw customer data, kinda like the "before" scene in every teen movie. But after some heavy-duty cleaning, encoding, and a glam makeover (thanks to EDA), our dataset is finally red-carpet ready for machine learning.

We’ve:

  • Ironed out missing values and fixed shady datatypes.

  • Trained our categorical rebels (objects) to behave using encoding.

  • Dug deep with EDA to spot what really drives customers to leave.

  • Found out long-term users and 2-year contract folks are loyal ride-or-dies.

  • Confirmed that high bills and no tech support are a recipe for churn drama.

  • And we visualized it all with heatmaps and boxplots that spilled all the tea.

Now, our data’s not just cleaned, it’s insightful. It knows who’s ghosting and why. So the next time we build a churn prediction model, we’ll have more than just numbers; we’ll have context, correlations, and clues.

In the next blog, we’ll move from exploration to prediction. We'll train a range of models to see how well we can forecast customer churn, evaluate their performance, and even interpret which features most influence churn behavior.

P.S. If you’re curious about the code behind all this magic, I’ve got it all up on [GitHub].
Go snoop, fork, or run it yourself!
