Learning Pandas: From Clueless to Curious in One Read


Welcome to your ultimate starting guide for data analysis in Python! If you’ve ever wanted to explore large datasets and uncover hidden insights, you’re in the right place. Today, we’re diving deep into Pandas, the most essential Python library for data manipulation.
We’ll walk through everything from loading your first dataset to asking complex questions of it. We’ll use the real-world Stack Overflow Developer Survey as our playground, combining concepts from top tutorials with hands-on code you can run yourself. Let’s get started!
1) Getting Started — Loading & Inspecting Your First Dataset
Before we can analyze anything, we need data. The first step is always to load our data into a Pandas DataFrame. A DataFrame is the core of Pandas — think of it as a smart spreadsheet or a table with rows and columns.
First, let’s import the library (the as pd alias is a standard convention) and load our survey data from a CSV file.
import pandas as pd
# Load the main survey results
df = pd.read_csv('survey_results_public.csv')
Great! Our data is now in a DataFrame called df. But what does it look like? How big is it? Let's do some initial inspection.
- Check the size with .shape: This attribute shows you the dimensions in (rows, columns) format.
df.shape # Output: (88883, 85)
That’s quite a lot: 88,883 rows and 85 columns!
- Get a technical summary with .info(): This method gives a breakdown of each column, its data type, and how many non-null values it contains. It's perfect for a quick overview.
df.info()
- Look at just the part of the data you need with .head() and .tail(): You don't want to print all 88,883 rows. Use .head() to see the first few rows and .tail() to see the last few.
# See the first 5 rows
df.head()
# See the last 10 rows
df.tail(10)
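If you don't have the survey CSV at hand yet, the same inspection calls can be tried on a tiny stand-in DataFrame. This is a minimal sketch: the sample frame and its values are made up for illustration and are not real survey data.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: 12 rows, 3 columns
sample = pd.DataFrame({
    "Respondent": range(1, 13),
    "Hobbyist": ["Yes", "No"] * 6,
    "Country": ["India", "Germany", "United States"] * 4,
})

n_rows, n_cols = sample.shape   # .shape is an attribute, so no parentheses
first_five = sample.head()      # head() returns 5 rows unless told otherwise
last_three = sample.tail(3)     # pass a number to change how many rows you get

print(n_rows, n_cols)
print(first_five)
print(last_three)
```

Both head() and tail() return new DataFrames, so you can keep chaining other calls onto them.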
Pro Tip: With 85 columns, Pandas will hide some from view. To see them all, you can change the display options:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)
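pd.set_option changes the setting for the rest of your session. If you only want the wider view for a single printout, pd.option_context is one way to do it. The sketch below uses a made-up 30-column frame named wide; passing None as the value means "no limit".

```python
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 columns
wide = pd.DataFrame(
    [[0] * 30 for _ in range(3)],
    columns=[f"col_{i}" for i in range(30)],
)

default_max_columns = pd.get_option("display.max_columns")

# The option is changed only inside the with-block
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)  # every column is shown here

# Outside the block, the previous setting is back
restored = pd.get_option("display.max_columns")
```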
2) Selecting the Data You Need — Columns, Rows & Slicing 🎯
A DataFrame is just a collection of Series. You can think of a Series as a single column of data. Most of the time, you’ll want to work with specific columns or rows.
Selecting Columns
You can grab a single column (a Series) using bracket notation, just like with a Python dictionary. To select multiple columns, pass a list of column names. This will return a new, smaller DataFrame.
# Get the 'Hobbyist' column
df['Hobbyist']
# Get the Country and Education Level columns
df[['Country', 'EdLevel']]
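The difference between single and double brackets is easy to check yourself. Here is a small sketch on a made-up frame (the rows are illustrative, not survey data):

```python
import pandas as pd

# Hypothetical sample rows
sample = pd.DataFrame({
    "Country": ["India", "Germany", "Brazil"],
    "EdLevel": ["Bachelor", "Master", "Bachelor"],
    "Hobbyist": ["Yes", "No", "Yes"],
})

one_column = sample["Country"]                # single brackets: a Series
one_column_frame = sample[["Country"]]        # a list with one name: still a DataFrame
two_columns = sample[["Country", "EdLevel"]]  # a list of names: a smaller DataFrame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.columns.tolist())
```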
Selecting Rows with .loc and .iloc
Pandas gives us two primary ways to select rows:
- .iloc (integer location): Selects rows based on their integer position (e.g., the 1st row, 5th row, etc.).
- .loc (label location): Selects rows based on their index label.
Let’s see it in action:
# Get the first row of data using its integer position
df.iloc[0]
# Get the first three rows
df.iloc[0:3]
# Get the first row using its index label (which is also 0 by default)
df.loc[0]
# Get the first three rows by label (unlike .iloc, a .loc slice includes the end label)
df.loc[0:2]
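One detail worth trying out: an .iloc slice stops before its end position, while a .loc slice includes its end label. The sketch below uses made-up frames (sample and shifted are illustrative names) to show that, and also what happens once the labels no longer match the positions.

```python
import pandas as pd

# Hypothetical sample with the default 0..4 index
sample = pd.DataFrame({"Country": ["India", "Germany", "Brazil", "Japan", "Canada"]})

by_position = sample.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = sample.loc[0:3]      # labels 0, 1, 2, 3 (end included)

# Same data, but the labels are no longer 0, 1, 2, ...
shifted = pd.DataFrame(
    {"Country": ["India", "Germany", "Brazil", "Japan", "Canada"]},
    index=[10, 20, 30, 40, 50],
)

first_by_position = shifted.iloc[0]["Country"]  # first row, whatever its label
first_by_label = shifted.loc[10, "Country"]     # the row labelled 10

print(len(by_position), len(by_label))
print(first_by_position, first_by_label)
```

On the shifted frame, shifted.loc[0] would raise a KeyError because no row carries the label 0.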
Right now, .loc and .iloc look almost interchangeable because our default index is just integers. But what if we had a more meaningful index?
3) The Power of the Index — Making Data Searchable 🔎
The index is the identifier for each row. While the default integer index works, we can make our data much easier to search by setting a more meaningful index. The ‘Respondent’ column in our data contains a unique ID for each person. Let’s make that our index!
You can do this right when you load the data using the index_col argument, which saves you a separate set_index() call afterwards.
# Load data and set 'Respondent' as the index immediately
df = pd.read_csv('survey_results_public.csv', index_col='Respondent')
Now, our DataFrame is indexed by the respondent’s ID. This makes .loc incredibly powerful because we can now fetch rows by this unique ID.
# Get the full survey response for the person with Respondent ID 1
df.loc[1]
df.loc[1, 'Hobbyist'] # Narrow the lookup further to a single column for that respondent
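To try index_col without the real file, you can feed read_csv a small in-memory CSV. This is a sketch under stated assumptions: the three rows in csv_text are invented for illustration, and only the column names mirror the ones used in this article.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for survey_results_public.csv
csv_text = """Respondent,Hobbyist,Country
1,Yes,India
2,No,Germany
3,Yes,Brazil
"""

# index_col turns the 'Respondent' column into the row labels while loading
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Respondent")

row = indexed.loc[2]                # the whole row for Respondent 2 (a Series)
country = indexed.loc[2, "Country"] # a single cell: row label first, then column

print(indexed.index.name)
print(row)
print(country)
```

Note that 'Respondent' is no longer listed among the columns once it becomes the index.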
If you ever want to change the index back to the default, you can use reset_index(). To make your index easier to search, you can also sort it with sort_index().
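Here is a short sketch of both methods on a made-up, deliberately unsorted frame. Both return a new DataFrame and leave the original untouched unless you assign the result back.

```python
import pandas as pd

# Hypothetical frame indexed by an unsorted 'Respondent' ID
indexed = pd.DataFrame(
    {"Hobbyist": ["Yes", "No", "Yes"]},
    index=pd.Index([3, 1, 2], name="Respondent"),
)

ascending = indexed.sort_index()                  # labels in order 1, 2, 3
descending = indexed.sort_index(ascending=False)  # labels in order 3, 2, 1

# reset_index() moves 'Respondent' back into a regular column
# and restores the default 0, 1, 2, ... index
restored = indexed.reset_index()

print(ascending)
print(descending)
print(restored)
```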
4) Asking Questions — Filtering Your Data Like a Pro
This is where data analysis truly begins. Filtering is how we ask questions and pull out specific subsets of data. The process involves creating a “filter mask” (a Series of True/False values) and applying it to our DataFrame.
Let’s find all the developers from India.
- Create the filter mask: This line doesn’t return the data itself, but a Series where True marks a row where the 'Country' is 'India'.
filt = (df['Country'] == 'India')
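# Or build a mask on any other column, e.g. more than 5 years of coding experience.
# YearsCode holds text, so convert it to numbers first (non-numeric answers become NaN).
years_code = (pd.to_numeric(df['YearsCode'], errors='coerce') > 5)
- Apply the filter with .loc: Now, we use our filter inside .loc to get all the rows that match.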
df.loc[filt]
# You can also combine these filters
combined_filt = years_code & filt
# And print just the columns you care about
df.loc[combined_filt, ['Age', 'DevType', 'LanguageWorkedWith']]
And just like that, df.loc[filt] gives you a DataFrame containing only the survey respondents from India, and combined_filt narrows it to those with more than five years of coding experience.
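To see what a mask actually contains, it helps to print one. The sketch below uses a made-up four-row frame and assumes, as in the filter above, that YearsCode is stored as text and may hold non-numeric answers.

```python
import pandas as pd

# Hypothetical sample rows; YearsCode is text on purpose
sample = pd.DataFrame({
    "Country": ["India", "Germany", "India", "India"],
    "YearsCode": ["10", "7", "Less than 1 year", "3"],
})

# A mask is just a boolean Series aligned with the DataFrame's rows
filt = sample["Country"] == "India"

# Convert text to numbers; anything that is not a number becomes NaN,
# and NaN > 5 evaluates to False
years_numeric = pd.to_numeric(sample["YearsCode"], errors="coerce")
years_code = years_numeric > 5

combined = sample.loc[filt & years_code]

print(filt.tolist())
print(years_code.tolist())
print(combined)
```

Comparing the raw text against "5" would sort alphabetically instead ("10" comes before "5"), which is why the conversion step matters.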
Combining & Negating Filters
What if you have multiple conditions?
- Use the AND operator (&) when all conditions must be true.
- Use the OR operator (|) when at least one condition must be true.
- Use the tilde (~) to negate a filter (get everything that doesn't match).
Let’s find all the developers from the United States who are also hobbyist coders.
# Note the parentheses around each condition
us_hobbyist_filt = (df['Country'] == 'United States') & (df['Hobbyist'] == 'Yes')
df.loc[us_hobbyist_filt]
To get everyone not from the United States, we could do:
# The tilde negates the filter, so we get
# everyone outside the United States
df.loc[~(df['Country'] == 'United States')]
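All three operators can be compared side by side on a small frame. This is a sketch on made-up rows; it also shows why Python's plain and keyword can't stand in for &.

```python
import pandas as pd

# Hypothetical sample rows
sample = pd.DataFrame({
    "Country": ["United States", "India", "United States", "Germany"],
    "Hobbyist": ["Yes", "Yes", "No", "No"],
})

is_us = sample["Country"] == "United States"
is_hobbyist = sample["Hobbyist"] == "Yes"

both = sample.loc[is_us & is_hobbyist]    # AND: both conditions hold
either = sample.loc[is_us | is_hobbyist]  # OR: at least one holds
not_us = sample.loc[~is_us]               # NOT: flip every True/False

# The keywords 'and' / 'or' work on single truth values, not element-wise,
# so pandas refuses to guess and raises a ValueError
keyword_and_failed = False
try:
    is_us and is_hobbyist
except ValueError:
    keyword_and_failed = True

print(both.index.tolist(), either.index.tolist(), not_us.index.tolist())
```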
Advanced Filtering
- Filtering by a list with .isin(): To find respondents from a list of countries (e.g., India, Germany, or the UK), .isin() is much cleaner than a long OR chain.
countries = ['India', 'Germany', 'United Kingdom']
country_filt = df['Country'].isin(countries)
df.loc[country_filt]
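To convince yourself that .isin() really matches the long OR chain, you can build both masks and compare them. A minimal sketch on made-up rows, including one missing value:

```python
import pandas as pd

countries = ["India", "Germany", "United Kingdom"]

# Hypothetical sample rows, one of them missing
sample = pd.DataFrame(
    {"Country": ["India", "France", "Germany", None, "United Kingdom"]}
)

with_isin = sample["Country"].isin(countries)

# The equivalent OR chain grows with every country you add
with_or = (
    (sample["Country"] == "India")
    | (sample["Country"] == "Germany")
    | (sample["Country"] == "United Kingdom")
)

print(with_isin.tolist())
print(with_or.tolist())
```

The missing entry simply comes out as False in both masks.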
- Filtering strings with .str.contains(): Want to find every respondent who mentioned 'Python' in their LanguageWorkedWith response? .str.contains() is perfect for this.
# na=False handles any missing values to avoid errors
python_filt = df['LanguageWorkedWith'].str.contains('Python', na=False)
df.loc[python_filt]
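Two optional arguments of .str.contains() are worth knowing about: case=False ignores capitalization, and regex=False treats the search text literally (useful for something like 'C++', where + has a special meaning in regular expressions). The sketch below uses made-up answers in the survey's semicolon-separated style.

```python
import pandas as pd

# Hypothetical answers, one of them missing
sample = pd.DataFrame({
    "LanguageWorkedWith": ["HTML/CSS;Python;SQL", "C++;Java", None, "python;R"],
})
langs = sample["LanguageWorkedWith"]

# Case-sensitive by default; na=False turns missing answers into False
python_filt = langs.str.contains("Python", na=False)

# case=False also catches the lowercase 'python'
python_any_case = langs.str.contains("python", case=False, na=False)

# regex=False searches for the literal text 'C++'
cpp_filt = langs.str.contains("C++", regex=False, na=False)

print(python_filt.tolist())
print(python_any_case.tolist())
print(cpp_filt.tolist())
```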
Conclusion
And there you have it! You’ve gone from loading a raw CSV file to inspecting it, selecting specific data, creating a powerful index, and asking complex questions with advanced filtering. These are the fundamental building blocks of almost every data analysis project you’ll ever encounter.
The best way to learn is by doing. Try asking your own questions of the Stack Overflow dataset. With Pandas, you now have the tools to find out. Stay tuned for Part 2, where we’ll cover modifying data, handling missing values, and much more. Happy coding!
Written by

Priyesh Shah
Hi there👋 I'm Priyesh Shah 💻 B.Tech Computer Engineering student (2028) 🐍 Python, exploring AI/ML and open-source 🚀 Building projects to contribute to GSoC 2026