Learning Pandas: From Clueless to Curious in One Read

Priyesh Shah

Welcome to your ultimate starting guide for data analysis in Python! If you’ve ever wanted to explore large datasets and uncover hidden insights, you’re in the right place. Today, we’re diving deep into Pandas, the most essential Python library for data manipulation.

We’ll walk through everything from loading your first dataset to asking complex questions of it. We’ll use the real-world Stack Overflow Developer Survey as our playground, combining concepts from top tutorials with hands-on code you can run yourself. Let’s get started!

1) Getting Started — Loading & Inspecting Your First Dataset

Before we can analyze anything, we need data. The first step is always to load our data into a Pandas DataFrame. A DataFrame is the core of Pandas — think of it as a smart spreadsheet or a table with rows and columns.
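To make the rows-and-columns picture concrete before we touch a real file, here's a tiny DataFrame built by hand from a plain dictionary (the column names and values are invented for illustration; the `import` is explained just below):

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
people = pd.DataFrame({
    'name': ['Asha', 'Ben', 'Chen'],
    'country': ['India', 'United States', 'Germany'],
})
print(people)
```

Every column shares the same row labels (the index), which is what makes a DataFrame more than a plain table.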

First, let’s import the library (the as pd is a standard convention) and load our survey data from a CSV file.

import pandas as pd
# Load the main survey results
df = pd.read_csv('survey_results_public.csv')

Great! Our data is now in a DataFrame called df. But what does it look like? How big is it? Let's do some initial inspection.

  • Check the size with .shape: This attribute shows you the dimensions in a (rows, columns) format.
df.shape # Output: (88883, 85)
  • That’s quite a lot: 88,883 rows and 85 columns!

  • Get a technical summary with .info(): This method gives a breakdown of each column, its data type, and how many non-null values it contains. It's perfect for a quick overview.

df.info()
  • Peek at the data with .head() and .tail(): You don't want to print all 88,000 rows. Use .head() to see the first few rows and .tail() to see the last few.
# See the first 5 rows 
df.head()  
# See the last 10 rows 
df.tail(10)

Pro Tip: With 85 columns, Pandas will hide some from view. To see them all, you can change the display options:

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)
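If you don't have the survey CSV handy, you can try the same inspection calls on a small throwaway DataFrame (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': range(10), 'b': list('abcdefghij')})

print(df.shape)    # (10, 2): ten rows, two columns
print(df.head())   # first 5 rows by default
print(df.tail(3))  # last 3 rows
```

The same three calls work identically on the 88,883-row survey DataFrame.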

2) Selecting the Data You Need — Columns, Rows & Slicing 🎯

A DataFrame is just a collection of Series. You can think of a Series as a single column of data. Most of the time, you’ll want to work with specific columns or rows.

Selecting Columns

You can grab a single column (a Series) using bracket notation, just like with a Python dictionary. To select multiple columns, pass a list of column names. This will return a new, smaller DataFrame.

# Get the 'Hobbyist' column
df['Hobbyist']
# Get the Country and Education Level columns
df[['Country', 'EdLevel']]
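A detail worth internalizing: a single name in brackets returns a Series, while a list of names (even a one-item list) returns a DataFrame. A quick check on toy data (values invented for the demo):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['India', 'Germany'], 'EdLevel': ['BSc', 'MSc']})

print(type(df['Country']))               # a Series
print(type(df[['Country', 'EdLevel']]))  # a DataFrame
```

This distinction matters because Series and DataFrame expose slightly different methods.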

Selecting Rows with .loc and .iloc

Pandas gives us two primary ways to select rows:

  • .iloc (integer location): Selects rows based on their integer position (e.g., the 1st row, 5th row, etc.).

  • .loc (label location): Selects rows based on their index label.

Let’s see it in action:

# Get the first row of data using its integer position
df.iloc[0]
# Get the first three rows
df.iloc[0:3]
# Get the first row using its index label (which is also 0 by default)
df.loc[0]
# Get rows with labels 0 through 3 (note: .loc slices are inclusive, so this returns four rows)
df.loc[0:3]

Right now, .loc and .iloc seem to do the same thing because our default index is just integers. But what if we had a more meaningful index?
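To preview the difference, here's a toy DataFrame whose index is a set of string labels (names invented for the demo). Now .iloc still counts positions, while .loc looks up labels:

```python
import pandas as pd

scores = pd.DataFrame({'score': [90, 75, 82]}, index=['asha', 'ben', 'chen'])

print(scores.iloc[0])            # first row by position -> asha's row
print(scores.loc['ben'])         # row whose index label is 'ben'
# Label slicing with .loc includes both endpoints
print(scores.loc['asha':'ben'])  # two rows: asha and ben
```

With a label index, `scores.loc[0]` would raise a KeyError, which is exactly the point: the two accessors answer different questions.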

3) The Power of the Index — Making Data Searchable 🔎

The index is the identifier for each row. While the default integer index works, we can make our data much easier to search by setting a more meaningful index. The ‘Respondent’ column in our data contains a unique ID for each person. Let’s make that our index!

You can do this right when you load the data using the index_col argument. This is super efficient.

# Load data and set 'Respondent' as the index immediately
df = pd.read_csv('survey_results_public.csv', index_col='Respondent')

Now, our DataFrame is indexed by the respondent’s ID. This makes .loc incredibly powerful because we can now fetch rows by this unique ID.

# Get the full survey response for the person with Respondent ID 1
df.loc[1]
# Narrow the search further: row label 1, column 'Hobbyist'
df.loc[1, 'Hobbyist']

If you ever want to change the index back to the default, you can use reset_index(). To make your index easier to search, you can also sort it with sort_index().
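A minimal sketch of both helpers on toy data. Note that by default each returns a new DataFrame rather than modifying the original (pass `inplace=True` to change that):

```python
import pandas as pd

df = pd.DataFrame({'val': [3, 1, 2]}, index=[30, 10, 20])

sorted_df = df.sort_index()  # rows reordered by index label: 10, 20, 30
reset_df = df.reset_index()  # old index becomes a column; a fresh 0..n-1 index takes its place

print(sorted_df.index.tolist())  # [10, 20, 30]
print(reset_df.index.tolist())   # [0, 1, 2]
```

A sorted index also makes label-based slicing faster and more predictable.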

4) Asking Questions — Filtering Your Data Like a Pro

This is where data analysis truly begins. Filtering is how we ask questions and pull out specific subsets of data. The process involves creating a “filter mask” — a Series of True/False values — and applying it to our DataFrame.

Let’s find all the developers from India.

  • Create the filter mask: This line doesn’t return the data itself, but a Series where True marks a row where the 'Country' is 'India'.
filt = (df['Country'] == 'India')
# Another example: respondents with more than 5 years of coding experience.
# 'YearsCode' is stored as text, so convert it to numbers before comparing
years_code = (pd.to_numeric(df['YearsCode'], errors='coerce') > 5)
  • Apply the filter with .loc: Now, we use our filter inside .loc to get all the rows that match.
df.loc[filt]

# You can also combine these filters
combined_filt = years_code & filt 
# And Print these with the specific details you want from the dataframe
df.loc[combined_filt, ['Age', 'DevType', 'LanguageWorkedWith']]
  • And just like that, you have a DataFrame containing only the survey respondents from India!
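The same mask-then-select pattern, end to end on a small invented DataFrame so you can see both the mask and the result:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'Germany', 'India'],
    'Age': [25, 31, 40],
})

filt = (df['Country'] == 'India')  # Boolean Series: True, False, True
india_ages = df.loc[filt, 'Age']   # only the Age column, only matching rows

print(filt.tolist())        # [True, False, True]
print(india_ages.tolist())  # [25, 40]
```

The second argument to `.loc` selects columns, so one call filters rows and trims columns at once.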

Combining & Negating Filters

What if you have multiple conditions?

  • Use the AND operator (&) when all conditions must be true.

  • Use the OR operator (|) when at least one condition must be true.

  • Use the tilde (~) to negate a filter (get everything that doesn't match).

Let’s find all the developers from the United States who are also hobbyist coders.

# Note the parentheses around each condition
us_hobbyist_filt = (df['Country'] == 'United States') & (df['Hobbyist'] == 'Yes')
df.loc[us_hobbyist_filt]

To get everyone not from the United States, we could do:

# The ~ negates the filter, so we get everyone
# other than respondents from the United States
df.loc[~(df['Country'] == 'United States')]
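On toy data (invented names), you can see that the tilde flips every value in the mask:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['United States', 'India', 'Canada']})

us = (df['Country'] == 'United States')
print((~us).tolist())                   # [False, True, True]
print(df.loc[~us, 'Country'].tolist())  # ['India', 'Canada']
```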

Advanced Filtering

  • Filtering by a list with .isin(): To find respondents from a list of countries (e.g., India, Germany, or the UK), .isin() is much cleaner than a long OR chain.
countries = ['India', 'Germany', 'United Kingdom'] 
country_filt = df['Country'].isin(countries) 
df.loc[country_filt]
  • Filtering strings with .str.contains(): Want to find every respondent who mentioned 'Python' in their LanguageWorkedWith response? .str.contains() is perfect for this.
# na=False handles any missing values to avoid errors 
python_filt = df['LanguageWorkedWith'].str.contains('Python', na=False) 
df.loc[python_filt]
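Both helpers in one self-contained sketch (data invented for the demo), including a missing value to show why `na=False` matters:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['India', 'France', 'Germany'],
    'LanguageWorkedWith': ['Python;C++', None, 'Java;Python'],
})

country_filt = df['Country'].isin(['India', 'Germany'])
# na=False treats the missing response as "no match" instead of propagating NaN
python_filt = df['LanguageWorkedWith'].str.contains('Python', na=False)

print(df.loc[country_filt & python_filt, 'Country'].tolist())  # ['India', 'Germany']
```

Since both masks are Boolean Series, they combine with `&`, `|`, and `~` just like the simpler filters above.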

Conclusion

And there you have it! You’ve gone from loading a raw CSV file to inspecting it, selecting specific data, creating a powerful index, and asking complex questions with advanced filtering. These are the fundamental building blocks of almost every data analysis project you’ll ever encounter.

The best way to learn is by doing. Try asking your own questions of the Stack Overflow dataset. With Pandas, you now have the tools to find out. Stay tuned for Part 2, where we’ll cover modifying data, handling missing values, and much more. Happy coding!
