Getting Started with Pandas: The Key Functions You Need to Know
Pandas is a powerful Python library used for data analysis and manipulation. It is a crucial tool in Exploratory Data Analysis (EDA), which is a fundamental step in machine learning. Pandas offers numerous built-in functions that enable faster and more efficient data processing. In this article, we will be covering some of the most commonly used functions when starting to analyze datasets.
You can also experiment with the accompanying Kaggle Notebook, where you can run the code on a dataset that is already loaded:
https://www.kaggle.com/code/muhammadfahadbashir/pandas-tutorial-basic-functions
This article covers the following functions:
Importing Pandas
Creating a DataFrame
Head & Tail (First and Last Rows)
Info (Concise Summary of DataFrame Structure & Information)
Describe (Descriptive Statistics of DataFrame)
Selecting Columns
Selecting Rows
Filtering Data Based on Conditions (Single or Multiple Conditions)
Adding a New Column
Dropping a Column
Dropping Null Values (dropna())
1. Importing Pandas
To use Pandas, we first need to import the library. This is typically done with the alias pd:
import pandas as pd
2. Creating a DataFrame
A DataFrame is a tabular data structure with labeled axes (rows and columns). We can create a DataFrame from various data structures such as lists, dictionaries, or another DataFrame:
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32]}
df = pd.DataFrame(data)
print(df)
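The example above builds the DataFrame from a dictionary. As a minimal sketch of the other sources mentioned (lists and another DataFrame), the following uses illustrative variable names that are not part of the original example:

```python
import pandas as pd

# A list of dictionaries: each dictionary is one row, keys become column names.
rows = [
    {'Name': 'John', 'Age': 28},
    {'Name': 'Anna', 'Age': 24},
]
df_from_rows = pd.DataFrame(rows)

# A list of lists carries no column names, so they are passed explicitly.
df_from_lists = pd.DataFrame([['Peter', 35], ['Linda', 32]],
                             columns=['Name', 'Age'])

# An existing DataFrame can be passed to the constructor; copy=True asks for
# an independent copy of the data.
df_copy = pd.DataFrame(df_from_rows, copy=True)

print(df_from_rows)
print(df_from_lists)
print(df_copy)
```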
3. Head & Tail (First and Last Rows)
To preview the first and last rows of the DataFrame, use the head() and tail() functions respectively. Both return five rows by default:
print(df.head())
print(df.tail())
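Both functions also accept the number of rows to return. A short sketch on the same sample data (the names first_two and last_one are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Pass a row count to override the default of five rows.
first_two = df.head(2)   # first two rows
last_one = df.tail(1)    # last row only

print(first_two)
print(last_one)
```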
4. Info (Concise Summary of DataFrame Structure & Information)
The info() function provides a concise summary of the DataFrame, including the number of non-null entries and data types of each column:
df.info()  # info() prints the summary itself and returns None, so print() is not needed
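Because info() writes its summary directly to the output and returns None, wrapping it in print() adds an extra "None" line. If the summary is needed as text, one possible approach is to pass a buffer through the buf parameter, as in this sketch (the variable names are illustrative):

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Write the summary into an in-memory text buffer instead of standard output.
buffer = io.StringIO()
result = df.info(buf=buffer)

summary = buffer.getvalue()  # the summary as a plain string
print(summary)
print(result)  # info() itself returns None
```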
5. Describe (Descriptive Statistics of DataFrame)
The describe() function generates descriptive statistics for the numeric columns, including the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), excluding NaN values:
print(df.describe())
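On the sample data, describe() reports only the Age column, because Name holds text. A small sketch of reading individual statistics and of including non-numeric columns with include='all':

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# By default only numeric columns are summarized.
stats = df.describe()
print(stats)

# The result is itself a DataFrame, so single values can be looked up by label.
mean_age = stats.loc['mean', 'Age']
median_age = stats.loc['50%', 'Age']
print(mean_age, median_age)

# include='all' also summarizes text columns (count, unique, top, freq).
all_stats = df.describe(include='all')
print(all_stats)
```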
6. Selecting Columns
We can select a specific column from the DataFrame using its column name:
print(df['Name'])
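Selecting with a single column name returns a Series. To select several columns, or to keep a DataFrame as the result, a list of names can be passed instead. A brief sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Single brackets with one name give a Series.
ages = df['Age']

# A list of names inside the brackets gives a DataFrame.
subset = df[['Name', 'Age']]
ages_frame = df[['Age']]  # still a DataFrame, with one column

print(type(ages).__name__)
print(type(subset).__name__)
print(type(ages_frame).__name__)
```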
7. Selecting Rows
To select specific rows, use the iloc and loc indexers. iloc is used for integer-position-based indexing, while loc is used for label-based indexing:
print(df.iloc[0]) # First row
print(df.loc[0]) # Row with index 0
print(df.loc[1:2]) # Rows labeled 1 through 2; loc slices include both ends (2 rows)
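With the default index, positions and labels are the same numbers, so iloc and loc can look interchangeable. The difference is easier to see with a non-numeric index. The sketch below assumes a hypothetical index of letters, which is not part of the original example:

```python
import pandas as pd

# Hypothetical string labels make positions and labels clearly different.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]},
                  index=['a', 'b', 'c', 'd'])

first_by_position = df.iloc[0]   # row at position 0
first_by_label = df.loc['a']     # row with label 'a'

# iloc slices exclude the end position, as regular Python slices do.
by_position = df.iloc[1:3]       # positions 1 and 2

# loc slices include the end label.
by_label = df.loc['b':'c']       # labels 'b' and 'c'

print(by_position)
print(by_label)
```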
8. Filtering Data Based on Conditions
We can filter data based on specific conditions, using single or multiple criteria:
print(df[df['Age'] > 30]) # Single condition
print(df[(df['Age'] > 30) & (df['Name'] == 'Linda')]) # Multiple conditions
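The multiple-condition example combines conditions with & (and). As a short sketch, conditions can also be combined with | (or), negated with ~, or written with isin() to match against a list of values. Each condition needs its own parentheses:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# OR: rows where either condition holds.
either = df[(df['Age'] < 25) | (df['Age'] > 33)]

# NOT: rows where the condition does not hold.
not_over_30 = df[~(df['Age'] > 30)]

# isin: rows whose value appears in a list.
named = df[df['Name'].isin(['John', 'Linda'])]

print(either)
print(not_over_30)
print(named)
```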
9. Adding a New Column
To add a new column to the DataFrame, you can simply assign values to a new column name. This method only works if the length of the list matches the number of rows in the DataFrame:
df['City'] = ['New York', 'Paris', 'Berlin', 'London']
print(df)
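A new column can also be computed from an existing one or filled with a single value. The sketch below also shows what happens when a list has the wrong length. The column names AgeNextYear and Country are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Compute a column from an existing one; the operation applies row by row.
df['AgeNextYear'] = df['Age'] + 1

# A single value is repeated for every row.
df['Country'] = 'Unknown'

# A list with the wrong number of items raises a ValueError.
mismatch_raised = False
try:
    df['City'] = ['New York', 'Paris']  # 2 values for 4 rows
except ValueError as err:
    mismatch_raised = True
    print('Could not add column:', err)

print(df)
```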
10. Dropping a Column
To drop a column, use the drop() function, specifying the column name and the axis (axis=1 refers to columns):
df = df.drop('City', axis=1)
print(df)
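drop() returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to df above. As a sketch, the columns parameter is an alternative to passing axis=1 and also accepts a list of names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

# columns= names what to drop without needing axis=1.
trimmed = df.drop(columns='City')

# A list drops several columns at once.
names_only = df.drop(columns=['City', 'Age'])

print(trimmed)
print(names_only)
print(df.columns.tolist())  # the original still has all three columns
```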
11. Dropping Null Values (dropna())
To remove rows with missing values, use the dropna() function. The example DataFrame has no missing values, so the output here is unchanged:
df = df.dropna()
print(df)
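To see dropna() remove something, the data needs missing values. The sketch below uses a modified copy of the sample data with two gaps added for illustration, and also shows the subset parameter, which limits the check to chosen columns:

```python
import pandas as pd

# Sample data with two missing values added for illustration.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, None, 35, 32],
                   'City': ['New York', 'Paris', None, 'London']})

# Drop every row that has a missing value in any column.
complete = df.dropna()

# Drop only the rows where Age is missing; gaps in other columns are kept.
age_known = df.dropna(subset=['Age'])

print(complete)
print(age_known)
```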
Final Remarks
Understanding and using these basic Pandas functions will significantly enhance your ability to manipulate and analyze data efficiently. Mastering these foundational skills is essential for any data scientist or analyst working with Python.
Written by Muhammad Fahad Bashir