Getting Started with Pandas: The Key Functions You Need to Know

Pandas is a powerful Python library used for data analysis and manipulation. It is a crucial tool in Exploratory Data Analysis (EDA), which is a fundamental step in machine learning. Pandas offers numerous built-in functions that enable faster and more efficient data processing. In this article, we will be covering some of the most commonly used functions when starting to analyze datasets.

You can also experiment with the accompanying Kaggle Notebook, where you can run the examples on an existing dataset:

https://www.kaggle.com/code/muhammadfahadbashir/pandas-tutorial-basic-functions

The functions covered in this article are:

  1. Importing Pandas

  2. Creating a DataFrame

  3. Head & Tail (First and Last Rows)

  4. Info (Concise Summary of DataFrame Structure & Information)

  5. Describe (Descriptive Statistics of DataFrame)

  6. Selecting Columns

  7. Selecting Rows

  8. Filtering Data Based on Conditions (Single or Multiple Conditions)

  9. Adding a New Column

  10. Dropping a Column

  11. Dropping Null Values — dropna()

1. Importing Pandas

To use Pandas, we first need to import the library. This is typically done with:

import pandas as pd

2. Creating a DataFrame

A DataFrame is a tabular data structure with labeled axes (rows and columns). We can create a DataFrame from various data structures such as lists, dictionaries, or another DataFrame:

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32]}
df = pd.DataFrame(data)
print(df)
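
The other input types mentioned above work in a similar way. The following is a minimal sketch using the same sample names and ages; the variable names (rows, records, df_from_lists, and so on) are illustrative only:

```python
import pandas as pd

# From a list of lists: column names are passed explicitly.
rows = [['John', 28], ['Anna', 24], ['Peter', 35], ['Linda', 32]]
df_from_lists = pd.DataFrame(rows, columns=['Name', 'Age'])

# From a list of dictionaries: the keys become column names.
records = [{'Name': 'John', 'Age': 28}, {'Name': 'Anna', 'Age': 24}]
df_from_records = pd.DataFrame(records)

# From another DataFrame: copy() returns an independent object.
df_copy = df_from_lists.copy()

print(df_from_lists.shape)  # (4, 2)
print(df_from_records)
```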

3. Head & Tail (First and Last Rows)

To preview the first and last rows of the DataFrame, use the head() and tail() functions respectively. Each returns five rows by default, so on our four-row example both print the whole table:

print(df.head())
print(df.tail())
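
Both functions also accept the number of rows to return. A short sketch on the same sample data (the names first_two and last_one are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

first_two = df.head(2)  # rows for John and Anna
last_one = df.tail(1)   # row for Linda

print(first_two)
print(last_one)
```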

4. Info (Concise Summary of DataFrame Structure & Information)

The info() function provides a concise summary of the DataFrame, including the number of non-null entries and data types of each column:

df.info()  # info() prints the summary itself and returns None, so no print() is needed
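
Because info() writes its summary straight to standard output and returns None, capturing the text takes one extra step. A possible sketch, assuming you want the summary as a string; it also shows the related shape and dtypes attributes:

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Redirect the summary into an in-memory buffer instead of stdout.
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
print(result)     # None
print(df.shape)   # (4, 2): number of rows and columns
print(df.dtypes)  # data type of each column
```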

5. Describe (Descriptive Statistics of DataFrame)

The describe() function generates descriptive statistics for the numeric columns by default: the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% value is the median). NaN values are excluded:

print(df.describe())
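
The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. A small sketch on the sample data, with include='all' used to bring in the non-numeric Name column as well:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

stats = df.describe()            # numeric columns only by default
print(stats.loc['mean', 'Age'])  # 29.75
print(stats.loc['50%', 'Age'])   # 30.0 (the median)

# include='all' also summarizes non-numeric columns (count, unique, top, freq).
all_stats = df.describe(include='all')
print(all_stats.loc['unique', 'Name'])  # 4
```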

6. Selecting Columns

We can select a specific column from the DataFrame using its column name:

print(df['Name'])
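
Selecting several columns works the same way, except that a list of names is passed. A brief sketch showing the difference in the returned type (the added City column is only there to make the subset visible):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

ages = df['Age']              # a single name returns a Series
subset = df[['Name', 'Age']]  # a list of names returns a DataFrame

print(type(ages).__name__)    # Series
print(type(subset).__name__)  # DataFrame
print(subset)
```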

7. Selecting Rows

To select specific rows, use the iloc and loc indexers. iloc selects rows by integer position, while loc selects rows by index label:

print(df.iloc[0]) # First row (position 0)
print(df.loc[0]) # Row with index label 0

print(df.loc[1:3]) # Rows labeled 1 through 3; loc slices include both endpoints (3 rows in total)
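
With the default index, positions and labels happen to be the same numbers, which can hide the difference between the two. The sketch below uses a made-up index (10, 20, 30, 40) purely to make that difference visible:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]},
                  index=[10, 20, 30, 40])  # illustrative labels

print(df.iloc[0]['Name'])   # John (first position)
print(df.loc[10, 'Name'])   # John (label 10)

by_position = df.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = df.loc[20:40]    # labels 20, 30 and 40; the end label is included

print(by_position)
print(by_label)
```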

8. Filtering Data Based on Conditions

We can filter data based on specific conditions, using single or multiple criteria:

print(df[df['Age'] > 30]) # Single condition
print(df[(df['Age'] > 30) & (df['Name'] == 'Linda')]) # Multiple conditions
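
Conditions can also be combined with | (or), negated with ~ (not), or expressed as membership with isin(). A short sketch on the sample data; the result names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

# Either condition may hold; each one needs its own parentheses.
older_or_anna = df[(df['Age'] > 30) | (df['Name'] == 'Anna')]

# ~ inverts a condition: everyone who is NOT older than 30.
not_over_30 = df[~(df['Age'] > 30)]

# isin() keeps rows whose value appears in the given list.
selected = df[df['Name'].isin(['John', 'Linda'])]

print(older_or_anna)
print(not_over_30)
print(selected)
```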

9. Adding a New Column

To add a new column to the DataFrame, simply assign values to a new column name. When assigning a list, this only works if the length of the list matches the number of rows in the DataFrame:

df['City'] = ['New York', 'Paris', 'Berlin', 'London']
print(df)
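
A new column can also be derived from an existing one or filled with a single value. The sketch below shows both, along with the error raised when a list has the wrong length; the column names Age_in_5_years and Country are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32]})

df['Age_in_5_years'] = df['Age'] + 5  # computed from an existing column
df['Country'] = 'Unknown'             # one value is repeated for every row

try:
    df['City'] = ['New York', 'Paris']  # only 2 values for 4 rows
except ValueError as exc:
    print('Could not add column:', exc)

print(df)
```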

10. Dropping a Column

To drop a column, use the drop() function, specifying the column name and the axis:

df = df.drop('City', axis=1)
print(df)
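
drop() also accepts a columns keyword, which avoids having to remember the axis number. A brief sketch; the Salary column is deliberately one that does not exist, to show errors='ignore':

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, 24, 35, 32],
                   'City': ['New York', 'Paris', 'Berlin', 'London']})

without_city = df.drop(columns='City')        # same as df.drop('City', axis=1)
names_only = df.drop(columns=['Age', 'City'])  # drop several columns at once

# errors='ignore' skips names that are not present instead of raising KeyError.
unchanged = df.drop(columns='Salary', errors='ignore')

print(without_city.columns.tolist())  # ['Name', 'Age']
print(df.columns.tolist())            # original is untouched: ['Name', 'Age', 'City']
```

Note that drop() returns a new DataFrame, so the result has to be assigned to a variable to keep it.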

11. Dropping Null Values — dropna()

To remove rows with missing values, use the dropna() function:

df = df.dropna()
print(df)
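
The sample DataFrame used so far has no missing values, so dropna() leaves it unchanged. The sketch below uses a variant with two gaps added for illustration, and shows the subset and axis options:

```python
import pandas as pd

# Illustrative data: Anna's age and Peter's city are missing.
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'],
                   'Age': [28, float('nan'), 35, 32],
                   'City': ['New York', 'Paris', None, 'London']})

print(df.isna().sum())  # number of missing values per column

complete_rows = df.dropna()             # keep rows with no missing values
has_age = df.dropna(subset=['Age'])     # only check the Age column
complete_columns = df.dropna(axis=1)    # drop columns that contain any missing value

print(complete_rows)
print(has_age)
print(complete_columns)
```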

Final Remarks

Understanding and using these basic Pandas functions will significantly improve your ability to manipulate and analyze data efficiently. Mastering these foundational skills is essential for any data scientist or analyst working with Python.

Written by

Muhammad Fahad Bashir