Converting CSV to DataFrame in Python
The article is maintained by the team at commabot.
To convert a CSV file to a DataFrame in Python, we can use the pandas library. Here's a step-by-step guide to doing this:
Install Pandas: If you haven't already installed pandas, you can do so by running the following command in your terminal or command prompt:
pip install pandas
Read the CSV File: Use the pd.read_csv() function to read your CSV file and convert it into a DataFrame. You need to specify the path to your CSV file as the function's first argument. Optionally, you can specify other parameters to handle different data formats, such as the delimiter, column names, and encoding.
Basic usage:
import pandas as pd
df = pd.read_csv('path/to/your/file.csv')
With optional parameters (for example, specifying a delimiter and skipping rows):
df = pd.read_csv('path/to/your/file.csv', delimiter=';', skiprows=1)
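To make the optional parameters more concrete, here is a minimal sketch. It assumes a hypothetical semicolon-delimited export with one junk line at the top and no header row; the data and the column names are made up for illustration, and io.StringIO stands in for a file on disk so the snippet runs as-is.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then semicolon-separated
# rows with no header row.
raw = "exported by some tool\nJohn Doe;28;50000\nJane Smith;31;55000\n"

df = pd.read_csv(
    io.StringIO(raw),   # a real path such as 'path/to/your/file.csv' works the same way
    sep=";",            # equivalent to delimiter=';'
    skiprows=1,         # skip the junk line at the top
    header=None,        # the remaining lines contain data only
    names=["Name", "Age", "Salary"],  # hypothetical column names we supply ourselves
)
print(df)
```

When reading from a path rather than a string, an encoding argument such as encoding='utf-8' can be passed in the same call if the file is not in the default encoding.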
Use the DataFrame: Once you have read the CSV into a DataFrame, you can start using pandas to analyze and manipulate your data. For example, you can view the first few rows of the DataFrame with df.head().
# Import pandas library
import pandas as pd
# Read the CSV file
df = pd.read_csv('path/to/your/file.csv')
# Display the first 5 rows of the DataFrame
print(df.head())
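Beyond df.head(), a few other attributes and methods are handy for a first look at a freshly loaded DataFrame. The sketch below uses a small made-up table (the item and quantity columns are hypothetical) read from a string instead of a file.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
raw = "item,quantity\napple,3\nbanana,5\ncherry,7\n"
df = pd.read_csv(io.StringIO(raw))

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header row
print(df.head(2))        # first two rows instead of the default five
print(df.describe())     # summary statistics for the numeric columns
```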
Now let's look at a more comprehensive example. Imagine you have a CSV file named sample_data.csv with the following content:
Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing
Notice that some data points are missing (indicated by empty fields).
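A quick way to confirm where the gaps are is to count missing values per column after loading. This is a small sketch that reads the same sample rows from a string (an assumption made so it runs without sample_data.csv on disk).

```python
import io
import pandas as pd

# The sample rows from above, held in a string instead of a file.
csv_text = """Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN; count them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Columns with missing numbers are stored as floats, not integers.
print(df.dtypes)
```

Here Age and Salary each contain one missing value, which is also why pandas reads both columns as floating-point numbers by default.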
Let's write a script that does the following:
Read CSV: The script starts by reading the sample_data.csv file into a DataFrame.
Handle Missing Values: It fills missing Age values with the average age and drops rows where Salary is missing.
Filter Data: It creates a new DataFrame (marketing_df) containing only rows for employees in the Marketing department.
Compute Statistics: It calculates the average, maximum, and minimum salaries across the cleaned dataset.
Save to CSV: Finally, it saves the cleaned DataFrame (after handling missing values, but without the Marketing filter) to a new CSV file named modified_sample_data.csv, excluding the index column.
import pandas as pd
# Read the CSV file
df = pd.read_csv('sample_data.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())
# Handling missing values
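# Fill missing ages with the average age
# (assign the result back: in-place fillna on a selected column is
# deprecated in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())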
# Drop rows where 'Salary' is missing
df.dropna(subset=['Salary'], inplace=True)
# Filter data: Select only employees in the Marketing department
marketing_df = df[df['Department'] == 'Marketing']
# Compute basic statistics for the Salary column
average_salary = df['Salary'].mean()
max_salary = df['Salary'].max()
min_salary = df['Salary'].min()
print("\nFiltered DataFrame (Marketing Department):")
print(marketing_df)
print("\nSalary Statistics:")
print(f"Average Salary: {average_salary}")
print(f"Maximum Salary: {max_salary}")
print(f"Minimum Salary: {min_salary}")
# Save the modified DataFrame to a new CSV file
df.to_csv('modified_sample_data.csv', index=False)
This example demonstrates basic data manipulation tasks with pandas, including cleaning data, filtering based on conditions, and computing statistics, which are common steps in data analysis workflows.
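As a quick check of what the script produces, the sketch below repeats the same steps on the sample rows held in a string and writes the result to an in-memory buffer. Using io.StringIO in place of sample_data.csv and modified_sample_data.csv is an assumption made so that nothing is read from or written to disk; the assignment form of fillna and dropna is used instead of inplace=True.

```python
import io
import pandas as pd

# The sample rows from above, held in a string instead of a file.
csv_text = """Name,Age,Salary,Department
John Doe,28,50000,Marketing
Jane Smith,,55000,Finance
Emily Jones,22,,HR
Michael Brown,30,60000,IT
Alex Johnson,25,52000,Marketing
"""

df = pd.read_csv(io.StringIO(csv_text))

# Same cleaning steps as the script.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df = df.dropna(subset=['Salary'])

# Same filter and statistics as the script.
marketing_df = df[df['Department'] == 'Marketing']
average_salary = df['Salary'].mean()

print(df['Age'].tolist())
print(marketing_df['Name'].tolist())
print(average_salary)

# Write to an in-memory buffer and read it back to check the round trip.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.shape)
```

With the five sample rows, Jane Smith's age is filled with 26.25 (the mean of 28, 22, 30, and 25), Emily Jones is dropped because her salary is missing, and the average salary of the four remaining rows is 54250. The saved data has four rows and four columns because index=False leaves out the index column.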