20 Polars Concepts with Before-and-After Examples
Table of contents
- 1. Creating DataFrames (pl.DataFrame)
- 2. Selecting Columns (df.select)
- 3. Filtering Rows (df.filter)
- 4. Adding Columns (df.with_columns)
- 5. Grouping Data (df.group_by)
- 6. Sorting Data (df.sort)
- 7. Joining DataFrames (df.join)
- 8. Pivoting Data (df.pivot)
- 9. Lazy Evaluation (df.lazy)
- 10. Exploratory Data Analysis (df.describe)
- 11. Renaming Columns (df.rename)
- 12. Handling Null Values (df.fill_null)
- 13. Concatenating DataFrames (pl.concat)
- 14. Window Functions (df.select with .over)
- 15. Date and Time Manipulation (pl.date_range, pl.col.dt)
- 16. Cumulative Operations (df.select with .cum_sum)
- 17. Melt / Unpivot (Wide to Long Format)
- 18. Pivot (Long to Wide Format)
- 19. Exploding Columns (df.explode)
- 20. Reversing Columns (df.select with .reverse)
1. Creating DataFrames (pl.DataFrame)
Boilerplate Code:
import polars as pl
Use Case: Create a DataFrame to hold your data, similar to pandas.
Goal: Store and manipulate data in a high-performance DataFrame structure.
Sample Code:
# Create a DataFrame
df = pl.DataFrame({
"column1": [1, 2, 3],
"column2": ["A", "B", "C"]
})
# Now you have a DataFrame with two columns!
Before Example: The intern has raw data but no structure to manipulate it easily.
Data: [1, 2, 3], ["A", "B", "C"]
After Example: With pl.DataFrame(), the data is structured and ready to work with!
DataFrame:
shape: (3, 2)
┌─────────┬─────────┐
│ column1 │ column2 │
├─────────┼─────────┤
│ 1       │ A       │
│ 2       │ B       │
│ 3       │ C       │
└─────────┴─────────┘
Challenge: Try creating a DataFrame with more columns and different data types.
2. Selecting Columns (df.select)
Boilerplate Code:
df.select("column_name")
Use Case: Select specific columns from your DataFrame to work with or analyze.
Goal: Extract just the columns you need from the DataFrame.
Sample Code:
# Select a single column
df.select("column1")
# Select multiple columns
df.select(["column1", "column2"])
Before Example: The intern has a large DataFrame but only needs specific columns for the analysis.
DataFrame: all columns.
After Example: With df.select(), only the needed columns are extracted!
Selected Columns: ["column1"]
Challenge: Try selecting columns that match a specific pattern, for example with a regular expression such as "^column.*$".
3. Filtering Rows (df.filter)
Boilerplate Code:
df.filter(pl.col("column_name") > value)
Use Case: Filter rows based on a condition to narrow down your data.
Goal: Extract only the rows that meet certain criteria.
Sample Code:
# Filter rows where column1 > 1
df.filter(pl.col("column1") > 1)
Before Example: The intern has a DataFrame with all the rows but only needs the subset that matches specific criteria.
DataFrame: all rows.
After Example: With df.filter(), only the rows that meet the condition are included!
Filtered DataFrame: Rows where column1 > 1.
Challenge: Try filtering using multiple conditions (e.g., column1 > 1 and column2 == "B").
4. Adding Columns (df.with_columns)
Boilerplate Code:
df.with_columns([(pl.col("existing_column") * 2).alias("new_column")])
Use Case: Add a new column based on existing columns in the DataFrame.
Goal: Extend your DataFrame by creating new columns derived from existing data.
Sample Code:
# Add a new column that multiplies column1 by 2
# (without .alias(), the result would overwrite column1 instead)
df.with_columns([
(pl.col("column1") * 2).alias("new_column")
])
Before Example: The intern needs to add a new column but doesn't know how.
DataFrame: column1, column2
After Example: With df.with_columns(), the DataFrame now has an additional column!
New Column: 'new_column' added.
Challenge: Try adding a column that combines values from multiple existing columns.
5. Grouping Data (df.group_by)
Boilerplate Code:
df.group_by("column_name").agg([pl.col("another_column").mean()])
Use Case: Group data by a column and apply aggregate functions like mean, sum, or count.
Goal: Summarize data by grouping similar entries and applying calculations.
Sample Code:
# Group by column2 and calculate the mean of column1
df.group_by("column2").agg([
pl.col("column1").mean()
])
Before Example: The intern has unsummarized data and wants to compute statistics for each group.
DataFrame: ungrouped data.
After Example: With df.group_by(), the data is grouped, and the mean is calculated for each group!
Grouped DataFrame: mean of column1 by column2.
Challenge: Try grouping by multiple columns and applying multiple aggregate functions.
6. Sorting Data (df.sort)
Boilerplate Code:
df.sort("column_name", descending=True)
Use Case: Sort the DataFrame by one or more columns, either in ascending or descending order.
Goal: Rearrange your data for better visualization or analysis.
Sample Code:
# Sort by column1 in descending order
df.sort("column1", descending=True)
Before Example: The intern's data is unsorted, making it harder to analyze.
DataFrame: unsorted.
After Example: With df.sort(), the data is sorted in the desired order!
Sorted DataFrame: column1 sorted in descending order.
Challenge: Try sorting by multiple columns, with one in ascending and the other in descending order.
7. Joining DataFrames (df.join)
Boilerplate Code:
df1.join(df2, on="column_name", how="inner")
Use Case: Join two DataFrames on a common column to combine data.
Goal: Merge data from two sources based on shared columns.
Sample Code:
# Perform an inner join on column1
df1.join(df2, on="column1", how="inner")
Before Example: The intern has two separate DataFrames but needs to merge them into one.
Two separate DataFrames.
After Example: With df.join(), the DataFrames are now combined into one!
Joined DataFrame: merged on column1.
Challenge: Try different join types like how="left" or how="full" (the outer join) to see how the output changes.
8. Pivoting Data (df.pivot)
Boilerplate Code:
df.pivot(on="pivot_column", index="index_column", values="value_column")
Use Case: Pivot your data to reshape it, turning unique values into columns.
Goal: Rearrange your DataFrame from long format to wide format.
Sample Code:
# Pivot the DataFrame: each unique value of pivot_column becomes a column
df.pivot(on="pivot_column", index="index_column", values="value_column")
Before Example: The intern has data in long format but wants to transform it into a more readable structure.
Long format DataFrame: stacked rows.
After Example: With df.pivot(), the data is reshaped into a more readable format!
Pivoted DataFrame: values turned into columns.
Challenge: Try applying different aggregation functions during pivoting (e.g., aggregate_function="sum" or "mean").
9. Lazy Evaluation (df.lazy)
Boilerplate Code:
df.lazy().select(...)
Use Case: Use lazy evaluation to defer execution until the result is requested, so Polars can optimize the whole query before running it.
Goal: Chain multiple operations without executing them immediately.
Sample Code:
# Build a lazy query; nothing runs until .collect() is called
df.lazy().filter(pl.col("column1") > 1).select("column2").collect()
Before Example: The intern runs every operation immediately, which slows things down on large datasets.
Eager evaluation: every step executed immediately.
After Example: With lazy evaluation, operations are deferred and executed in one go!
Lazy evaluation: operations executed only when collected.
Challenge: Try chaining multiple operations together and compare the run time with the eager version.
10. Exploratory Data Analysis (df.describe)
Boilerplate Code:
df.describe()
Use Case: Get a quick summary of your DataFrame for exploratory data analysis.
Goal: View statistics like count, mean, and standard deviation for numeric columns.
Sample Code:
# Get summary statistics
df.describe()
Before Example: The intern has a DataFrame but no quick overview of its statistics.
DataFrame: raw data, no summary.
After Example: With df.describe(), the intern gets a useful summary of the data!
DataFrame Summary: count, mean, std, min, max.
Challenge: Try getting descriptive statistics for specific columns only.
11. Renaming Columns (df.rename)
Boilerplate Code:
df.rename({"old_column": "new_column"})
Use Case: Rename columns in your DataFrame for clarity or consistency.
Goal: Change column names to something more descriptive or standardized.
Sample Code:
# Rename a column
df.rename({"column1": "renamed_column1"})
Before Example: The intern has unclear or inconsistent column names.
Columns: ["column1", "column2"]
After Example: With df.rename(), the columns have more descriptive names!
Renamed Columns: ["renamed_column1", "column2"]
Challenge: Try renaming multiple columns at once.
12. Handling Null Values (df.fill_null)
Boilerplate Code:
df.fill_null(value)
Use Case: Handle missing values by filling them with a default value.
Goal: Replace null or missing values with something meaningful to avoid issues in analysis.
Sample Code:
# Fill null values with a default value
df.fill_null(0)
Before Example: The intern's DataFrame has null values that can cause errors in calculations.
DataFrame: some rows with null values.
After Example: With df.fill_null(), the missing values are filled in with a default!
Filled DataFrame: null values replaced with 0.
Challenge: Try using different fill strategies like filling with the column mean or forward filling (strategy="forward").
13. Concatenating DataFrames (pl.concat)
Boilerplate Code:
pl.concat([df1, df2], how="vertical")
Use Case: Concatenate two or more DataFrames either vertically or horizontally.
Goal: Combine multiple DataFrames into one for easier analysis.
Sample Code:
# Concatenate two DataFrames vertically
pl.concat([df1, df2], how="vertical")
Before Example: The intern has separate DataFrames but wants to combine them.
Two separate DataFrames: df1, df2
After Example: With pl.concat(), the DataFrames are now combined!
Concatenated DataFrame: df1 and df2 merged.
Challenge: Try horizontal concatenation by setting how="horizontal".
14. Window Functions (df.select with .over)
Boilerplate Code:
df.select([pl.col("column").sum().over("group_column")])
Use Case: Apply window functions: compute an aggregate such as a sum or mean within each group and attach the result to every row of that group.
Goal: Perform calculations over a subset of rows (a window) without collapsing the DataFrame.
Sample Code:
# Sum column1 within each group defined by column2
df.select([pl.col("column1").sum().over("column2")])
Before Example: The intern needs to calculate a sum or average for rows within each group.
Grouped DataFrame: no summary within groups.
After Example: With window functions, the intern can compute per-group results alongside the original rows!
Windowed DataFrame: sum of column1 over groups in column2.
Challenge: Try using other window functions like .mean() or .max().
15. Date and Time Manipulation (pl.date_range, pl.col.dt)
Boilerplate Code:
pl.date_range(date(2023, 1, 1), date(2023, 1, 10), interval="1d", eager=True)
Use Case: Manipulate date and time data, such as creating date ranges or extracting parts of a date.
Goal: Work with date data to generate time-series data or extract specific parts like year or month.
Sample Code:
from datetime import date
# Create a date range (eager=True returns a Series rather than an expression)
pl.date_range(date(2023, 1, 1), date(2023, 1, 10), interval="1d", eager=True)
# Extract year from a date column into a new column
df.with_columns(pl.col("date_column").dt.year().alias("year"))
Before Example: The intern has date data but struggles with extracting specific date components.
Dates: full date format (YYYY-MM-DD).
After Example: With date manipulation, the intern can now extract or manipulate specific date components!
Extracted Year: [2023, 2023, ...]
Challenge: Try extracting other parts of the date like month, day, or week using .dt.
16. Cumulative Operations (df.select with .cum_sum)
Boilerplate Code:
df.select([pl.col("column").cum_sum()])
Use Case: Perform cumulative operations like cumulative sum or cumulative product.
Goal: Calculate cumulative values across rows to see how a value builds up over time.
Sample Code:
# Calculate cumulative sum for a column
df.select([pl.col("column1").cum_sum()])
Before Example: The intern needs to track the running total but only has individual values.
Data: individual values [10, 20, 30]
After Example: With cumulative operations, the intern gets a running total!
Cumulative Sum: [10, 30, 60]
Challenge: Try applying cumulative product (.cum_prod()) for a different calculation.
17. Melt / Unpivot (Wide to Long Format)
Boilerplate Code:
df.unpivot(on=["col1", "col2"], index="id_column")
Use Case: Unpivot (called melt in pandas and in older Polars releases) reshapes a DataFrame from wide to long format, the inverse of a pivot.
Goal: Transform your DataFrame by turning multiple columns into rows.
Sample Code:
# Unpivot the DataFrame: col1 and col2 become rows, id_column is kept as the identifier
df.unpivot(on=["col1", "col2"], index="id_column")
Before Example: The intern has data in wide format but needs to convert it to long format.
Wide Format: col1, col2 as columns.
After Example: With unpivot, the DataFrame is transformed into long format!
Long Format: col1, col2 as rows.
Challenge: Try unpivoting a DataFrame with more columns and using different index columns.
18. Pivot (Long to Wide Format)
Boilerplate Code:
df.pivot(on="pivot_column", index="index_column", values="value_column")
Use Case: Pivot DataFrames from long to wide format, turning unique values into columns.
Goal: Reshape your DataFrame by pivoting data for easier analysis.
Sample Code:
# Pivot the DataFrame
df.pivot(on="pivot_column", index="index_column", values="value_column")
Before Example: The intern has data in long format but needs to reshape it into wide format.
Long Format: rows with duplicated entries.
After Example: With pivot, the DataFrame is transformed into wide format!
Wide Format: values turned into columns.
Challenge: Try applying different aggregate functions during the pivot process, like sum or mean.
19. Exploding Columns (df.explode)
Boilerplate Code:
df.explode("list_column")
Use Case: Explode a list or array column into multiple rows, flattening nested data.
Goal: Convert a column of lists into individual rows for further analysis.
Sample Code:
# Explode a list column into separate rows
df.explode("list_column")
Before Example: The intern has a column with lists but can't easily analyze the data inside.
Data: [[1], [2, 3], [4]]
After Example: With explode(), each element in the list becomes its own row!
Exploded DataFrame: the elements 2 and 3 now sit in separate rows.
Challenge: Try exploding multiple list columns simultaneously (the lists must have matching lengths in each row).
20. Reversing Columns (df.select with .reverse)
Boilerplate Code:
df.select([pl.col("column_name").reverse()])
Use Case: Reverse the order of values in a column or DataFrame.
Goal: Flip the order of rows for better analysis or reporting.
Sample Code:
# Reverse the order of a column
df.select([pl.col("column1").reverse()])
Before Example: The intern's DataFrame is sorted, but they want to reverse the order.
DataFrame: ascending order.
After Example: With reverse(), the DataFrame is now in descending order!
Reversed DataFrame: rows flipped.
Challenge: Try reversing specific columns while leaving others unchanged.
Written by Anix Lynch