20 Polars Concepts with Before-and-After Examples

Anix Lynch
10 min read

1. Creating DataFrames (pl.DataFrame) πŸ—οΈ

Boilerplate Code:

import polars as pl

Use Case: Create a DataFrame to hold your data, similar to pandas. πŸ—οΈ

Goal: Store and manipulate data in a high-performance DataFrame structure. 🎯

Sample Code:

# Create a DataFrame
df = pl.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["A", "B", "C"]
})

# Now you have a DataFrame with two columns!

Before Example: The intern has raw data but no structure to manipulate it easily. πŸ€”

Data: [1, 2, 3], ["A", "B", "C"]

After Example: With pl.DataFrame(), the data is structured and ready to work with! πŸ—οΈ

DataFrame: 
shape: (3, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ column1 β”‚ column2 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚       1 β”‚ A       β”‚
β”‚       2 β”‚ B       β”‚
β”‚       3 β”‚ C       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Challenge: 🌟 Try creating a DataFrame with more columns and different data types.
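
Here's one way that could look, with made-up columns mixing numbers, strings, and booleans:

# A DataFrame mixing integers, floats, strings, and booleans
df_mixed = pl.DataFrame({
    "id": [1, 2, 3],
    "price": [9.99, 14.50, 3.25],
    "label": ["A", "B", "C"],
    "in_stock": [True, False, True]
})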


2. Selecting Columns (df.select) πŸ”Ž

Boilerplate Code:

df.select("column_name")

Use Case: Select specific columns from your DataFrame to work with or analyze. πŸ”Ž

Goal: Extract just the columns you need from the DataFrame. 🎯

Sample Code:

# Select a single column
df.select("column1")

# Select multiple columns
df.select(["column1", "column2"])

Before Example: The intern has a large DataFrame but only needs specific columns for the analysis. πŸ€”

DataFrame: all columns.

After Example: With df.select(), only the needed columns are extracted! πŸ”Ž

Selected Columns: ["column1"]

Challenge: 🌟 Try selecting columns that match a specific pattern, e.g. with a regex like pl.col("^column.*$") or the polars.selectors module.
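
A quick sketch of both approaches, reusing the sample columns (the selectors module ships with recent Polars releases):

import polars.selectors as cs

# Regex patterns in pl.col must start with ^ and end with $
df.select(pl.col("^column.*$"))

# The selectors module does the same thing more readably
df.select(cs.starts_with("column"))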


3. Filtering Rows (df.filter) πŸ”

Boilerplate Code:

df.filter(pl.col("column_name") > value)

Use Case: Filter rows based on a condition to narrow down your data. πŸ”

Goal: Extract only the rows that meet certain criteria. 🎯

Sample Code:

# Filter rows where column1 > 1
df.filter(pl.col("column1") > 1)

Before Example: The intern has a DataFrame with all the rows but only needs the subset that matches specific criteria. πŸ€”

DataFrame: all rows.

After Example: With df.filter(), only the rows that meet the condition are included! πŸ”

Filtered DataFrame: Rows where column1 > 1.

Challenge: 🌟 Try filtering using multiple conditions (e.g., column1 > 1 and column2 == "B").
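
One possible answer, reusing the sample columns:

# Combine conditions with & (and) / | (or); wrap each condition in parentheses
df.filter((pl.col("column1") > 1) & (pl.col("column2") == "B"))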


4. Adding Columns (df.with_columns) βž•

Boilerplate Code:

df.with_columns([(pl.col("existing_column") * 2).alias("new_column")])

Use Case: Add a new column based on existing columns in the DataFrame. βž•

Goal: Extend your DataFrame by creating new columns derived from existing data. 🎯

Sample Code:

# Add a new column that multiplies column1 by 2
df.with_columns([
    (pl.col("column1") * 2).alias("new_column")
])

Before Example: The intern needs to add a new column but doesn’t know how. πŸ€·β€β™‚οΈ

DataFrame: column1, column2

After Example: With df.with_columns(), the DataFrame now has an additional column! βž•

New Column: 'new_column' added.

Challenge: 🌟 Try adding a column that combines values from multiple existing columns.
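
Here's a sketch using pl.concat_str (the combined column name is just illustrative):

# Build a new column from two existing ones with pl.concat_str
df.with_columns(
    pl.concat_str([pl.col("column1"), pl.col("column2")], separator="_").alias("combined")
)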


5. Grouping Data (df.group_by) πŸ“Š

Boilerplate Code:

df.group_by("column_name").agg([pl.col("another_column").mean()])

Use Case: Group data by a column and apply aggregate functions like mean, sum, or count. πŸ“Š

Goal: Summarize data by grouping similar entries and applying calculations. 🎯

Sample Code:

# Group by column2 and calculate the mean of column1
# (older Polars releases spell this df.groupby)
df.group_by("column2").agg([
    pl.col("column1").mean()
])

Before Example: The intern has unsummarized data and wants to compute statistics for each group. πŸ€”

DataFrame: ungrouped data.

After Example: With df.group_by(), the data is grouped, and the mean is calculated for each group! πŸ“Š

Grouped DataFrame: mean of column1 by column2.

Challenge: 🌟 Try grouping by multiple columns and applying multiple aggregate functions.
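
A possible solution, assuming a hypothetical extra grouping column called column3:

# Group by two columns and compute several aggregates at once
# (column3 is an illustrative extra column)
df.group_by(["column2", "column3"]).agg([
    pl.col("column1").mean().alias("mean_col1"),
    pl.col("column1").sum().alias("sum_col1"),
    pl.col("column1").count().alias("n_rows"),
])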


6. Sorting Data (df.sort) πŸ”’

Boilerplate Code:

df.sort("column_name", reverse=True)

Use Case: Sort the DataFrame by one or more columns, either in ascending or descending order. πŸ”’

Goal: Rearrange your data for better visualization or analysis. 🎯

Sample Code:

# Sort by column1 in descending order
# (older Polars releases used reverse=True instead of descending=True)
df.sort("column1", descending=True)

Before Example: The data is unsorted, making it harder to analyze. πŸ€”

DataFrame: unsorted.

After Example: With df.sort(), the data is sorted in the desired order! πŸ”’

Sorted DataFrame: column1 sorted in descending order.

Challenge: 🌟 Try sorting by multiple columns, with one in ascending and the other in descending order.
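
One way to do it with the sample columns:

# column2 ascending, column1 descending within ties
df.sort(["column2", "column1"], descending=[False, True])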


7. Joining DataFrames (df.join) πŸ”—

Boilerplate Code:

df1.join(df2, on="column_name", how="inner")

Use Case: Join two DataFrames on a common column to combine data. πŸ”—

Goal: Merge data from two sources based on shared columns. 🎯

Sample Code:

# Perform an inner join on column1
df1.join(df2, on="column1", how="inner")

Before Example: The intern has two separate DataFrames but needs to merge them into one. πŸ€·β€β™‚οΈ

Two separate DataFrames.

After Example: With df.join(), the DataFrames are now combined into one! πŸ”—

Joined DataFrame: merged on column1.

Challenge: 🌟 Try different join types like how="left" or how="outer" to see how the output changes.
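
A sketch of both variants (the exact keyword for a full join depends on your Polars version):

# Left join keeps every row of df1, filling missing matches with nulls
df1.join(df2, on="column1", how="left")

# Full (outer) join keeps unmatched rows from both sides
# (recent Polars spells this how="full"; older releases use how="outer")
df1.join(df2, on="column1", how="full")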


8. Pivoting Data (df.pivot) πŸ”„

Boilerplate Code:

df.pivot(values="value_column", index="index_column", columns="pivot_column")

Use Case: Pivot your data to reshape it, turning unique values into columns. πŸ”„

Goal: Rearrange your DataFrame from long format to wide format. 🎯

Sample Code:

# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")

Before Example: The intern has data in long format but wants to transform it into a more readable structure. πŸ€”

Long format DataFrame: stacked rows.

After Example: With df.pivot(), the data is reshaped into a more readable format! πŸ”„

Pivoted DataFrame: values turned into columns.

Challenge: 🌟 Try applying different aggregation functions during pivoting (e.g., sum or mean).
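
Here's roughly what that could look like (recent Polars renames the columns= parameter to on=):

# Collapse duplicate index/column pairs with a sum while pivoting
df.pivot(
    on="pivot_column",
    index="index_column",
    values="value_column",
    aggregate_function="sum",
)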


9. Lazy Evaluation (df.lazy) πŸ’€

Boilerplate Code:

df.lazy().select(...)

Use Case: Use lazy evaluation to defer execution of operations until explicitly needed, improving performance. πŸ’€

Goal: Chain multiple operations without executing them immediately. 🎯

Sample Code:

# Use lazy evaluation
df.lazy().filter(pl.col("column1") > 1).select("column2").collect()

Before Example: Every operation runs immediately, slowing down performance with large datasets. 🐒

Eager evaluation: every step executed immediately.

After Example: With lazy evaluation, operations are deferred and executed in one go! πŸ’€

Lazy evaluation: operations executed only when collected.

Challenge: 🌟 Try chaining multiple operations together and see the performance improvement.
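
A sketch of a longer lazy chain; explain() lets you peek at the optimized plan before anything runs:

lazy_query = (
    df.lazy()
    .filter(pl.col("column1") > 1)
    .with_columns((pl.col("column1") * 10).alias("scaled"))
    .select(["column2", "scaled"])
)

print(lazy_query.explain())    # inspect the optimized query plan
result = lazy_query.collect()  # nothing runs until collect()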


10. Exploratory Data Analysis (df.describe) πŸ”

Boilerplate Code:

df.describe()

Use Case: Get a quick summary of your DataFrame for exploratory data analysis. πŸ”

Goal: View statistics like count, mean, and standard deviation for numeric columns. 🎯

Sample Code:

# Get summary statistics
df.describe()

Before Example: The intern has a DataFrame but no quick overview of its statistics. πŸ€”

DataFrame: raw data, no summary.

After Example: With df.describe(), the intern gets a useful summary of the data! πŸ”

DataFrame Summary: count, mean, std, min, max.

Challenge: 🌟 Try getting descriptive statistics for specific columns only.
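
One simple way, reusing column1 from the sample:

# Select the columns first, then describe only those
df.select(["column1"]).describe()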


11. Renaming Columns (df.rename) πŸ”€

Boilerplate Code:

df.rename({"old_column": "new_column"})

Use Case: Rename columns in your DataFrame for clarity or consistency. πŸ”€

Goal: Change column names to something more descriptive or standardized. 🎯

Sample Code:

# Rename a column
df.rename({"column1": "renamed_column1"})

Before Example: The intern has unclear or inconsistent column names. πŸ€”

Columns: ["column1", "column2"]

After Example: With df.rename(), the columns have more descriptive names! πŸ”€

Renamed Columns: ["renamed_column1", "column2"]

Challenge: 🌟 Try renaming multiple columns at once.
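
For example, with illustrative new names:

# One mapping renames as many columns as you like
df.rename({"column1": "id", "column2": "label"})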


12. Handling Null Values (df.fill_null) 🚫

Boilerplate Code:

df.fill_null("default_value")

Use Case: Handle missing values by filling them with a default value. 🚫

Goal: Replace null or missing values with something meaningful to avoid issues in analysis. 🎯

Sample Code:

# Fill null values with a default value
df.fill_null(0)

Before Example: DataFrame has null values that can cause errors in calculations. 😬

DataFrame: some rows with null values.

After Example: With df.fill_null(), the missing values are filled in with a default! 🚫

Filled DataFrame: null values replaced with 0.

Challenge: 🌟 Try using different fill strategies, like filling with the column mean or forward filling (strategy="forward").
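
Two possible strategies, sketched on column1:

# Fill nulls with the column mean
df.with_columns(pl.col("column1").fill_null(pl.col("column1").mean()))

# Forward fill: carry the last non-null value into the nulls that follow
df.with_columns(pl.col("column1").fill_null(strategy="forward"))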


13. Concatenating DataFrames (pl.concat) πŸ› οΈ

Boilerplate Code:

pl.concat([df1, df2], how="vertical")

Use Case: Concatenate two or more DataFrames either vertically or horizontally. πŸ› οΈ

Goal: Combine multiple DataFrames into one for easier analysis. 🎯

Sample Code:

# Concatenate two DataFrames vertically
pl.concat([df1, df2], how="vertical")

Before Example: The intern has separate DataFrames but wants to combine them. πŸ€·β€β™‚οΈ

Two separate DataFrames: df1, df2

After Example: With pl.concat(), the DataFrames are now combined! πŸ› οΈ

Concatenated DataFrame: df1 and df2 merged.

Challenge: 🌟 Try horizontal concatenation by setting how="horizontal".
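
Roughly like this:

# Paste columns side by side; the frames should have the same height
# and no overlapping column names
pl.concat([df1, df2], how="horizontal")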


14. Window Functions (df.select with .over) πŸ”„

Boilerplate Code:

df.select([pl.col("column").sum().over("group_column")])

Use Case: Apply window functions like moving averages or rolling sums within groups. πŸ”„

Goal: Perform calculations over a subset of rows based on a window. 🎯

Sample Code:

# Apply a rolling sum over groups
df.select([pl.col("column1").sum().over("column2")])

Before Example: The intern needs to calculate a sum or average for rows within each group. πŸ€”

Grouped DataFrame: no summary within groups.

After Example: With window functions, the intern can compute rolling calculations within groups! πŸ”„

Windowed DataFrame: sum of column1 over groups in column2.

Challenge: 🌟 Try using other window functions like .mean() or .max().
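
For instance, broadcasting a few aggregates over the groups from the sample:

# Per-group mean and max, broadcast back onto every row of each group
df.select([
    pl.col("column1").mean().over("column2").alias("group_mean"),
    pl.col("column1").max().over("column2").alias("group_max"),
])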


15. Date and Time Manipulation (pl.date_range, pl.col.dt) πŸ“…

Boilerplate Code:

pl.date_range(date(2023, 1, 1), date(2023, 1, 10), interval="1d", eager=True)

Use Case: Manipulate date and time data, such as creating date ranges or extracting parts of a date. πŸ“…

Goal: Work with date data to generate time-series data or extract specific parts like year or month. 🎯

Sample Code:

from datetime import date

# Create a date range as a Series (eager=True materializes it right away)
pl.date_range(date(2023, 1, 1), date(2023, 1, 10), interval="1d", eager=True)

# Extract year from a date column
df.with_columns(pl.col("date_column").dt.year())

Before Example: The intern has date data but struggles to extract specific date components. πŸ€”

Dates: full date format (YYYY-MM-DD).

After Example: With date manipulation, the intern can now extract or manipulate specific date components! πŸ“…

Extracted Year: [2023, 2023, ...]

Challenge: 🌟 Try extracting other parts of the date like month, day, or week using .dt.
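
A few more extractions, assuming the same illustrative date_column:

# Pull other components out of the date column
df.with_columns([
    pl.col("date_column").dt.month().alias("month"),
    pl.col("date_column").dt.day().alias("day"),
    pl.col("date_column").dt.week().alias("week"),
])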


16. Cumulative Operations (df.select with .cum_sum) βž•

Boilerplate Code:

df.select([pl.col("column").cumsum()])

Use Case: Perform cumulative operations like cumulative sum or cumulative product. βž•

Goal: Calculate cumulative values across rows to see how a value builds up over time. 🎯

Sample Code:

# Calculate cumulative sum for a column
# (older Polars releases spell this .cumsum)
df.select([pl.col("column1").cum_sum()])

Before Example: The intern needs to track a running total but only has individual values. πŸ€”

Data: individual values [10, 20, 30]

After Example: With cumulative operations, the intern gets a running total! βž•

Cumulative Sum: [10, 30, 60]

Challenge: 🌟 Try applying cumulative product (.cum_prod()) for a different calculation.
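
That could look like this:

# Running product instead of a running sum
# (older Polars releases spell this .cumprod)
df.select(pl.col("column1").cum_prod())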


17. Melt (Wide to Long Format) πŸ”„

Boilerplate Code:

df.melt(id_vars="id_column", value_vars=["col1", "col2"])

Use Case: Melt DataFrames from wide to long format, useful for pivoting data. πŸ”„

Goal: Transform your DataFrame by melting multiple columns into rows. 🎯

Sample Code:

# Melt the DataFrame
df.melt(id_vars="id_column", value_vars=["col1", "col2"])

Before Example: The intern has data in wide format but needs to convert it to long format. πŸ€”

Wide Format: col1, col2 as columns.

After Example: With melt, the DataFrame is transformed into long format! πŸ”„

Long Format: col1, col2 as rows.

Challenge: 🌟 Try melting a DataFrame with more columns and using different id_vars.
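
A sketch with extra illustrative columns (id2, col3):

# Keep two id columns fixed and melt three value columns into rows
# (recent Polars also offers df.unpivot for the same idea)
df.melt(id_vars=["id_column", "id2"], value_vars=["col1", "col2", "col3"])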


18. Pivot (Long to Wide Format) πŸ”„

Boilerplate Code:

df.pivot(values="value_column", index="index_column", columns="pivot_column")

Use Case: Pivot DataFrames from long to wide format, turning unique values into columns. πŸ”„

Goal: Reshape your DataFrame by pivoting data for easier analysis. 🎯

Sample Code:

# Pivot the DataFrame
df.pivot(values="value_column", index="index_column", columns="pivot_column")

Before Example: The intern has data in long format but needs to reshape it into wide format. πŸ€”

Long Format: rows with duplicated entries.

After Example: With pivot, the DataFrame is transformed into wide format! πŸ”„

Wide Format: values turned into columns.

Challenge: 🌟 Try applying different aggregate functions during the pivot process, like sum or mean.
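
Here's a tiny end-to-end sketch with made-up data (recent Polars renames the columns= parameter to on=):

# A tiny long-format frame: one row per (index, category) pair
long_df = pl.DataFrame({
    "index_column": [1, 1, 2, 2],
    "pivot_column": ["a", "b", "a", "b"],
    "value_column": [10, 20, 30, 40],
})

# "a" and "b" become their own columns; duplicates are averaged
long_df.pivot(
    on="pivot_column",
    index="index_column",
    values="value_column",
    aggregate_function="mean",
)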


19. Exploding Columns (df.explode) πŸ’₯

Boilerplate Code:

df.explode("list_column")

Use Case: Explode a list or array column into multiple rows, flattening nested data. πŸ’₯

Goal: Convert a column of lists into individual rows for further analysis. 🎯

Sample Code:

# Explode a list column into separate rows
df.explode("list_column")

Before Example: The intern has a column with lists but can’t easily analyze the data inside. πŸ€”

Data: list column [[1], [2, 3], [4]]

After Example: With explode(), each element in the list becomes its own row! πŸ’₯

Exploded DataFrame: each list element (1, 2, 3, 4) in its own row.

Challenge: 🌟 Try exploding multiple list columns simultaneously.
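
One way to try it, with two illustrative list columns:

# Explode two list columns at once; the lists in each row should have matching lengths
df_lists = pl.DataFrame({
    "tags": [["a", "b"], ["c"]],
    "scores": [[1, 2], [3]],
})
df_lists.explode(["tags", "scores"])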


20. Reversing Columns (df.select with .reverse) πŸ”„

Boilerplate Code:

df.select([pl.col("column_name").reverse()])

Use Case: Reverse the order of values in a column or DataFrame. πŸ”„

Goal: Flip the order of rows for better analysis or reporting. 🎯

Sample Code:

# Reverse the order of a column
df.select([pl.col("column1").reverse()])

Before Example: The intern’s DataFrame is sorted, but they want to reverse the order. πŸ€”

DataFrame: ascending order.

After Example: With reverse(), the DataFrame is now in descending order! πŸ”„

Reversed DataFrame: rows flipped.

Challenge: 🌟 Try reversing specific columns while leaving others unchanged.
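
For example:

# Reverse only column1; column2 keeps its original order
df.with_columns(pl.col("column1").reverse())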

